rex sed strings different length

faguilar — Fri, 25 May 2018 09:00:32 GMT

Hi!

Can somebody please explain to me WTF is happening here?
My question is quite simple. I want to replace [áéíóú] with [aeiou] using one single rex, anywhere in the string, with a direct one-to-one mapping: á to a, é to e, and so on. For example, "José Ramón González" should become "Jose Ramon Gonzalez".
I already know how to do that with 5 regexes and a string replace, but I need to do it with one single rex (using sed mode is fine).
I found out that sed's transliteration command, y/àéíóú/aeiou/, does this perfectly (you can try sed y/àéíóú/aeiou/ in a Linux terminal).
However, the magic comes in Splunk. I have this Splunk regex:

| rex mode=sed field=name2 "y/á/a/"

And the result (in Splunk 6.3.1 and 7.1.1) is:

Error in 'rex' command: Failed to initialize sed. 'á' and 'a' are different length.

Ok... WTF!? However, I decided to try something like this:

| rex mode=sed field=name2 "y/á/aa/"

And the result is this one:

WTF!?? I think it is an encoding thing (UTF-8 to UTF-16), but I don't know how to solve it.
Can somebody please help me? Is there a way to explicitly tell Splunk which encoding I'm using and want to use in the regex? I have already defined the extraction as UTF-8. Why does this work perfectly in Linux, but not in Splunk?
As you can check here: http://docs.splunk.com/Documentation/Splunk/6.3.1/SearchReference/rex Splunk supports that y/ sed substitution.

Thank you

Re: rex sed strings different length

mayurr98 — Fri, 25 May 2018 09:50:06 GMT

It's working on my end. It must be a syntax problem.

| makeresults 
| eval data="àéíóú" 
| rex field=data mode=sed "s\àéíóú\aeiou\g"

Re: rex sed strings different length

FrankVl — Fri, 25 May 2018 11:30:37 GMT

Or a difference in character encoding settings of your splunk web / browser / os?

If I type à in a Notepad++ document set to UTF-8, it reports length 2, compared to length 1 for a. If I open a fresh Notepad++ window set to ANSI encoding and type the same character à, it shows as length 1. So I can imagine that in certain cases Splunk also interprets it as a 2-byte character and throws that mismatch error?
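The length difference described above can be checked outside Notepad++. This is a minimal Python sketch that only counts encoded bytes; it says nothing about which encoding Splunk uses internally, and the variable names are illustrative.

```python
import unicodedata

# NFC guarantees the accented letter is the single code point U+00E0.
accented = unicodedata.normalize("NFC", "à")
plain = "a"

# Byte length of each character in UTF-8 and in Latin-1 (an "ANSI"-style code page).
utf8_lengths = (len(accented.encode("utf-8")), len(plain.encode("utf-8")))
latin1_lengths = (len(accented.encode("latin-1")), len(plain.encode("latin-1")))

print(utf8_lengths)    # (2, 1)
print(latin1_lengths)  # (1, 1)
```

In UTF-8 the accented letter takes two bytes and the plain letter one, which matches the "different length" wording of the error if the comparison is made on bytes.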

Re: rex sed strings different length

faguilar — Mon, 28 May 2018 09:44:04 GMT

Hi @mayurr98,

Thank you for your answer, but maybe I expressed my problem in the wrong way.
It's not a syntax problem, and I do not need that simple substitution (which I already know how to do). That's why I said I used sed y/àéíóú/aeiou/, which works for my scenario in the Linux terminal.

I want to substitute those characters anywhere in the string, not in that exact order. Meaning that if I have the name

José González

that sed y/àéíóú/aeiou/ will substitute it perfectly: just á for a, é for e, and so on.
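To make the expected behavior concrete, here is a small Python sketch of character-by-character transliteration, which is what I understand sed's y/ to do on a UTF-8 terminal. It is only an illustration of the mapping, not something Splunk can run, and the helper name strip_accents is hypothetical.

```python
import unicodedata

# Normalize to NFC so every accented letter is a single code point.
SOURCE = unicodedata.normalize("NFC", "áéíóú")
TARGET = "aeiou"

# str.maketrans pairs characters by position: á->a, é->e, í->i, ó->o, ú->u.
TABLE = str.maketrans(SOURCE, TARGET)


def strip_accents(name: str) -> str:
    """Replace each accented vowel wherever it occurs in the string."""
    return unicodedata.normalize("NFC", name).translate(TABLE)


print(strip_accents("José Ramón González"))  # Jose Ramon Gonzalez
```

The order of the characters in the input does not matter; each one is looked up in the table independently.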

My problem here is that in Splunk, sed mode doesn't seem to work like the Linux sed command.

I will update my question to avoid any ambiguity.

Re: rex sed strings different length

faguilar — Mon, 28 May 2018 10:09:17 GMT

For my example search:

| makeresults | eval data="Juán Pérez Dís Tópú", data1=data | rex field=data1 mode=sed "y/áéíóú/aaeeiioouu/" | table data*

This is my output:

data --------------------------- data1
Juán Pérez Dís Tópú ----- Juaan Paerez Dais Taopau

And if I use the command | rex field=data1 mode=sed "y/áéíóú/aeiou/" the result is:

Error in 'rex' command: Failed to initialize sed. 'áéíóú' and 'aeiou' are different length.
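
One reading that is consistent with both results above is that Splunk's y/ compares and maps UTF-8 bytes rather than characters. That is an assumption based only on this output, not documented behavior. The Python sketch below simulates it; the function name bytewise_y and the "first mapping wins" rule for repeated bytes are hypothetical choices made to match the output.

```python
import unicodedata


def bytewise_y(text: str, src: str, dst: str) -> str:
    """Simulate a y/src/dst/ that works on UTF-8 bytes instead of characters."""
    src_b = unicodedata.normalize("NFC", src).encode("utf-8")
    dst_b = dst.encode("utf-8")
    if len(src_b) != len(dst_b):
        # Mirrors the "different length" failure: 'á' is 2 bytes, 'a' is 1.
        raise ValueError(f"'{src}' and '{dst}' are different length")
    table = {}
    for s, d in zip(src_b, dst_b):
        # Assumption: when a source byte repeats (0xC3 here), the first mapping wins.
        table.setdefault(s, d)
    data = unicodedata.normalize("NFC", text).encode("utf-8")
    mapped = bytes(table.get(b, b) for b in data)
    return mapped.decode("utf-8", errors="replace")


print(bytewise_y("Juán Pérez Dís Tópú", "áéíóú", "aaeeiioouu"))
# Juaan Paerez Dais Taopau
```

Each accented vowel is two bytes, a shared lead byte 0xC3 plus a distinct second byte, so the lead byte always becomes "a" and the second byte becomes the intended vowel. Calling bytewise_y with "áéíóú" and "aeiou" raises the length error, because the two sides are 10 bytes and 5 bytes.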

Re: rex sed strings different length

darrenfuller — Mon, 28 May 2018 17:55:11 GMT

Can't think of a way to do it in a single pass, but this works:

| makeresults | eval data="Jûán Pérëz Ä Žîs Çópú Ö'ñó", origdata=data
| rex field="data" mode=sed "s/[ÀÁÂÃÄ]/A/g"
| rex field="data" mode=sed "s/[Ç]/C/g"
| rex field="data" mode=sed "s/[ÈÉÊË]/E/g"
| rex field="data" mode=sed "s/[Ñ]/N/g"
| rex field="data" mode=sed "s/[ÒÓÔÕÖ]/O/g"
| rex field="data" mode=sed "s/[Š]/S/g"
| rex field="data" mode=sed "s/[ÙÚÛÜ]/U/g"
| rex field="data" mode=sed "s/[ÝŸ]/Y/g"
| rex field="data" mode=sed "s/[Ž]/Z/g"
| rex field="data" mode=sed "s/[àáâãäª]/a/g"
| rex field="data" mode=sed "s/[ç]/c/g"
| rex field="data" mode=sed "s/[èéêë]/e/g"
| rex field="data" mode=sed "s/[ìíîï]/i/g"
| rex field="data" mode=sed "s/[ñ]/n/g"
| rex field="data" mode=sed "s/[òóôöõº]/o/g"
| rex field="data" mode=sed "s/[ùúûü]/u/g"
| rex field="data" mode=sed "s/[ýÿ]/y/g"
| rex field="data" mode=sed "s/[š]/s/g"
| rex field="data" mode=sed "s/[ž]/z/g"

Output:

_time 2018-05-28 13:52:34
origdata Jûán Pérëz Ä Žîs Çópú Ö'ñó
data Juan Perez A Zis Copu O'no

Re: rex sed strings different length

faguilar — Tue, 29 May 2018 10:02:15 GMT

Thanks for the answer @darrenfuller, but I already know how to do it the way you suggest. I need to do it in a single line, using transliteration as in sed's y/ command.