rex sed strings different length

faguilar · ‎05-25-2018

Hi!

Can somebody please explain to me WTF is happening here?
My question is quite simple. I want to substitute [áéíóú] with [aeiou] using one single rex, anywhere in the string, with a direct mapping from á to a, é to e, and so on. For example, "José Ramón González" should become "Jose Ramon Gonzalez".
I already know how to do that with 5 regexes and a string replace. But I need to do it with one single rex (using sed mode is fine).
I found out that in sed, y/àéíóú/aeiou/ (transliteration) does exactly that (you can try sed y/àéíóú/aeiou/ in a Linux terminal).
However, the magic comes in Splunk. I have this Splunk regex:

| rex mode=sed field=name2 "y/á/a/"

And the result (in Splunk 6.3.1 and 7.1.1) is:

Error in 'rex' command: Failed to initialize sed. 'á' and 'a' are different length.

Ok... WTF!? However, I decided to try something like this:

| rex mode=sed field=name2 "y/á/aa/"

And the result is this one:

WTF!?? I think it is an encoding thing (UTF-8 to UTF-16), but I don't know how to solve this.
Can somebody please help me? Is there a way to explicitly tell Splunk which encoding I'm using and want to use in the regex? I have already defined the extraction as UTF-8. Why does this work perfectly in Linux, but not in Splunk??
As you can check here: http://docs.splunk.com/Documentation/Splunk/6.3.1/SearchReference/rex Splunk supports the sed y/ substitution.

Thank you

darrenfuller · ‎05-28-2018

Can't think of a way to do it in a single pass, but this works:

| makeresults | eval data="Jûán Pérëz Ä Žîs Çópú Ö'ñó", origdata=data
| rex field="data" mode=sed "s/[ÀÁÂÃÄ]/A/g"
| rex field="data" mode=sed "s/[Ç]/C/g"
| rex field="data" mode=sed "s/[ÈÉÊË]/E/g"
| rex field="data" mode=sed "s/[Ñ]/N/g"
| rex field="data" mode=sed "s/[ÒÓÔÕÖ]/O/g"
| rex field="data" mode=sed "s/[Š]/S/g"
| rex field="data" mode=sed "s/[ÙÚÛÜ]/U/g"
| rex field="data" mode=sed "s/[ÝŸ]/Y/g"
| rex field="data" mode=sed "s/[Ž]/Z/g"
| rex field="data" mode=sed "s/[àáâãäª]/a/g"
| rex field="data" mode=sed "s/[ç]/c/g"
| rex field="data" mode=sed "s/[èéêë]/e/g"
| rex field="data" mode=sed "s/[ìíîï]/i/g"
| rex field="data" mode=sed "s/[ñ]/n/g"
| rex field="data" mode=sed "s/[òóôöõº]/o/g"
| rex field="data" mode=sed "s/[ùúûü]/u/g"
| rex field="data" mode=sed "s/[ýÿ]/y/g"
| rex field="data" mode=sed "s/[š]/s/g"
| rex field="data" mode=sed "s/[ž]/z/g"

Output:

_time 2018-05-28 13:52:34
origdata Jûán Pérëz Ä Žîs Çópú Ö'ñó
data Juan Perez A Zis Copu O'no

faguilar · ‎05-29-2018

Thanks for the answer @darrenfuller, but I already know how to do it like you suggest. I need to do it in a single line, using the transliteration like in sed mode y/.

mayurr98 · ‎05-25-2018

It's working on my end. Must be a syntax problem.

| makeresults 
| eval data="àéíóú" 
| rex field=data mode=sed "s\àéíóú\aeiou\g"

FrankVl · ‎05-25-2018

Or a difference in character encoding settings of your splunk web / browser / os?

If I type à in a Notepad++ document set to UTF-8, it reports length 2, compared to length 1 for a. If I open a fresh Notepad++ window set to ANSI encoding and type the same character à, it shows length 1. So I can imagine that in certain cases Splunk interprets it as a 2-byte character as well and throws that mismatch error?
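The length difference described above can be checked outside Notepad++. The following is a minimal Python sketch, not anything Splunk runs: it only counts how many bytes a character occupies in UTF-8 versus Latin-1 (used here as a stand-in for the "ANSI" setting, which is an assumption). The helper name `encoded_lengths` is made up for illustration.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the character count and the byte count in two encodings."""
    return {
        "chars": len(ch),
        "utf8": len(ch.encode("utf-8")),
        # Latin-1 is an assumed stand-in for a single-byte "ANSI" code page.
        "latin1": len(ch.encode("latin-1")),
    }


# "a", "à" (U+00E0) and "á" (U+00E1), written as escapes to avoid
# depending on how this file itself is encoded.
for ch in ("a", "\u00e0", "\u00e1"):
    print(ascii(ch), encoded_lengths(ch))
```

Each accented letter is one character but two bytes in UTF-8, which would fit an error saying 'á' and 'a' are different lengths if the comparison is done on bytes.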

faguilar · ‎05-28-2018

Hi @mayurr98,

Thank you for your answer, but maybe I expressed my problem in the wrong way.
It's not a syntax problem and I do not need to make that simple substitution (which I already know how to do), that's why I said that I used the sed y/àéíóú/aeiou/ which works for my scenario on the linux terminal.

I want to substitute those characters anywhere in the string, not in that exact order. Meaning that if I have the name

José González

that sed y/àéíóú/aeiou/ will substitute it perfectly, just á for an a, é for an e... and so on.

My problem here is that in Splunk, the sed mode doesn't seem to work like the Linux sed command.

I will update my question to avoid any ambiguity
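To make the intended behavior concrete, here is a small Python sketch of character-by-character transliteration, the same one-to-one mapping that sed's y/ performs in a UTF-8 locale. It is only an illustration of the semantics, not a Splunk solution; the names `ACCENTED`, `PLAIN` and `transliterate` are made up for this example.

```python
# One-to-one character mapping: position i in ACCENTED maps to position i in PLAIN.
ACCENTED = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # áéíóú
PLAIN = "aeiou"
TABLE = str.maketrans(ACCENTED, PLAIN)


def transliterate(text: str) -> str:
    """Replace each accented vowel wherever it occurs; leave everything else alone."""
    return text.translate(TABLE)


# "José Ramón González", written with escapes to be independent of file encoding.
name = "Jos\u00e9 Ram\u00f3n Gonz\u00e1lez"
print(transliterate(name))  # Jose Ramon Gonzalez
```

The mapping works on whole characters, so the order and position of the accented letters in the input do not matter.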

faguilar · ‎05-28-2018

For my example search:

| makeresults | eval data="Juán Pérez Dís Tópú", data1=data | rex field=data1 mode=sed "y/áéíóú/aaeeiioouu/" | table data*

This is my output:

data --------------------------- data1
Juán Pérez Dís Tópú ----- Juaan Paerez Dais Taopau

And if I use the command | rex field=data1 mode=sed "y/áéíóú/aeiou/" the result is:

Error in 'rex' command: Failed to initialize sed. 'áéíóú' and 'aeiou' are different length.
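The data1 output in this post is what you would get if y/ were applied to UTF-8 bytes rather than to characters. The Python sketch below is a hypothetical model of that behavior, not Splunk's actual implementation. It assumes that both strings are compared and mapped byte by byte, and that the first mapping wins when a byte repeats. The name `bytewise_y` is invented for illustration.

```python
def bytewise_y(text: str, src: str, dst: str) -> str:
    """Hypothetical byte-oriented y/src/dst/ (a model, not Splunk's code)."""
    src_b = src.encode("utf-8")
    dst_b = dst.encode("utf-8")
    if len(src_b) != len(dst_b):
        # Mirrors the wording of the error quoted in this thread.
        raise ValueError(f"'{src}' and '{dst}' are different length.")
    table = {}
    for s, d in zip(src_b, dst_b):
        # Assumption: when a source byte repeats, the first mapping is kept.
        table.setdefault(s, d)
    mapped = bytes(table.get(b, b) for b in text.encode("utf-8"))
    return mapped.decode("utf-8", errors="replace")


ACCENTED = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # áéíóú: 5 characters, 10 bytes in UTF-8
TEXT = "Ju\u00e1n P\u00e9rez D\u00eds T\u00f3p\u00fa"  # Juán Pérez Dís Tópú

print(bytewise_y(TEXT, ACCENTED, "aaeeiioouu"))  # Juaan Paerez Dais Taopau

try:
    bytewise_y(TEXT, ACCENTED, "aeiou")  # 10 bytes vs 5 bytes
except ValueError as err:
    print("rejected:", ascii(str(err)))
```

Under these assumptions the shared lead byte 0xC3 maps to "a" and each second byte maps to its own vowel, so á, é, í, ó, ú come out as aa, ae, ai, ao, au. That reproduces both the doubled letters and the length error.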
