topic Re: Clarification needed on eval split() function in Splunk Search

Clarification needed on eval split() function

malvidin — Fri, 03 Jul 2020 07:39:26 GMT

For the following search command, what is the expected output?

| makeresults | eval text_string = "I:red_heart:Splunk" | eval text_split = split(text_string, "")

I would expect a text_split field that either contains an array like this:

text_split == [ 'I', '❤️', 'S', 'p', 'l', 'u', 'n', 'k' ]

or if split by byte, potentially dependent on the locale:

text_split == [ 'I', 'â', '', '¤', 'ï', '¸', '¿', 'S', 'p', 'l', 'u', 'n', 'k' ]

But not the current output, where the data is:

text_split == [ 'I', '�', '�', '�', '�', '�', '�', 'S', 'p', 'l', 'u', 'n', 'k' ]
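One plausible explanation for the three outputs above can be sketched outside Splunk. The Python below is only an illustration, not Splunk's implementation: it assumes the string is stored as UTF-8 and that the heart is U+2764 followed by U+FE0F. It splits the text once by code point and once by byte, then decodes each lone byte two ways.

```python
# Illustration only: plain Python, not Splunk's actual split() implementation.
# Assumption: the heart is U+2764 followed by U+FE0F and the text is UTF-8.
text = "I\u2764\ufe0fSplunk"

# Split by Unicode code point: the heart and its variation selector are separate.
by_code_point = list(text)

# Split by UTF-8 byte.
utf8_bytes = text.encode("utf-8")

# Each lone byte read as Latin-1 gives mojibake like the second expected output.
by_byte_as_latin1 = [bytes([b]).decode("latin-1") for b in utf8_bytes]

# Each lone byte read as UTF-8 is invalid for the multibyte characters,
# so every such byte becomes U+FFFD, the replacement character.
by_byte_as_utf8 = [bytes([b]).decode("utf-8", errors="replace") for b in utf8_bytes]

print(len(by_code_point))               # 9
print(len(utf8_bytes))                  # 13
print(by_byte_as_utf8.count("\ufffd"))  # 6
```

The six replacement characters line up with the six values seen in the current output, which is consistent with a byte-wise split of UTF-8 data.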

The use of characters that aren't fixed width also screws up search entry highlighting and text selection, but that isn't related to the split function.

| eval text_string = "I:red_heart:Splunk" `comment("Try highlighting a word in this comment in the SPL Editor")`

It looks like mvjoin() reverses the split(), but mvcombine fails.

(edit attempt failed to add the red heart back to the code samples; replaced with :red_heart:)

Re: Clarification needed on eval split() function

bowesmana — Thu, 02 Jul 2020 23:13:56 GMT

Interesting find. It's not surprising that split() does not handle certain Unicode code points correctly; I imagine that's a fairly rare edge case when dealing with Splunked data ❤️

I guess both the split handling and the editor behaviour are bugs, because in

| eval t=text_string | eval tl=len(t) | rex field=t mode=sed "s/❤️/_LuuuV_/"

the length of 9 correctly counts the two Unicode code points, and rex replaces the heart correctly (less surprising).

You might expect split() to give the two Unicode code points as separate text_split values, the first with the black heart and the second with some other (unknown) character, but the fact that it converts them to 6 values indicates it's misinterpreting the UTF-8.

Re: Clarification needed on eval split() function

malvidin — Fri, 03 Jul 2020 07:59:25 GMT

Because mvjoin() reverses the operation, the back end data does not appear to be lost. And since it is split into 6 characters, it appears that the back end data is being parsed as UTF8.
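The round trip can be illustrated with a small Python sketch. It is an analogy under the assumption that split() cuts the stored UTF-8 between bytes; split_bytes and join_bytes are hypothetical helper names, not Splunk functions.

```python
# Illustration only: hypothetical helpers, not Splunk functions.
def split_bytes(text):
    """Cut a string into one-byte pieces of its UTF-8 encoding."""
    return [bytes([b]) for b in text.encode("utf-8")]


def join_bytes(pieces):
    """Concatenate the byte pieces and decode them as UTF-8 again."""
    return b"".join(pieces).decode("utf-8")


text = "I\u2764\ufe0fSplunk"
pieces = split_bytes(text)
rejoined = join_bytes(pieces)

print(len(pieces))       # 13
print(rejoined == text)  # True
```

No byte is dropped by the split, so joining the pieces restores the original string even though each piece on its own is not displayable.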

The second Unicode character in the red heart emoji is variation selector 16 (U+FE0F).
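The two code points can be inspected with Python's standard unicodedata module. This is a quick check done outside Splunk, assuming the heart is U+2764 followed by U+FE0F.

```python
import unicodedata

# Assumption: the red heart emoji is U+2764 followed by U+FE0F.
heart = "\u2764\ufe0f"

# Collect (code point, name, general category) for each character.
described = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
    for ch in heart
]

for row in described:
    print(row)
# ('U+2764', 'HEAVY BLACK HEART', 'So')
# ('U+FE0F', 'VARIATION SELECTOR-16', 'Mn')
```

Only the first code point is in a symbol category (So); the variation selector is a nonspacing mark (Mn), so a symbol-only character class would not pick it up.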

Using rex splits by character, but split() splits by UTF-8 byte.

| rex field=text_string max_match=0 "(?P<text_split>.)"

Re: Clarification needed on eval split() function

to4kawa — Sat, 04 Jul 2020 23:58:47 GMT

| makeresults | eval text ="I❤️Splunk" | rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

| makeresults | eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk" | rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

That's very interesting. ❤️ is multibyte. \p{S} matches a single Unicode character.
How can I match a multibyte Unicode sequence (e.g. an emoji)?

Re: Clarification needed on eval split() function

malvidin — Tue, 07 Jul 2020 16:02:06 GMT

I don't think you can match on multiple-character emoji. When separating by UTF-8 byte (split) or by Unicode character (rex), Splunk only has to look at whether the code point is valid.

There are entire projects out there that build the regex based on the current Unicode definition. It is possible that you could create an app that would periodically update the regex.

https://github.com/mathiasbynens/emoji-regex

You could recommend it at https://ideas.splunk.com/

Re: Clarification needed on eval split() function

to4kawa — Wed, 08 Jul 2020 09:19:37 GMT

| makeresults | eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk" | rex field=text max_match=0 "(?<text_split>\w|\p{S}.)"

Hi @malvidin, I could.

Re: Clarification needed on eval split() function

malvidin — Thu, 09 Jul 2020 18:09:44 GMT

Based on your response, I think this just gets more complicated depending on how many Emoji we want to keep together.

| makeresults | eval text ="I ".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))." Splunk & " + printf("%c",tonumber("1F469",16)) + printf("%c",tonumber("1F3FB",16)) + printf("%c",tonumber("200D",16)) + printf("%c",tonumber("1F468",16)) + printf("%c",tonumber("1F3FD",16)) + printf("%c",tonumber("200D",16)) + printf("%c",tonumber("1F467",16)) + printf("%c",tonumber("1F3FF",16)) + " & " + printf("%c",tonumber("1F441",16)) + printf("%c",tonumber("FE0F",16)) + printf("%c",tonumber("200D",16)) + printf("%c",tonumber("1F5E8",16)) + printf("%c",tonumber("FE0F",16)) | rex field=text max_match=0 "(?<text_split>\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?)|[\x{FE00}-\x{FE0F}])|\p{So}[\x{1F3FB}-\x{1F3FF}]|.)"
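The same idea can be sketched in Python to show how the grouping rules pile up. This is a simplified heuristic, not a full grapheme cluster segmentation as defined in Unicode UAX #29, and split_emoji_aware and is_extender are hypothetical names. It keeps a symbol (category So) together with any following variation selectors or skin tone modifiers, and continues across a zero width joiner when another symbol follows.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_extender(ch):
    """True for variation selectors (U+FE00-U+FE0F) and skin tone modifiers (U+1F3FB-U+1F3FF)."""
    return "\ufe00" <= ch <= "\ufe0f" or "\U0001f3fb" <= ch <= "\U0001f3ff"


def split_emoji_aware(text):
    """Split into code points, keeping simple emoji sequences together (heuristic)."""
    pieces = []
    i = 0
    while i < len(text):
        j = i + 1
        if unicodedata.category(text[i]) == "So":
            while True:
                # Absorb variation selectors and skin tone modifiers.
                while j < len(text) and is_extender(text[j]):
                    j += 1
                # A ZWJ followed by another symbol continues the sequence.
                if (j + 1 < len(text) and text[j] == ZWJ
                        and unicodedata.category(text[j + 1]) == "So"):
                    j += 2
                else:
                    break
        pieces.append(text[i:j])
        i = j
    return pieces


heart_text = "I\u2764\ufe0fSplunk"
family = "\U0001F469\U0001F3FB\u200D\U0001F468\U0001F3FD\u200D\U0001F467\U0001F3FF"
eye = "\U0001F441\ufe0f\u200d\U0001F5E8\ufe0f"

print(len(split_emoji_aware(heart_text)))             # 8
print(split_emoji_aware(family + " & " + eye)[0] == family)  # True
```

Unlike a fixed regex with a set number of nested groups, the loop follows ZWJ chains of any length, but it still ignores flags, keycaps, and other sequence types.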