Clarification needed on eval split() function

malvidin · ‎07-02-2020

For the following search command, what is the expected output?

| makeresults
| eval text_string = "I:red_heart:Splunk"
| eval text_split = split(text_string, "")

I would expect a text_split field that either contains an array like this:

text_split == [ 'I', '❤️', 'S', 'p', 'l', 'u', 'n', 'k' ]

or if split by byte, potentially dependent on the locale:

text_split == [ 'I', 'â', '', '¤', 'ï', '¸', '¿', 'S', 'p', 'l', 'u', 'n', 'k' ]

But not the current output, where the emoji's data is shown as replacement characters:

text_split == [ 'I', '�', '�', '�', '�', '�', '�', 'S', 'p', 'l', 'u', 'n', 'k' ]

The use of characters that aren't fixed width also screws up search entry highlighting and text selection, but that isn't related to the split function.

| eval text_string = "I:red_heart:Splunk"  `comment("Try highlighting a word in this comment in the SPL Editor")`

It looks like mvjoin() reverses the split(), but mvcombine fails.

(edit attempt failed to add the red heart back to the code samples; replaced with :red_heart:)

to4kawa · ‎07-04-2020

| makeresults
| eval text ="I❤️Splunk"
| rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

| makeresults
| eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk"
| rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

That's very interesting. ❤️ is multibyte, and \p{S} matches only a single Unicode code point.
How can I match a multi-code-point Unicode sequence (e.g. an emoji)?

malvidin · ‎07-07-2020

I don't think you can match on multi-character emoji. Whether separating by UTF-8 byte (split) or by Unicode character (rex), Splunk only has to look at whether the code point is valid.

There are entire projects out there that build the regex based on the current Unicode definition. It is possible that you could create an app that would periodically update.

https://github.com/mathiasbynens/emoji-regex

You could recommend it at https://ideas.splunk.com/

to4kawa · ‎07-08-2020

| makeresults
| eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk"
| rex field=text max_match=0 "(?<text_split>\w|\p{S}.)"

Hi @malvidin I could.

malvidin · ‎07-09-2020

Based on your response, I think this just gets more complicated depending on how many Emoji we want to keep together.

| makeresults 
| eval text ="I ".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))." Splunk & " 
    + printf("%c",tonumber("1F469",16)) 
    + printf("%c",tonumber("1F3FB",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F468",16)) 
    + printf("%c",tonumber("1F3FD",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F467",16)) 
    + printf("%c",tonumber("1F3FF",16)) 
    + " & "
    + printf("%c",tonumber("1F441",16)) 
    + printf("%c",tonumber("FE0F",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F5E8",16)) 
    + printf("%c",tonumber("FE0F",16)) 
| rex field=text max_match=0 "(?<text_split>\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?)|[\x{FE00}-\x{FE0F}])|\p{So}[\x{1F3FB}-\x{1F3FF}]|.)"
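For comparison, the grouping rule that regex is reaching for can be sketched outside Splunk. This is a minimal Python sketch, not SPL, and the helper names (`is_extender`, `split_clusters`) are made up for illustration. It assumes a cluster is a base code point followed by any number of skin-tone modifiers (U+1F3FB to U+1F3FF), variation selectors (U+FE00 to U+FE0F), or a zero-width joiner (U+200D) plus the code point it joins. Unlike the regex, it does not check \p{So}, and it is not full Unicode grapheme segmentation.

```python
from typing import List

ZWJ = "\u200d"  # zero-width joiner

def is_extender(ch: str) -> bool:
    """True for skin-tone modifiers and variation selectors."""
    cp = ord(ch)
    return 0x1F3FB <= cp <= 0x1F3FF or 0xFE00 <= cp <= 0xFE0F

def split_clusters(text: str) -> List[str]:
    """Split text into code points, keeping simple emoji sequences together."""
    clusters = []
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        while j < n:
            if is_extender(text[j]):
                j += 1          # modifier or variation selector stays attached
            elif text[j] == ZWJ and j + 1 < n:
                j += 2          # keep the joiner and the code point it joins
            else:
                break
        clusters.append(text[i:j])
        i = j
    return clusters

# Same test string as the search above
family = "\U0001F469\U0001F3FB\u200d\U0001F468\U0001F3FD\u200d\U0001F467\U0001F3FF"
eye = "\U0001F441\ufe0f\u200d\U0001F5E8\ufe0f"
text = "I \u2764\ufe0f Splunk & " + family + " & " + eye
clusters = split_clusters(text)
```

With this input the heart, the family sequence, and the eye-in-speech-bubble sequence each come back as one element, and every other character comes back on its own.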

bowesmana · ‎07-02-2020

Interesting find - not surprising that split does not work with certain Unicode code points correctly, I imagine that's a fairly rare edge case when dealing with Splunked data ❤️

I guess both the split handling and the editor are bugs, as

| eval t=text_string
| eval tl=len(t)
| rex field=t mode=sed "s/❤️/_LuuuV_/"

both len() and rex handle the emoji correctly: the length of 9 counts its two Unicode code points, and rex replaces it as expected (less surprising).

You might expect split() to return the two Unicode code points as separate text_split values, the first being the black heart and the second some other (unknown) character. The fact that it produces 6 values instead indicates that it is misinterpreting the UTF-8.

malvidin · ‎07-03-2020

Because mvjoin() reverses the operation, the back-end data does not appear to be lost. And since the emoji is split into 6 values, it appears that the back-end data is being parsed as UTF-8.
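The byte arithmetic can be checked outside Splunk. Here is a minimal Python sketch (not SPL) that assumes split() cuts the string at every UTF-8 byte, which is a guess about its internals rather than documented behavior. Each byte of the emoji is invalid UTF-8 on its own, so it decodes to U+FFFD, while joining the raw bytes back together restores the original string.

```python
text = "I\u2764\ufe0fSplunk"   # "I", heart, variation selector 16, "Splunk"
raw = text.encode("utf-8")     # the emoji takes 3 + 3 = 6 bytes

# Decode every byte on its own; lone bytes of a multibyte sequence
# are invalid and become the replacement character U+FFFD.
per_byte = [bytes([b]).decode("utf-8", errors="replace") for b in raw]

# Joining the undecoded bytes loses nothing, similar to what mvjoin() shows.
rejoined = b"".join(bytes([b]) for b in raw).decode("utf-8")
```

Here `per_byte` holds 13 values, six of them U+FFFD, and `rejoined` equals `text`.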

The second Unicode character in the red heart emoji is variation selector 16 (U+FE0F).

Using rex splits by character, but split() splits by UTF-8 byte.

| rex field=text_string max_match=0 "(?P<text_split>.)"
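As a cross-check on the per-character view, here is a short Python sketch (not SPL) that lists the code points of the same string. It only assumes the heart in the sample is U+2764 followed by U+FE0F, as described above.

```python
import unicodedata

text = "I\u2764\ufe0fSplunk"

# Iterating a Python str yields one item per Unicode code point.
code_points = list(text)

# Look up the official names of the two code points that make up the emoji.
heart_name = unicodedata.name(code_points[1])
selector_name = unicodedata.name(code_points[2])
```

The list has 9 entries, matching the len() result reported earlier in the thread, and the two names are HEAVY BLACK HEART and VARIATION SELECTOR-16.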
