Clarification needed on eval split() function

malvidin
Communicator

For the following search command, what is the expected output?

 

| makeresults
| eval text_string = "I:red_heart:Splunk"
| eval text_split = split(text_string, "")

 

 I would expect a text_split field that either contains an array like this:

text_split == [ 'I', '❤️', 'S', 'p', 'l', 'u', 'n', 'k' ] 

or, if split by byte (potentially dependent on the locale):

text_split == [ 'I', 'â', '', '¤', 'ï', '¸', '¿', 'S', 'p', 'l', 'u', 'n', 'k' ]

But not the current output, where the emoji comes back as six empty values:

text_split == [ 'I', '', '', '', '', '', '', 'S', 'p', 'l', 'u', 'n', 'k' ]

The use of characters that aren't fixed width also screws up search entry highlighting and text selection, but that isn't related to the split function.

 

| eval text_string = "I:red_heart:Splunk"  `comment("Try highlighting a word in this comment in the SPL Editor")`

 

 It looks like mvjoin() reverses the split(), but mvcombine fails.

(edit attempt failed to add the red heart back to the code samples; replaced with :red_heart:)


to4kawa
Ultra Champion

 

| makeresults
| eval text ="I❤️Splunk"
| rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

| makeresults
| eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk"
| rex field=text max_match=0 "(?<text_split>[\w\p{S}])"

That's very interesting. ❤️ is made up of more than one Unicode code point, and \p{S} matches only a single one.
How can I match a multi-code-point sequence (e.g. an emoji)?

 

 


malvidin
Communicator

I don't think you can match on multi-character emoji. Whether it separates by UTF-8 byte (split) or by Unicode character (rex), Splunk only has to look at whether the code point is valid.

There are entire projects out there that build the regex based on the current Unicode definition. It is possible that you could create an app that would periodically update the regex.

https://github.com/mathiasbynens/emoji-regex

You could recommend it at https://ideas.splunk.com/

 

 


to4kawa
Ultra Champion

 

| makeresults
| eval text ="I".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))."Splunk"
| rex field=text max_match=0 "(?<text_split>\w|\p{S}.)"

 

Hi @malvidin, I was able to match it.


malvidin
Communicator

Based on your response, I think this just gets more complicated depending on how many Emoji we want to keep together.

| makeresults 
| eval text ="I ".printf("%c",tonumber("2764",16)).printf("%c",tonumber("FE0F",16))." Splunk & " 
    + printf("%c",tonumber("1F469",16)) 
    + printf("%c",tonumber("1F3FB",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F468",16)) 
    + printf("%c",tonumber("1F3FD",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F467",16)) 
    + printf("%c",tonumber("1F3FF",16)) 
    + " & "
    + printf("%c",tonumber("1F441",16)) 
    + printf("%c",tonumber("FE0F",16)) 
    + printf("%c",tonumber("200D",16)) 
    + printf("%c",tonumber("1F5E8",16)) 
    + printf("%c",tonumber("FE0F",16)) 
| rex field=text max_match=0 "(?<text_split>\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?(?:\x{200D}\p{So}[\x{1F3FB}-\x{1F3FF}]?)|[\x{FE00}-\x{FE0F}])|\p{So}[\x{1F3FB}-\x{1F3FF}]|.)"
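To show the grouping rule that the regex above is approximating, here is a minimal Python sketch (not SPL, and not a full Unicode grapheme-cluster implementation). The helper names `is_extender` and `split_clusters` are made up for illustration. It assumes only three rules: variation selectors (U+FE00 to U+FE0F) and skin-tone modifiers (U+1F3FB to U+1F3FF) attach to the previous character, and a zero-width joiner (U+200D) glues the next character on.

```python
ZWJ = "\u200d"  # zero-width joiner


def is_extender(ch):
    """True for variation selectors and skin-tone modifiers (simplified rule)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0x1F3FB <= cp <= 0x1F3FF


def split_clusters(text):
    """Split text into display units, keeping simple emoji sequences together."""
    clusters = []
    join_next = False  # set when the previous code point was a ZWJ
    for ch in text:
        if clusters and (join_next or is_extender(ch) or ch == ZWJ):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


heart = "\u2764\ufe0f"
family = "\U0001F469\U0001F3FB\u200d\U0001F468\U0001F3FD\u200d\U0001F467\U0001F3FF"
eye = "\U0001F441\ufe0f\u200d\U0001F5E8\ufe0f"
text = "I " + heart + " Splunk & " + family + " & " + eye

clusters = split_clusters(text)
print(len(clusters))       # 18
print(family in clusters)  # True
```

Anything beyond these three rules (flags, keycaps, combining marks) is not handled, which is why projects like emoji-regex generate their patterns from the Unicode data files instead.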

 

bowesmana
SplunkTrust

Interesting find. It's not surprising that split() does not handle certain Unicode code points correctly; I imagine that's a fairly rare edge case when dealing with Splunked data ❤️

I guess both the split handling and the editor are bugs, as

| eval t=text_string
| eval tl=len(t)
| rex field=t mode=sed "s/❤️/_LuuuV_/"

the length of 9 correctly counts the two Unicode code points, and rex replaces the emoji correctly (less surprising).

You might expect split() to give the two Unicode code points as separate text_split values, the first with the black heart and the second with some other (unknown) character, but the fact that it converts them to 6 values indicates it's misinterpreting the UTF-8.
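As a cross-check outside Splunk, a minimal Python sketch of the same string gives the numbers discussed here. It is only an illustration of the encoding, not of how Splunk implements len() or split(); the string is written with explicit escapes so the two code points are unambiguous.

```python
import unicodedata

text = "I\u2764\ufe0fSplunk"  # "I", heart, variation selector, "Splunk"

code_points = list(text)
utf8_bytes = text.encode("utf-8")

print(len(code_points))  # 9
print(len(utf8_bytes))   # 13

# Describe the two code points that make up the red heart emoji.
for ch in text[1:3]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")), "bytes")
```

The heart and its selector take three UTF-8 bytes each, which accounts for the six extra values: 9 code points, but 13 bytes.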

 


malvidin
Communicator

Because mvjoin() reverses the operation, the back end data does not appear to be lost. And since the emoji is split into 6 values, it appears that the back end data is being handled as UTF-8.

The second Unicode character in the red heart emoji is variation selector 16 (U+FE0F).

Using rex splits by character, but split() splits by UTF-8 byte.

| rex field=text_string max_match=0 "(?P<text_split>.)" 
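The byte-level behaviour can be reproduced outside Splunk with a short Python sketch. This is an assumption about what is happening, not a description of Splunk internals: if each UTF-8 byte becomes its own value, the six emoji bytes are not valid text on their own, so they display as empty, yet joining the bytes back together restores the original string.

```python
text = "I\u2764\ufe0fSplunk"

# One value per UTF-8 byte, as split(text_string, "") appears to produce.
pieces = [bytes([b]) for b in text.encode("utf-8")]

# A lone byte from a multi-byte sequence cannot be decoded; dropping it leaves "".
displayed = [p.decode("utf-8", errors="ignore") for p in pieces]
print(displayed)

# Joining the raw bytes again (the mvjoin() step) loses nothing.
rejoined = b"".join(pieces).decode("utf-8")
print(rejoined == text)  # True
```

The `displayed` list has the same shape as the text_split output in the original question: 'I', six empty strings, then the letters of "Splunk".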

 
