Splunk Search

How to check if a field contains unicode

Iris_Pi
Path Finder

Hello Everyone,

I want to check if a field called "from_header_displayname" contains any Unicode.

Below is the event source; this example event contains the Unicode escape "\u0445":
"from_header_displayname": "'support@\u0445.comx.com'

And the following is what I see in the web console; the Unicode character has been rendered as "х" (note: it is not the real letter x, but a look-alike letter from another alphabet)
from_header_displayname: 'support@х.comx.com'

I used the following search but no luck:
index=email | regex from_header_displayname="[\u0000-\uffff]"
Error in 'SearchOperator:regex': The regex '[\u0000-\uffff]' is invalid. Regex: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u.

Please advise what I should use in this case.

Thanks in advance.

Regards,
Iris

1 Solution

livehybrid
SplunkTrust

To check whether a field contains Unicode characters, you can use the regex command with a regular expression that matches non-ASCII characters; but if you want to filter events, you may be better off with something like match:

index=email 
| eval is_unicode = if(match(from_header_displayname, "[^\x00-\x7F]"), "true", "false")
| where is_unicode="true"


This search uses the match function to check if the from_header_displayname field contains any characters outside the ASCII range (\x00-\x7F). If it does, the is_unicode field is set to "true".
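The same check can be sketched outside Splunk. This is a minimal Python illustration of the `[^\x00-\x7F]` character class used in the SPL match() call above (not SPL code, just the equivalent regex logic):

```python
import re

# Mirrors the character class in the SPL match() call above:
# any character outside the 7-bit ASCII range 0x00-0x7F counts as "Unicode".
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def is_unicode(value: str) -> bool:
    """Return True if the string contains at least one non-ASCII character."""
    return NON_ASCII.search(value) is not None

print(is_unicode("support@\u0445.comx.com"))  # Cyrillic 'х' (U+0445) -> True
print(is_unicode("support@x.comx.com"))       # plain ASCII 'x'       -> False
```

The Cyrillic х (U+0445) is outside 0x00-0x7F, so the first call returns True, while the visually similar ASCII x does not match.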

Alternatively, you can directly filter the events using the where command with the match function.


index=email 
| where match(from_header_displayname, "[^\x00-\x7F]")

Here is another working example:

| makeresults 
| eval from_header_displayname="support@\u0445.comx.com" 
| eval from_header_displayname_unicode="support@х.comx.com" 
| table from_header_displayname from_header_displayname_unicode 
| eval unicode_detected_raw=if(match(from_header_displayname,"[^\x00-\x7F]"),"Yes","No") 
| eval unicode_detected_unicode=if(match(from_header_displayname_unicode,"[^\x00-\x7F]"),"Yes","No")
| table from_header_displayname unicode_detected_raw from_header_displayname_unicode unicode_detected_unicode

Both of these approaches will help you identify events where the from_header_displayname field contains Unicode characters.

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing


Iris_Pi
Path Finder

Thank you all for your replies! It helps!



bowesmana
SplunkTrust

Unicode includes the ASCII characters, so the range 0000-ffff would match all 16-bit characters, ASCII included. If you are looking for characters beyond ASCII (or beyond 8 bits), you could do either of these:

| eval hasUnicode=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUnicode=if(match(string, "[^\x00-\xff]"), "HAS-16-BIT-CHARS", "8-BIT")

The first uses the POSIX [:ascii:] character class and checks for any characters NOT in the ASCII range (0x00-0x7f); the second checks for any characters outside the 8-bit range (0x00-0xff).
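The difference between the two checks can be illustrated with a rough Python equivalent (a sketch of the same character-class logic, not SPL):

```python
import re

def classify(s: str) -> str:
    # Characters above U+00FF, i.e. outside the 8-bit range (e.g. Cyrillic х).
    if re.search(r"[^\x00-\xff]", s):
        return "HAS-16-BIT"
    # 8-bit but non-ASCII characters in 0x80-0xFF (e.g. 'é').
    if re.search(r"[^\x00-\x7f]", s):
        return "HAS-NON-ASCII"
    return "ASCII"

print(classify("support@\u0445.comx.com"))  # -> HAS-16-BIT
print(classify("caf\u00e9"))                # -> HAS-NON-ASCII
print(classify("abc"))                      # -> ASCII
```

So the [:ascii:] check fires for any non-ASCII character, while the \x00-\xff check fires only for characters beyond the 8-bit range.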

So this example, which includes your lower-case Cyrillic х, demonstrates both checks:

| makeresults 
| eval string=printf("{\"from_header_displayname\": \"'support@%c.comx.com'\"}", 1024+69)
| eval hasUnicode1=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUnicode2=if(match(string, "[^\x00-\xff]"), "HAS-16-BIT", "8-BIT")


kiran_panchavat
Champion

@Iris_Pi 

[screenshot: kiran_panchavat_0-1747208902837.png]


Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!