Splunk Search

How to check if a field contains unicode

Iris_Pi
Path Finder

Hello Everyone,

I want to check if a field called "from_header_displayname" contains any Unicode.

Below is the event source; this example event contains the Unicode escape "\u0445":
"from_header_displayname": "'support@\u0445.comx.com'"

And the following is what I see in the web console, where the Unicode escape has been rendered as "х" (note: it's not the Latin letter x, but a lookalike character from another alphabet):
from_header_displayname: 'support@х.comx.com'

I used the following search but no luck:
index=email | regex from_header_displayname="[\u0000-\uffff]"
Error in 'SearchOperator:regex': The regex '[\u0000-\uffff]' is invalid. Regex: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u.

Please advise what I should use in this case.

Thanks in advance.

Regards,
Iris


livehybrid
Super Champion

To check whether a field contains non-ASCII characters, you can use the regex command with a regular expression that matches anything outside the ASCII range. If you want to flag or filter events inside eval or where, the match function is usually more convenient.

index=email 
| eval is_unicode = if(match(from_header_displayname, "[^\x00-\x7F]"), "true", "false")
| where is_unicode="true"

 

This search uses the match function to check whether the from_header_displayname field contains any character outside the ASCII range (\x00-\x7F). If it does, is_unicode is set to "true"; otherwise it is set to "false", and the where clause keeps only the "true" events.

Alternatively, you can directly filter the events using the where command with the match function.

 

index=email 
| where match(from_header_displayname, "[^\x00-\x7F]")

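Here is a self-contained example that runs the same check on two strings, one holding the escape sequence as plain ASCII text and one holding the actual Cyrillic character: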

| makeresults 
| eval from_header_displayname="support@\u0445.comx.com" 
| eval from_header_displayname_unicode="support@х.comx.com" 
| table from_header_displayname from_header_displayname_unicode 
| eval unicode_detected_raw=if(match(from_header_displayname,"[^\x00-\x7F]"),"Yes","No") 
| eval unicode_detected_unicode=if(match(from_header_displayname_unicode,"[^\x00-\x7F]"),"Yes","No")
| table from_header_displayname unicode_detected_raw from_header_displayname_unicode unicode_detected_unicode

Both of these approaches will help you identify events where the from_header_displayname field contains non-ASCII characters.
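
The regex command mentioned at the start of this answer can do the same filtering without eval. A minimal sketch, assuming the same index and field name as above and reusing the non-ASCII character class already shown with match:

```
index=email
| regex from_header_displayname="[^\x00-\x7F]"
```

Unlike the eval version, this keeps only the matching events and does not add a flag field, so it suits pure filtering.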


Iris_Pi
Path Finder

Thank you all for your replies! They help!


bowesmana
SplunkTrust

Unicode includes the ASCII characters, so a range of 0000-ffff would match every 16-bit character, ASCII included. If you are looking for non-ASCII characters, or for characters beyond the 8-bit range, you could use either of these:

| eval hasUncode=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUncode=if(match(string, "[^\x00-\xff]"), "HAS-16 BIT CHARS", "8-BIT")

The first uses the POSIX ascii character class and matches any character NOT in the ASCII range (0x00-0x7f); the second matches any character outside the 8-bit range (0x00-0xff).

So this example, which includes your lowercase Cyrillic х (code point 1024+69 = 1093, i.e. U+0445), demonstrates both checks:

| makeresults 
| eval string=printf("{\"from_header_displayname\": \"'support@%c.comx.com'\"}", 1024+69)
| eval hasUncode1=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUncode2=if(match(string, "[^\x00-\xff]"), "HAS-16-BIT", "8 BIT")
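
The error in the original question arises because PCRE2 does not accept the \u escape; its syntax for a single code point is \x{hhhh}. The following is a minimal sketch rather than a tested search. It assumes Splunk evaluates the pattern in UTF mode (which the [^\x00-\xff] test above also relies on) and reuses the test string from the previous example. The field names hasCyrillicHa and hasCyrillicBlock are hypothetical, chosen only for illustration:

```
| makeresults 
| eval string=printf("{\"from_header_displayname\": \"'support@%c.comx.com'\"}", 1024+69)
| eval hasCyrillicHa=if(match(string, "\x{0445}"), "YES", "NO")
| eval hasCyrillicBlock=if(match(string, "[\x{0400}-\x{04FF}]"), "YES", "NO")
```

The first test looks for the exact character U+0445, and the second for anything in the Cyrillic block (U+0400 to U+04FF), which may be more useful when hunting for lookalike letters in sender names.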

 
