Solved: How to check if a field contains unicode

Iris_Pi · ‎05-14-2025

Hello Everyone,

I want to check whether a field called "from_header_displayname" contains any non-ASCII (Unicode) characters.

Below is the raw event source. This example event contains the Unicode escape "\u0445":
"from_header_displayname": "'support@\u0445.comx.com'"

And this is what I see in the web console, where the escape has been rendered as a character that looks like "x" (note: it is not the Latin letter x, but a look-alike from another alphabet):
from_header_displayname: 'support@х.comx.com'

I tried the following search, but it fails with an error:
index=email | regex from_header_displayname="[\u0000-\uffff]"
Error in 'SearchOperator:regex': The regex '[\u0000-\uffff]' is invalid. Regex: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u.

Please advise what I should use in this case.

Thanks in advance.

Regards,
Iris

livehybrid · ‎05-14-2025

To check whether a field contains non-ASCII Unicode characters, you can use the regex command with a pattern that matches anything outside the ASCII range. If you want to filter or flag events, though, the match function is usually a better fit:

index=email 
| eval is_unicode = if(match(from_header_displayname, "[^\x00-\x7F]"), "true", "false")
| where is_unicode="true"

This search uses the match function to check whether from_header_displayname contains any character outside the ASCII range (\x00-\x7F). The is_unicode field is set to "true" if it does and "false" otherwise, and the where command then keeps only the "true" events.

Alternatively, you can filter the events directly with the where command and the match function:

index=email 
| where match(from_header_displayname, "[^\x00-\x7F]")
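
The same character class also works with the regex command from the original question, because \xhh escapes are valid PCRE2 syntax (only \u is rejected, as the error message says). A minimal sketch, assuming the index and field name from the question:

```spl
index=email
| regex from_header_displayname="[^\x00-\x7F]"
```

To target one specific code point instead of the whole non-ASCII range, PCRE2 uses the \x{hhhh} form, for example \x{0445} in place of \u0445. Whether that form is accepted may depend on how your Splunk version compiles patterns, so treat it as an untested assumption.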

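Here is a self-contained example you can run without any indexed data. It tests one value written with the \u0445 escape and one value containing the actual character х: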

| makeresults 
| eval from_header_displayname="support@\u0445.comx.com" 
| eval from_header_displayname_unicode="support@х.comx.com" 
| table from_header_displayname from_header_displayname_unicode 
| eval unicode_detected_raw=if(match(from_header_displayname,"[^\x00-\x7F]"),"Yes","No") 
| eval unicode_detected_unicode=if(match(from_header_displayname_unicode,"[^\x00-\x7F]"),"Yes","No")
| table from_header_displayname unicode_detected_raw from_header_displayname_unicode unicode_detected_unicode

Both of these approaches will help you identify events where the from_header_displayname field contains non-ASCII characters.


Iris_Pi · ‎05-14-2025

Thank you all for your replies! They helped!

bowesmana · ‎05-14-2025

Unicode includes the ASCII characters, so a range of 0000-ffff would match every character in the 16-bit range, ASCII included. To look for characters outside ASCII, or outside the 8-bit range, you could use either of these:

| eval hasUncode=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUncode=if(match(string, "[^\x00-\xff]"), "HAS-16 BIT CHARS", "8-BIT")

The first pattern negates the POSIX [:ascii:] character class, so it matches any character NOT in the ASCII range (0x00-0x7f). The second matches any character that does not fit in 8 bits, i.e. any code point above 0xff.

This example, which includes your lowercase Cyrillic х (code point 1024+69 = 1093, i.e. U+0445), demonstrates both checks:

| makeresults 
| eval string=printf("{\"from_header_displayname\": \"'support@%c.comx.com'\"}", 1024+69)
| eval hasUncode1=if(match(string, "[^[:ascii:]]"), "HAS-NON-ASCII", "ASCII")
| eval hasUncode2=if(match(string, "[^\x00-\xff]"), "HAS-16-BIT", "8 BIT")
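
If you also want to see which characters triggered the match, a rex with max_match=0 can collect them into a multivalue field. This is a sketch rather than a tested search; the field names non_ascii_chars and non_ascii_count are illustrative choices, not anything from the original data:

```spl
| makeresults
| eval from_header_displayname=printf("support@%c.comx.com", 1024+69)
| rex field=from_header_displayname max_match=0 "(?<non_ascii_chars>[^\x00-\x7F])"
| eval non_ascii_count=mvcount(non_ascii_chars)
| table from_header_displayname non_ascii_chars non_ascii_count
```

For events without any non-ASCII character, non_ascii_chars is not created, so mvcount returns null rather than 0.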

kiran_panchavat · ‎05-14-2025

@Iris_Pi

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!
