Splunk Search

Best practice for dealing with Unicode code points in Splunk?

smichalski
Explorer

Dear Splunkers,

I am dealing with logs in which special characters have been encoded as Unicode escape sequences (e.g. \u0301 instead of the combining acute accent ◌́). While Splunk is able to process and store such log data in general, searching for it gave me some headaches:

Although the raw events contain the encoded characters, Splunk decodes or converts the characters at some point, causing the search to return no results. For example: within an event search, I can search for the encoded string (here: \u0301) as part of a keyword or as a value of the field _raw (the backslash must be escaped, understandably). But ...

... when I check the automatically extracted fields under "Interesting Fields", SplunkWeb displays the decoded string (here: ◌́).
... when I search for the encoded OR the decoded string within an automatically extracted field, Splunk finds nothing.
... when I place an eval or where command in front of the search, the decoded string becomes searchable.
... when I perform an in-search field extraction, the encoded string becomes searchable.

Personally, I expected the automatically extracted field to be decoded, while the _raw field stays encoded. I tested different charsets for indexing, but couldn't find any differences. Therefore I wonder: what's the right way to deal with code points in Splunk?

My current solution for making the events searchable is based on defining a calculated field as itself, so that users can filter the log data just as they are used to in their event searches. It works, but I consider it more of a workaround than an actual solution.

Best regards, Sven


Help for setting up a test scenario

1. Create a JSON sample file with the following content:

(e.g. with CR LF line endings and ANSI encoding on Windows, or LF line endings and UTF-8 encoding on Linux)

{
  "description": "04 - Tempe\u0301rature de liquide",
  "timestamp": "1499871600"
}
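If you prefer to generate the sample file rather than type it, the following is a minimal Python sketch. It assumes that the escaped form in the logs matches what a standard JSON serializer emits for non-ASCII characters; the file name unicode_sample.json is a hypothetical placeholder.

```python
import json
import tempfile
from pathlib import Path

# Decoded form of the sample event: "e" followed by U+0301 (combining acute accent).
event = {
    "description": "04 - Tempe\u0301rature de liquide",
    "timestamp": "1499871600",
}

# ensure_ascii=True (the default) writes non-ASCII characters as \uXXXX escapes,
# which reproduces the escaped form shown in the sample above.
raw = json.dumps(event, ensure_ascii=True, indent=2)

# Hypothetical file name; any location you can upload from works.
sample_path = Path(tempfile.gettempdir()) / "unicode_sample.json"
with open(sample_path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write(raw + "\n")  # LF line endings, UTF-8 encoding

print(raw)
```

The file on disk then contains the six ASCII characters of the escape, not the accent itself.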

2. Upload the sample file to Splunk ("Add data" menu in SplunkWeb)

Use predefined sourcetype _json
Use Charset UTF-8 or ISO-8859-1, for example
Use index main


3. Verify the result by running the following event search

index="main" sourcetype="_json"


4. Search for the description value by appending the snippet that follows each bracketed label to your event search

[Snippet label: returns results Y/N]

[A: N] "*04 - Tempe\u0301rature*"
[B: Y] "*04 - Tempe\\u0301rature*"
[C: Y] _raw="*04 - Tempe\\u0301rature*"
[D: N] description="04 - Tempe\\u0301rature*"
[E: Y] | eval description=description | search description="04 - Température de liquide"
[F: Y] | where like(description,"04 - Température de liquide")
[G: Y] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Tempe\\u0301rature*"
[H: N] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Température*"
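To make the difference between the two forms in the snippets above concrete, here is a small Python sketch. It only illustrates how a generic JSON parser treats the escape; it is not a description of Splunk's internals, and whether Splunk normalizes either form is not established in this thread. The variable names are illustrative.

```python
import json
import unicodedata

# Raw event text: the escape is six ASCII characters (\, u, 0, 3, 0, 1).
raw = r'{"description": "04 - Tempe\u0301rature de liquide"}'

# A JSON parser turns the escape into the single code point U+0301.
decoded = json.loads(raw)["description"]

# The same word typed on a keyboard often uses the precomposed U+00E9 instead.
typed = "04 - Temp\u00e9rature de liquide"

print(r"\u0301" in raw)      # escaped form is present in the raw text
print("\u0301" in decoded)   # combining accent is present after decoding
print(decoded == typed)      # the two strings differ code point by code point
print(unicodedata.normalize("NFC", decoded) == typed)  # equal once normalized
```

The last two lines show that a decoded value and a typed search term can look identical on screen and still differ, which is worth ruling out when a search on the decoded string returns nothing.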

LordLeet
Path Finder

Hello smichalski,

I'm experiencing a similar issue. Did you find a way to solve it?

Regards,
LordLeet

bmacias84
Champion

If you have a unique sourcetype configured for that type of data, you can use a simple props.conf setting (CHARSET) to specify the character set encoding.

Please read the following:
http://docs.splunk.com/Documentation/Splunk/latest/Data/Configurecharactersetencoding

Cheers,

ddrillic
Ultra Champion

U+0301 is the Unicode code point (in hexadecimal) of the character Combining Acute Accent.
Source: Unicode Character “◌́” (U+0301)

The challenge here seems to be converting the Unicode escape to UTF-8...
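As a rough illustration of that conversion outside of Splunk, the Python sketch below replaces literal \uXXXX escapes with the characters they denote and then encodes the result as UTF-8. The helper name unescape_codepoints is hypothetical, and the sketch assumes four-digit escapes only (surrogate pairs are not handled).

```python
import re
import unicodedata


def unescape_codepoints(text: str) -> str:
    # Replace each literal backslash-u escape (four hex digits) with its character.
    # Assumption: escapes for surrogate pairs do not occur in the input.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


escaped = r"04 - Tempe\u0301rature de liquide"
converted = unescape_codepoints(escaped)

print(unicodedata.name("\u0301"))   # official Unicode name of the code point
print(converted.encode("utf-8"))    # UTF-8 bytes; U+0301 becomes 0xCC 0x81
```

In UTF-8 the single code point U+0301 occupies two bytes, so the converted string is shorter in characters than the escaped one but not plain ASCII anymore.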
