Best practice for dealing with Unicode codepoints in Splunk?

smichalski
Explorer

Dear Splunkers,

I'm dealing with logs in which special characters have been encoded as Unicode codepoint escapes (e.g. \u0301, the combining acute accent, instead of the character itself). While Splunk is able to process and store such log data in general, searching for it gave me some headache:

Although the raw events contain the encoded characters, Splunk decodes or converts them at some point, causing searches to return no results. For example: within an event search, I can search for the encoded string (here: \u0301) as part of a keyword or a value of the field _raw (the backslash must be escaped, understandably). But ...

... if I check the automatically extracted fields under "Interesting Fields", SplunkWeb displays the decoded string (here: é, i.e. e followed by the combining accent).
... if I search for the encoded OR decoded string within an automatically extracted field, Splunk finds nothing.
... if I place an eval or where first, the decoded string becomes searchable.
... if I perform an in-search field extraction, the encoded string becomes searchable.

Personally, I expected the automatically extracted field to be decoded while the _raw field stays encoded. I tested different charsets for indexing but couldn't find any difference. So I wonder: what's the right way to deal with codepoints in Splunk?

My current solution to make the events searchable is based on defining a calculated field as itself, so that users can filter the log data just as they are used to in their event searches. It works, but I consider it more a workaround than an actual solution.
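For reference, that workaround can be expressed in props.conf as a calculated field that simply re-evaluates itself. This is a minimal sketch; the sourcetype name below is a placeholder, not from the original post:

```ini
# props.conf -- sketch of the calculated-field workaround;
# "my_json_sourcetype" is a hypothetical sourcetype name
[my_json_sourcetype]
EVAL-description = description
```

Passing the field through eval forces Splunk to materialize the decoded value at search time, which is why it then matches decoded search terms.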

Best regards, Sven


Help for setting up a test scenario

1. Create a JSON sample file with the following content:

(e.g. with CR LF line endings + ANSI encoding on Windows, or LF + UTF-8 on Linux)

{
  "description": "04 - Tempe\u0301rature de liquide",
  "timestamp": "1499871600"
}
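As a side note, the decoding behavior can be reproduced outside Splunk: any JSON parser turns the escape \u0301 into a real combining accent, yielding a string that differs byte-for-byte from one typed with a precomposed é. A small Python sketch (not part of the original test scenario) illustrates this:

```python
import json
import unicodedata

# The sample event exactly as it appears in the file above
raw = '{"description": "04 - Tempe\\u0301rature de liquide", "timestamp": "1499871600"}'

event = json.loads(raw)
desc = event["description"]

# json.loads turned \u0301 into a real combining acute accent (NFD form),
# which is byte-different from the precomposed é (NFC form):
print("\u0301" in desc)                            # True
print(desc == "04 - Temp\u00e9rature de liquide")  # False: NFD vs NFC
print(unicodedata.normalize("NFC", desc)
      == "04 - Temp\u00e9rature de liquide")       # True after normalization
```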

2. Upload sample file to Splunk ("Add data" menu in SplunkWeb)

Use predefined sourcetype _json
Use Charset UTF-8 or ISO-8859-1, for example
Use index main


3. Verify the result by running the following event search

index="main" sourcetype="_json"


4. Search for the description value by appending the snippets after the square brackets to your event search

[Snippet Number: Returns Result Yes/No]

[A: N] "*04 - Tempe\u0301rature*"
[B: Y] "*04 - Tempe\\u0301rature*"
[C: Y] _raw="*04 - Tempe\\u0301rature*"
[D: N] description="04 - Tempe\\u0301rature*"
[E: Y] | eval description=description | search description="04 - Température de liquide"
[F: Y] | where like(description,"04 - Température de liquide")
[G: Y] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Tempe\\u0301rature*"
[H: N] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Température*"
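One plausible explanation for the mixed results above (speculating beyond what the post demonstrates) is Unicode normalization: the decoded value contains e followed by U+0301 (NFD form), while a search term typed on a normal keyboard contains the precomposed é (NFC form). The two strings render identically but compare unequal, as this Python sketch shows:

```python
import unicodedata

nfd = "04 - Tempe\u0301rature"  # e + combining acute accent, as decoded from \u0301
nfc = "04 - Temp\u00e9rature"   # precomposed é, what a keyboard normally produces

print(nfd == nfc)                                # False: different codepoint sequences
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: identical after normalization
```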

LordLeet
Path Finder

Hello smichalski,

I'm experiencing a similar issue, did you find a way to solve it?

Regards,
LordLeet


bmacias84
Champion

If you have a unique sourcetype configured for that type of data, you can use a simple props.conf stanza to specify the character encoding.

Please read the following:
http://docs.splunk.com/Documentation/Splunk/latest/Data/Configurecharactersetencoding
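A minimal sketch of such a stanza, assuming a dedicated sourcetype (the name is a placeholder):

```ini
# props.conf -- hypothetical stanza; "my_json_sourcetype" is a placeholder
[my_json_sourcetype]
CHARSET = UTF-8
```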

Cheers,


ddrillic
Ultra Champion

U+0301 is the Unicode codepoint of the character Combining Acute Accent ("◌́").

The challenge here seems to be about converting the unicode to utf-8...
