Splunk Search

Best practice for dealing with Unicode codepoints in Splunk?

Explorer

Dear Splunkers,

I'm facing logs where special characters have been encoded as Unicode codepoint escapes (e.g. \u0301 instead of the combining acute accent ◌́). While Splunk is able to process and store such log data in general, searching for it gave me some headache:

Although the raw events contain the encoded characters, Splunk decides to decode or convert the characters at some point, causing the search to return no results. For example: within an eventsearch, I can search for the encoded string (here: \u0301) as part of a keyword or a value of the field _raw (the backslash must be escaped, understandably). But ...

... if checking the automatically extracted fields under "Interesting Fields", SplunkWeb displays the decoded string (here: the combining accent, so the value reads Température).
... if searching for the encoded OR decoded string within an automatically extracted field, Splunk finds nothing.
... if an eval or where is placed first, the decoded string becomes searchable.
... if performing an in-search field extraction, the encoded string becomes searchable.

Personally, I expected the automatically extracted field to be decoded while the _raw field stays encoded. I tested different charsets for indexing but couldn't discover any differences. Therefore I wonder: what's the right way to deal with codepoints in Splunk?
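The two representations can be reproduced outside Splunk: in the raw event the escape sequence \u0301 is six literal characters, while the decoded field value contains a single combining character. A minimal Python sketch of that difference (illustrative only, not how Splunk decodes internally):

```python
# The raw event contains the six literal characters: backslash, u, 0, 3, 0, 1.
raw = r"04 - Tempe\u0301rature de liquide"

# Decoding the escape sequence yields the single combining character U+0301.
decoded = raw.encode("ascii").decode("unicode_escape")

print("\\u0301" in raw)       # True  -- the literal escape is searchable in raw text
print("\u0301" in decoded)    # True  -- the combining character exists only after decoding
print("\\u0301" in decoded)   # False -- the escape sequence is gone once decoded
```

This mirrors the search behaviour above: a search against _raw matches the escaped form, while a search against a decoded field value can only match the combining character.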

My current solution to make events searchable is based on defining a calculated field as itself, so that users can filter the log data just like they are used to in their eventsearches. It works, but I deem that more a workaround than an actual solution.
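The "calculated field as itself" workaround described above can be expressed in props.conf; a sketch, assuming the data uses a custom sourcetype (the name my_json is hypothetical):

```ini
# props.conf -- hypothetical custom sourcetype
[my_json]
# Calculated field defined as itself: re-evaluating the field at search
# time makes the decoded value searchable for users.
EVAL-description = description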

Best regards, Sven


Help for setting up a test scenario

1. Create a JSON sample file with the following content:

(e.g. with CR LF + ANSI on Windows, or LF + UTF-8 on Linux)

{
  "description": "04 - Tempe\u0301rature de liquide",
  "timestamp": "1499871600"
}
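Step 1 can also be scripted; a minimal sketch that writes the sample file with LF line endings and UTF-8 encoding (the filename sample.json is just an example):

```python
# Write the sample event exactly as shown above: the \u0301 escape is
# stored as literal text in the file, not as a decoded character.
sample = (
    '{\n'
    '  "description": "04 - Tempe\\u0301rature de liquide",\n'
    '  "timestamp": "1499871600"\n'
    '}\n'
)

with open("sample.json", "w", encoding="utf-8", newline="\n") as f:
    f.write(sample)
```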

2. Upload sample file to Splunk ("Add data" menu in SplunkWeb)

Use predefined sourcetype _json
Use Charset UTF-8 or ISO-8859-1, for example
Use index main


3. Verify the result by checking the eventsearch

index="main" sourcetype="_json"


4. Search for the description value by appending the snippets shown after the square brackets to your eventsearch

[Snippet Number: Returns Result Yes/No]

[A: N] "*04 - Tempe\u0301rature*"
[B: Y] "*04 - Tempe\\u0301rature*"
[C: Y] _raw="*04 - Tempe\\u0301rature*"
[D: N] description="04 - Tempe\\u0301rature*"
[E: Y] | eval description=description | search description="04 - Température de liquide"
[F: Y] | where like(description,"04 - Température de liquide")
[G: Y] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Tempe\\u0301rature*"
[H: N] | rex "\"description\":\s\"(?<description>[^\"]+)\"" | search description="04 - Température*"
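Snippets G and H illustrate that an in-search rex extraction operates on the raw, still-escaped text. The same regex can be run against the raw event in Python to show what the extracted value actually contains (illustrative only):

```python
import re

# _raw still contains the literal escape sequence at this point.
raw = '{"description": "04 - Tempe\\u0301rature de liquide", "timestamp": "1499871600"}'

# Same pattern as the rex in snippets G and H.
match = re.search(r'"description":\s"(?P<description>[^"]+)"', raw)
description = match.group("description")

# The extracted value keeps the literal escape (so snippet G matches) ...
print("\\u0301" in description)   # True
# ... and does not contain the combining character (so snippet H finds nothing).
print("\u0301" in description)    # False
```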


Re: Best practice for dealing with Unicode codepoints in Splunk?

Ultra Champion

U+0301 is the Unicode codepoint of the character Combining Acute Accent ("◌́").

The challenge here seems to be about converting the Unicode escapes to UTF-8 characters...
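A related wrinkle when comparing such values: "e" + U+0301 (two codepoints) and the precomposed "é" (U+00E9, one codepoint) look identical but do not compare equal. A sketch using Python's unicodedata to normalize the combining sequence (outside Splunk, purely to show the two Unicode forms):

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent) -- two codepoints for the é
decomposed = "Tempe\u0301rature"

# NFC normalization composes them into the single precomposed é (U+00E9)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                 # 12 codepoints
print(len(composed))                   # 11 codepoints
print(composed == "Temp\u00e9rature")  # True
```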


Re: Best practice for dealing with Unicode codepoints in Splunk?

Champion

If you have a unique sourcetype configured for that type of data, you can use a simple props.conf to specify the character encoding.

Please read the following:
http://docs.splunk.com/Documentation/Splunk/latest/Data/Configurecharactersetencoding
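A sketch of such a props.conf entry, following the documentation linked above (the sourcetype name my_json is hypothetical):

```ini
# props.conf -- assign a character set to a custom sourcetype at index time
[my_json]
CHARSET = UTF-8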

Cheers,


Re: Best practice for dealing with Unicode codepoints in Splunk?

Path Finder

Hello smichalski,

I'm experiencing a similar issue, did you find a way to solve it?

Regards,
LordLeet
