Getting Data In

International character code recognition

melonman
Motivator

Hi there,

I would like to know how to handle international character encodings in Splunk.

My environment is Japanese. Three character encodings are commonly used for Japanese text: SJIS (Shift JIS), EUC (EUC-JP), and Unicode.

My question is: how does Splunk detect the character encoding of an input, store the characters in the index, and handle them during search?

Thanks!

0 Karma
1 Solution

gkanapathy
Splunk Employee

Splunk tries to auto-detect the character set of an input, but it may get it wrong. You can specify the character set explicitly by setting CHARSET in props.conf on the input node (i.e., the same server where inputs.conf is configured). I would recommend setting it in a [source::...] stanza in that file.

http://www.splunk.com/base/Documentation/4.1.4/Admin/Configurecharactersetencoding
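
As a rough illustration of the recommendation above, a props.conf fragment might look like the following. This is only a sketch: the two log paths are hypothetical placeholders, and the CHARSET values should be checked against the list of supported encodings in the documentation for your Splunk version.

```
# props.conf on the input node (the server where inputs.conf lives)
# The paths below are hypothetical examples, not taken from the question.

# Files under this directory are assumed to be written in Shift JIS
[source::/var/log/app_sjis/...]
CHARSET = SHIFT-JIS

# Files under this directory are assumed to be written in EUC-JP
[source::/var/log/app_euc/...]
CHARSET = EUC-JP
```

With an explicit CHARSET on a stanza, the matching sources no longer depend on auto-detection, which is the point of the recommendation when the detection "may get it wrong".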

gkanapathy
Splunk Employee

No. As noted, it uses heuristics based on the training set. As you pointed out, log files rarely if ever carry encoding indicators, and any that do appear are unreliable.

0 Karma

melonman
Motivator

I will simplify the question. Does Splunk use the universal encoding detector to detect the character encoding of its inputs?

0 Karma

melonman
Motivator

Thank you for the charset-related information! What I want to know, though, is the mechanism by which Splunk detects the character encoding. In HTML and some other formats, the character encoding is usually specified in a header of the data itself. In most cases, however, IT data is plain text, and you usually don't know which encoding it was written in. The documentation says that Splunk does auto-detection, and I want to know how.

0 Karma

gkanapathy
Splunk Employee
0 Karma

melonman
Motivator

I understand the encoding configuration, but I can't find an explanation of how Splunk auto-detects the character set of an input. Is there any information about this?

0 Karma