How to train Splunk to recognize a character set

Path Finder

Hello. My logs contain Simplified Chinese characters. After setting CHARSET = GB2312 in props.conf, some Chinese characters showed up correctly and some didn't. GB2312 is a fairly old encoding; GB13000 is the current standard and covers more characters than GB2312 does. I figure that if I can train Splunk to use GB13000 instead of GB2312, it may solve my problem. The admin manual (http://www.splunk.com/base/Documentation/latest/Admin/Configurecharactersetencoding) mentions that a sample character set specification file can be added to $SPLUNK_HOME/etc/ngram-models/ to train Splunk to recognize the character set. How do I create such a file? Where can I find more information on this topic?

Thanks.

Splunk Employee

Adding samples to ngram-models simply assists Splunk in guessing a CHARSET that we already support. It cannot be used to add support for a new charset. We have in-product support for GB18030, GB2312-80, and GBK in addition to GB2312.

Path Finder

Thank you, Stephen.
I changed props.conf to CHARSET = GB18030, and the problem was solved.
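
For anyone finding this later, the fix amounts to a stanza along these lines in props.conf. This is a minimal sketch assuming a hypothetical sourcetype name (my_chinese_logs); use your own sourcetype or source stanza, and place the setting on the instance that parses the data:

    # $SPLUNK_HOME/etc/system/local/props.conf
    # "my_chinese_logs" is an illustrative sourcetype name, not from the original thread
    [my_chinese_logs]
    CHARSET = GB18030

GB18030 is a superset of GB2312, so events that already indexed correctly under GB2312 should be unaffected; a restart (or reload) of the parsing instance is needed for the change to take effect on newly indexed data.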
