how to train Splunk to recognize a character set

alextsui — Sat, 11 Sep 2010 16:13:16 GMT

Hello. My logs contain Simplified Chinese characters. After setting CHARSET = GB2312 in props.conf, some Chinese characters showed up correctly and some didn't. The GB2312 encoding is a bit old. GB13000 is the current standard, and it recognizes more characters than GB2312 does. I figure that if I can train Splunk to use GB13000 instead of GB2312, it may solve my problem. The admin manual (http://www.splunk.com/base/Documentation/latest/Admin/Configurecharactersetencoding) mentions that a sample character set specification file can be added to $SPLUNK_HOME/etc/ngram-models/ to train Splunk to recognize the character set. How do I create such a file? Where can I find more information on this topic?

Thanks.

Re: how to train Splunk to recognize a character set

Stephen_Sorkin — Sun, 12 Sep 2010 04:35:08 GMT

Adding samples to ngram-models simply helps Splunk guess a CHARSET that we already support. It cannot be used to add support for a new charset. We have in-product support for GB18030, GB231280 and GBK in addition to GB2312.

Re: how to train Splunk to recognize a character set

alextsui — Tue, 14 Sep 2010 14:09:39 GMT

Thank you, Stephen.
I changed props.conf to CHARSET = GB18030, and the problem was solved.
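
For anyone who wants to see why only some characters broke under GB2312, the effect can be reproduced outside Splunk. This is a minimal sketch that assumes Python's built-in codecs are a reasonable stand-in for Splunk's decoder (they are not the same implementation). The helper name decodes_cleanly and the sample string are made up for illustration. The character 镕 is in GBK and GB18030 but not in the older GB2312 set.

```python
# Illustration only: Python's codecs stand in for Splunk's decoder here.

def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Return True if every byte sequence in data is valid in charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical log text; the middle character is outside GB2312.
sample = "朱镕基"
raw = sample.encode("gb18030")

# Try the same bytes against each charset name.
for charset in ("gb2312", "gbk", "gb18030"):
    print(charset, decodes_cleanly(raw, charset))
```

Text that uses only the GB2312 repertoire decodes the same way under all three names, which would explain why part of the log looked fine before the change.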