<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how to train Splunk to recognize a character set in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48426#M179501</link>
    <description>&lt;P&gt;Adding samples to ngram-models simply helps Splunk guess a CHARSET that we already support. It cannot be used to add support for a new charset. We have in-product support for GB18030, GB231280 and GBK in addition to GB2312.&lt;/P&gt;</description>
    <pubDate>Sun, 12 Sep 2010 04:35:08 GMT</pubDate>
    <dc:creator>Stephen_Sorkin</dc:creator>
    <dc:date>2010-09-12T04:35:08Z</dc:date>
    <item>
      <title>how to train Splunk to recognize a character set</title>
      <link>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48425#M179500</link>
      <description>&lt;P&gt;Hello.
My logs contain Simplified Chinese characters.
After I set CHARSET = GB2312 in props.conf, some Chinese characters displayed correctly and some did not.
GB2312 is an older encoding. GB13000 is the current standard, and it covers more characters than GB2312 does. I figure that if I can train Splunk to use GB13000 instead of GB2312, it may solve my problem.
The admin manual (http://www.splunk.com/base/Documentation/latest/Admin/Configurecharactersetencoding) mentions that a sample character set specification file can be added to $SPLUNK_HOME/etc/ngram-models/ to train Splunk to recognize the character set.
How do I create such a file? Where can I find more information on this topic?&lt;/P&gt;
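
&lt;P&gt;The symptom described above (some characters decode, others do not) can be reproduced outside Splunk. The following is a minimal Python sketch, not Splunk's implementation: it assumes Python's standard-library codecs gb2312, gbk and gb18030 are a reasonable stand-in for the corresponding CHARSET values, and the helper name decodes_cleanly and the sample string are hypothetical, chosen only for illustration.&lt;/P&gt;

```python
# Sketch only: Python's built-in codecs stand in for Splunk's CHARSET
# handling. That equivalence is an assumption, not confirmed behavior.

def decodes_cleanly(raw: bytes, charset: str) -> bool:
    """Return True if every byte sequence in raw is valid in charset."""
    try:
        raw.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Illustrative sample: the middle character lies outside the GB2312
# repertoire but is covered by GBK and GB18030.
sample = "朱镕基"
raw = sample.encode("gb18030")

for charset in ("gb2312", "gbk", "gb18030"):
    print(charset, decodes_cleanly(raw, charset))
```

&lt;P&gt;With this sample, the gb2312 codec should reject the bytes while gbk and gb18030 accept them, which matches the pattern of only some characters showing up correctly under CHARSET = GB2312.&lt;/P&gt;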

&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Sep 2010 16:13:16 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48425#M179500</guid>
      <dc:creator>alextsui</dc:creator>
      <dc:date>2010-09-11T16:13:16Z</dc:date>
    </item>
    <item>
      <title>Re: how to train Splunk to recognize a character set</title>
      <link>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48426#M179501</link>
      <description>&lt;P&gt;Adding samples to ngram-models simply helps Splunk guess a CHARSET that we already support. It cannot be used to add support for a new charset. We have in-product support for GB18030, GB231280 and GBK in addition to GB2312.&lt;/P&gt;</description>
      <pubDate>Sun, 12 Sep 2010 04:35:08 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48426#M179501</guid>
      <dc:creator>Stephen_Sorkin</dc:creator>
      <dc:date>2010-09-12T04:35:08Z</dc:date>
    </item>
    <item>
      <title>Re: how to train Splunk to recognize a character set</title>
      <link>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48427#M179502</link>
      <description>&lt;P&gt;Thank you, Stephen.&lt;BR /&gt;
I changed the setting in props.conf to CHARSET=GB18030, and that solved the problem.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Sep 2010 14:09:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/how-to-train-Splunk-to-recognize-a-character-set/m-p/48427#M179502</guid>
      <dc:creator>alextsui</dc:creator>
      <dc:date>2010-09-14T14:09:39Z</dc:date>
    </item>
  </channel>
</rss>

