Does Splunk support the Japanese character encoding?

ddrillic
Ultra Champion

I wonder whether, and how, Splunk supports Japanese character encodings. Assume we have one stream of data encoded as UTF-8 and another stream of Japanese data. How would that work? I guess the first question is whether Japanese characters can be encoded as UTF-8...

melonman
Motivator

You can find a list of supported encodings in the doc below.

http://docs.splunk.com/Documentation/Splunk/7.0.2/Data/Configurecharactersetencoding

But if the Japanese data is already UTF-8 encoded before being ingested, you don't have to do anything about encoding. The configuration you can find in the doc above is for converting whatever Japanese encoding you have into UTF-8 so that the data gets indexed.
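As a quick check outside Splunk, the Python sketch below shows that the same Japanese character can be written in Shift-JIS, EUC-JP, or UTF-8, each with different bytes. The sample character and variable names are only illustrative, and this says nothing about how Splunk does the conversion internally.

```python
# Illustrative only: one Japanese character in three encodings.
text = "あ"

encodings = {codec: text.encode(codec) for codec in ("shift_jis", "euc_jp", "utf-8")}

for codec, encoded in encodings.items():
    # Each encoding uses different bytes, but all decode back to the same character.
    print(codec, encoded.hex(), encoded.decode(codec) == text)
```

So Japanese text can be represented in UTF-8 without loss; only the byte values differ between encodings.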

You may need to be a bit careful when searching Japanese because of the way Splunk creates indexes.

melonman
Motivator

I may be wrong about how non-UTF-8 data is stored. I respect the Splunk engineer's comment, and you should treat Splunk's answer as the definitive one on Splunk's internal behavior.

Here is the wording in the docs:

Splunk software attempts to apply UTF-8 encoding to your sources by default. If a source does not use UTF-8 encoding or is a non-ASCII file, Splunk software tries to convert data from the source to UTF-8 encoding unless you specify a character set to use by setting the CHARSET key in props.conf.

For your reference, here is some information about how source data is processed.
https://wiki.splunk.com/Community:HowIndexingWorks
(see : Detail Diagram - Standalone Splunk)

Here are the results of my quick test with Splunk 7.0.1 on macOS, using simple data in Shift-JIS.

# cat test.dat
1520460535 aiueo ����������
1520460545 kakikukeko ����������
1520460555 sasisuseso ����������
1520460565 tatituteto �����‚Ă�
1520460575 naninuneno �Ȃɂʂ˂�

# nkf --guess test.dat
Shift_JIS (CRLF)

# cat test.dat | nkf -w
1520460535 aiueo あいうえお
1520460545 kakikukeko かきくけこ
1520460555 sasisuseso さしすせそ
1520460565 tatituteto たちつてと
1520460575 naninuneno なにぬねの

test.dat is in Shift-JIS with CR/LF as linebreak.
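For readers without nkf, the conversion shown above can be approximated in Python. This is only a sketch: it rebuilds the first line of the sample data in memory as Shift-JIS bytes with a CRLF line break and converts it to UTF-8. The helper name `sjis_to_utf8` is made up for illustration.

```python
def sjis_to_utf8(raw: bytes) -> bytes:
    """Decode Shift-JIS bytes and re-encode them as UTF-8."""
    return raw.decode("shift_jis").encode("utf-8")

# First line of the sample data, as Shift-JIS bytes with a CRLF line break.
raw_line = b"1520460535 aiueo \x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8\r\n"

utf8_line = sjis_to_utf8(raw_line)
print(utf8_line.decode("utf-8").rstrip("\r\n"))
```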

There are two sets of results, so you can compare indexing with and without the CHARSET setting in props.conf:
1. Search result and lexicon without CHARSET=SHIFT-JIS
2. Search result and lexicon with CHARSET=SHIFT-JIS

1. Search result and lexicon without CHARSET=SHIFT-JIS

Search Result

# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis'
1520460575 naninuneno \x82Ȃɂʂ˂\xCC
1520460565 tatituteto \x82\xBD\x82\xBF\x82‚Ă\xC6
1520460555 sasisuseso \x82\xB3\x82\xB5\x82\xB7\x82\xB9\x82\xBB
1520460545 kakikukeko \x82\xA9\x82\xAB\x82\xAD\x82\xAF\x82\xB1
1520460535 aiueo \x82\xA0\x82\xA2\x82\xA4\x82\xA6\x82\xA8

Walklex Result

# ~/Repository/_workspace/splunk701/bin/walklex ./1520460575-1520460535-4499156301793146808.tsidx '*'
my needle: *
0 5  host::mhyugaji.local
1 5  source::/Users/mhyugaji/Desktop/test.dat
2 5  sourcetype::nobincheck
3 1 1520460535
4 1 1520460545
5 1 1520460555
6 1 1520460565
7 1 1520460575
8 1 \x82
9 1 \x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8
10 1 \x82\xa9\x82\xab\x82\xad\x82\xaf\x82\xb1
11 1 \x82\xb3\x82\xb5\x82\xb7\x82\xb9\x82\xbb
12 1 \x82\xbd\x82\xbf\x82
13 1 \xc6
14 1 \xcc
15 5 _indextime::1520462419
16 1 aiueo
17 5 date_hour::22
18 5 date_mday::7
19 1 date_minute::8
20 4 date_minute::9
21 5 date_month::march
22 1 date_second::15
23 1 date_second::25
24 1 date_second::35
25 1 date_second::5
26 1 date_second::55
27 5 date_wday::wednesday
28 5 date_year::2018
29 5 date_zone::0
30 5 host::mhyugaji.local
31 1 kakikukeko
32 5 linecount::1
33 1 naninuneno
34 1 punct::__\\
35 1 punct::__\\\\\\
36 3 punct::__\\\\\\\\\\
37 1 sasisuseso
38 5 source::/users/mhyugaji/desktop/test.dat
39 5 sourcetype::nobincheck
40 1 tatituteto
41 5 timeendpos::10
42 5 timestartpos::0
43 5 x82
44 1 xa0
45 1 xa2
46 1 xa4
47 1 xa6
48 1 xa8
49 1 xa9
50 1 xab
51 1 xad
52 1 xaf
53 1 xb1
54 1 xb3
55 1 xb5
56 1 xb7
57 1 xb9
58 1 xbb
59 1 xbd
60 1 xbf
61 1 xc6
62 1 xcc
63 1 ‚
64 1 Ă
65 1 Ȃ
66 1 ɂ
67 1 ʂ
68 1 ˂
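The mix of \x82 escapes and stray characters such as Ȃ in the results above is consistent with Shift-JIS bytes being read as if they were UTF-8. The Python sketch below reproduces that effect for なにぬねの. It only illustrates the byte-level cause and is an assumption about the mechanism, not a description of Splunk's actual code.

```python
# Shift-JIS bytes for "なにぬねの": 82 C8 82 C9 82 CA 82 CB 82 CC
sjis_bytes = "なにぬねの".encode("shift_jis")

# Read them as UTF-8, escaping every byte that is not valid UTF-8.
misread = sjis_bytes.decode("utf-8", errors="backslashreplace")
print(misread)

# With the correct charset the original text comes back.
print(sjis_bytes.decode("shift_jis"))
```

Byte pairs such as C8 82 happen to form valid two-byte UTF-8 sequences (here U+0202, Ȃ), which is why part of the line shows up as unrelated characters instead of escapes.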

2. Search result and lexicon with CHARSET=SHIFT-JIS

Search Result

# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis'
1520460575 naninuneno なにぬねの
1520460565 tatituteto たちつてと
1520460555 sasisuseso さしすせそ
1520460545 kakikukeko かきくけこ
1520460535 aiueo あいうえお

Walklex Result

# ~/Repository/_workspace/splunk701/bin/walklex ./1520460575-1520460535-6100555980810343566.tsidx '*'
my needle: *
0 5  host::mhyugaji.local
1 5  source::/Users/mhyugaji/Desktop/test.dat
2 5  sourcetype::shiftjis_data
3 1 1520460535
4 1 1520460545
5 1 1520460555
6 1 1520460565
7 1 1520460575
8 5 _indextime::1520462792
9 1 aiueo
10 5 date_hour::22
11 5 date_mday::7
12 1 date_minute::8
13 4 date_minute::9
14 5 date_month::march
15 1 date_second::15
16 1 date_second::25
17 1 date_second::35
18 1 date_second::5
19 1 date_second::55
20 5 date_wday::wednesday
21 5 date_year::2018
22 5 date_zone::0
23 5 host::mhyugaji.local
24 1 kakikukeko
25 5 linecount::1
26 1 naninuneno
27 5 punct::__
28 1 sasisuseso
29 5 source::/users/mhyugaji/desktop/test.dat
30 5 sourcetype::shiftjis_data
31 1 tatituteto
32 5 timeendpos::10
33 5 timestartpos::0
34 1 あ
35 1 い
36 1 う
37 1 え
38 1 お
39 1 か
40 1 き
41 1 く
42 1 け
43 1 こ
44 1 さ
45 1 し
46 1 す
47 1 せ
48 1 そ
49 1 た
50 1 ち
51 1 つ
52 1 て
53 1 と
54 1 な
55 1 に
56 1 ぬ
57 1 ね
58 1 の

Some searches on Japanese and alphabetic terms:

# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis a'
<no result returned>
# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis あ'
1520460535 aiueo あいうえお
# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis ue'
<no result returned>
# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis うえ'
1520460535 aiueo あいうえお
# ~/Repository/_workspace/splunk701/bin/splunk search 'index=shiftjis aiueo'
1520460535 aiueo あいうえお

As you may notice here, and in the walklex results, Japanese is treated as unigrams: each character is indexed as its own term.
If you run the search 'index=shiftjis あ' hoping to get only events that contain 'あ' as a standalone character, you will also get events containing words that include 'あ'.
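The search behavior above can be mimicked with a toy model in Python. This is a sketch under a simplifying assumption, namely that runs of ASCII letters and digits become one term while every non-ASCII character becomes its own term. The functions `tokenize` and `search` are hypothetical and are not Splunk's real segmentation logic.

```python
import re

def tokenize(text):
    """Toy segmentation: ASCII word runs are one term, each non-ASCII character is its own term."""
    return re.findall(r"[A-Za-z0-9_]+|[^\x00-\x7f]", text)

def search(query, events):
    """Return events whose term set contains every term of the query."""
    terms = set(tokenize(query))
    return [event for event in events if terms <= set(tokenize(event))]

events = [
    "1520460535 aiueo あいうえお",
    "1520460545 kakikukeko かきくけこ",
]

for query in ("a", "あ", "ue", "うえ", "aiueo"):
    print(repr(query), search(query, events))
```

In this model 'a' and 'ue' match nothing because only the whole word 'aiueo' is a term, while 'あ' and 'うえ' match because each kana is a term by itself. The model ignores term order and position.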

ManojAIG
New Member

Do we need to deploy props.conf on the UF for this configuration to take effect?
When I tested on a Splunk test server with CHARSET=SHIFT-JIS, the Japanese data came through correctly, but not in production, where props.conf is deployed only to the indexer.

melonman
Motivator

CHARSET in props.conf? Yes, you need the config on the UF, as shown in '4. Detail Diagram - UF/LWF to Indexer' on the following page. ( https://wiki.splunk.com/Community:HowIndexingWorks )

ddrillic
Ultra Champion

Thank you @melonman for the detailed information!!!

ddrillic
Ultra Champion

Thank you @melonman.

-- You may need to be a bit careful when searching Japanese because of the way Splunk creates indexes.

What do you mean by saying that?

And let's say we specify -

[spec]
CHARSET=EUC-JP

Or

[spec]
CHARSET=SHIFT-JIS

Internally, Splunk will store the data as UTF-8, right?

ddrillic
Ultra Champion

Our Sales Engineering team says that in such cases Splunk won't convert to UTF-8, but will rather keep the data as-is in binary form. Interesting how it works...

melonman
Motivator

Yeah, if they are talking about the raw data, it is probably kept as it is. For the lexicon and other things in the tsidx file, it may be a different story. You should also seek a definitive answer from a Splunk Sales Engineer. (Sorry, I am not a Splunk employee 🙂) But as you see, the above is what I found out the hard way.
