
Splunk cannot index and search Charset UTF-8 without BOM

New Member

I have files encoded as UTF-8 without a BOM (I checked in Notepad++), but Splunk cannot index or search the events in these files. Due to some limitations, I cannot control the encoding of the files. Does Splunk support UTF-8 without a BOM?


SplunkTrust

Hi kambiu,

Some background first: the UTF-8 BOM is a sequence of bytes (EF BB BF) at the start of a file that allows the reader to identify the file as a UTF-8 file.

Normally, the BOM is used to signal the endianness of the encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
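If you want to confirm what the files actually contain before changing anything in Splunk, a small check like the one below may help. It is only a sketch: the helper names `has_utf8_bom` and `is_valid_utf8` are made up for illustration, and it assumes you can read the raw bytes of a sample file.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To use it, read a sample file in binary mode (`open(path, "rb").read()`, where `path` is your own file) and pass the bytes to both helpers. If a file has no BOM but still decodes cleanly, it is plain UTF-8 without a signature.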

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

If you have trouble with this source, you can set CHARSET in props.conf for the input of this source.
Example: if the data comes in through a universal forwarder, set CHARSET in the props.conf on the universal forwarder.
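A stanza along these lines is roughly what I mean. Treat it as a sketch: the source path is a placeholder I made up, so replace it with the path or sourcetype of your own input, and restart the forwarder after the change.

```
# props.conf on the universal forwarder
# NOTE: the path below is hypothetical; use your own source or sourcetype stanza
[source::/path/to/your/logs/*.log]
CHARSET = UTF-8
```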

hope this helps ...

cheers, MuS


New Member

Thanks for your answer. I think you are right, and the BOM does matter for indexing in Splunk. I have found another way to solve the issue. Thanks 🙂


Engager

Hi,

How did you solve your problem? In my case, only part of the file is indexed and then indexing is canceled.
