
How do I find out what encoding the strange characters in my logs are using?

Path Finder

Hi,

I have a sample log below. I tried to upload this data and it shows the following preview. Is it possible to display the log file correctly? This is a log file sent to me by someone else.

(Screenshot of the data preview; image not available.)

0 Karma

SplunkTrust

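Basically, that white question mark in a black diamond is the Unicode replacement character (U+FFFD). It tells you that the underlying bytes could not be decoded as a valid character in the encoding that was assumed.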

https://en.wikipedia.org/wiki/Specials_(Unicode_block)

I suspect, given what the values represent, that they are probably raw binary numbers whose bytes don't happen to form a valid character. I'm not sure whether (or how) you can tell Splunk to extract them... Hmmm.


There are two directions you can go. One is to identify the actual underlying bytes, in which case you are going to have to use a utility on the file that is capable of seeing whatever is there, and telling you the hex byte values. (How you accomplish this is going to depend on what kind of tech you are using.)
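For the first direction, here is a minimal Python sketch of one way to see the raw bytes. It is only an illustration, not a Splunk feature: the function name `hex_context` and the file name `sample.log` are made up for this example. It reads the data as bytes (no decoding at all) and prints a hex dump of only those rows that contain a byte outside plain ASCII, which is where the diamonds would come from.

```python
from typing import List


def hex_context(data: bytes, width: int = 16) -> List[str]:
    """Return hex-dump lines for every row that contains a non-ASCII byte."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Bytes above 0x7F are not plain ASCII, so these rows are the suspects.
        if any(b > 0x7F for b in chunk):
            hex_part = " ".join(f"{b:02x}" for b in chunk)
            # Show printable ASCII as-is and everything else as a dot.
            text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
            lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return lines


# Hypothetical usage: open the file in binary mode so nothing gets translated.
# with open("sample.log", "rb") as fh:
#     for line in hex_context(fh.read()):
#         print(line)
```

On Linux or macOS, a tool such as `xxd` or `hexdump -C` gives you the same kind of view without writing any code.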

The other is to go the opposite direction, and find out what encoding was used to create the file, and what utilities are being used to transmit it along the road to you. Somewhere along the path, some "helpful" machine is translating the data from one encoding to another.

https://www.centos.org/forums/viewtopic.php?t=54437
http://www.cybervaldez.com/how-to-remove-those-nasty-question-mark-with-a-diamond-symbols-from-appea...

Here's a suggestion from this page: http://www.webhostingtalk.com/showthread.php?t=622439

You're on the right track - It's a character-set issue. Get a tool that inspects the response headers of the server (like the Firebug extension if you're using Mozilla Firefox) to see what character set the server response is sending with the content. If the server's character-set and the HTML character set of the actual content don't match up, you will see some strange looking characters like those little black diamond squares.

Then again, there's a third method, which is to take the most likely English-language encodings from this page -- http://docs.splunk.com/Documentation/SplunkCloud/6.6.0/Data/Configurecharactersetencoding -- and try each one to see what happens. Since the rest of the logs are all in English, I would rule out all the non-English encodings.
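If you want to try that third method outside Splunk first, something like the sketch below might help. The candidate list and the function name `try_encodings` are my own assumptions for illustration; swap in whichever encodings from the Splunk page look plausible for your source. It decodes the same bytes strictly with each candidate and records which ones fail.

```python
from typing import Dict, Optional, Sequence

# Assumed shortlist of encodings common for English-language logs.
CANDIDATES = ("utf-8", "cp1252", "latin-1", "utf-16-le")


def try_encodings(data: bytes,
                  candidates: Sequence[str] = CANDIDATES) -> Dict[str, Optional[str]]:
    """Decode data with each candidate; None means the bytes are invalid in it."""
    results = {}
    for name in candidates:
        try:
            # Strict decoding raises instead of silently inserting U+FFFD.
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical usage: look at what each encoding makes of a suspicious line.
# for name, text in try_encodings(raw_line).items():
#     print(name, "->", "FAILED" if text is None else repr(text))
```

A successful decode is not proof of the right encoding. `latin-1` in particular accepts every possible byte, so you still have to check whether the result looks sensible.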

0 Karma

Ultra Champion

Is it possible to paste the sequence of characters here?

0 Karma

Path Finder

I'm sorry. What do you mean by sequence of characters? Currently there is only one black diamond with a question mark inside in my question.

0 Karma

SplunkTrust

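You can't tell from the screen how many bytes are behind a character: depending on the encoding, a single Unicode character can take anywhere from 1 to 4 bytes. That single black diamond might be standing in for 2-4 bytes. (I'm betting it's a 4-byte binary integer.)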
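To illustrate why the diamond itself can't be pasted back as evidence, here is a small Python sketch. The byte string is made up for the example and has nothing to do with the actual log. Once a decoder has substituted U+FFFD, the original byte is gone, and re-encoding the diamond gives you different bytes from the ones you started with.

```python
# Made-up sample: one byte (0xFF) that can never appear in valid UTF-8.
raw = b"value=\xff;"

# A lenient decoder swaps the undecodable byte for U+FFFD, the diamond.
decoded = raw.decode("utf-8", errors="replace")
print(repr(decoded))

# Encoding the diamond again yields its own 3-byte UTF-8 form,
# not the original 0xFF, so the source byte cannot be recovered.
round_trip = decoded.encode("utf-8")
print(round_trip.hex(" "))
print(raw.hex(" "))
```

That is why looking at the raw bytes of the original file is more useful than copying the diamond out of the preview.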

0 Karma