I was trying to ingest some JSON files, but the JSON seems to contain some weird characters (or binary data), and parsing failed.
Example of JSON:
I got this error: ERROR JsonLineBreaker - JSON Stream ID: xxxxxxxxxxxxxxxxxxxxxx had parsing error: Unexpected character while parsing backslash escape: 'x'
I have experimented with a lot of props.conf settings, including setting binary to false. I suspect this has something to do with encoding.
How do I solve this?
Thanks in advance
I checked, and the weird_characters are Chinese characters. I have set the encoding to UTF-8. I even tried modifying my data to "abc": "\weird_characters". However, to no avail; I still cannot parse the data.
Does the JSON string (assuming you have the correct CHARSET in props.conf) actually contain
\x? If so, you may have invalid JSON. Check out the grammar on https://json.org: the only characters that can follow a backslash in a string are slash, backslash, double quote, b, f, n, r, t, or u (when immediately followed by 4 hex digits).
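You can confirm this quickly with any strict JSON parser; for example, Python's standard json module rejects \x exactly the way your ingestion error describes:

```python
import json

# Valid JSON escapes parse fine: \" \\ \/ \b \f \n \r \t and \uXXXX
print(json.loads(r'{"abc": "line1\nline2"}'))   # \n is a legal escape
print(json.loads(r'{"abc": "\u4e2d\u6587"}'))   # \u escapes for Chinese characters

# \x is NOT a valid JSON escape, even though many programming languages accept it
try:
    json.loads(r'{"abc": "\x4e"}')
except json.JSONDecodeError as e:
    print("invalid JSON:", e)
```

If your data needs to carry non-ASCII characters safely through any transport, the \uXXXX form (or raw UTF-8 bytes) is the way to do it; \x sequences will always be rejected by a conforming parser.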
I manually removed the weird_characters, and the JSON file could then be ingested. However, these characters were enclosed in double quotes.
@acharlieh The file does not actually contain \x. However, I thought that due to the encoding of these weird_characters, Splunk might have interpreted them as \x. I have set CHARSET to UTF-8, and the files still produce the same error.
Can anyone help?
Where did you set the CHARSET? Just to double-check: this is on the forwarder (or whichever node performs the ingestion), yes? (Since this is an ingestion-time setting.) And did you restart the forwarder before trying to ingest one of these files again?
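For reference, a minimal props.conf stanza on the ingesting node would look something like the sketch below. The sourcetype name my_json is a placeholder; substitute whatever your input actually uses:

```ini
# props.conf on the forwarder / node performing ingestion
[my_json]
CHARSET = UTF-8
```

If you are not sure the source really is UTF-8, Splunk also accepts CHARSET = AUTO, which asks it to guess the encoding per source; that can be a useful diagnostic even if you don't keep it in production.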
Is the source system actually producing the whole file as UTF-8 encoded JSON? How do you know?
Have you looked at your input in a good hex editor? If you're on a Mac, I like HexFiend, but there are many other good ones out there. The goal of this exercise is to learn the actual bytes being ingested and determine for certain what encoding is actually in place. A good editor will let you try interpreting the bytes as a few different encodings and see what you get when you do so. Using the output of this, and possibly a site like https://fileformat.info/info/unicode/ you can figure out what these "weird" characters actually are and reason about them.
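If you don't have a hex editor handy, a few lines of Python can do the same job: dump the raw bytes and try decoding them as a few candidate encodings. This sketch writes a stand-in file (GB18030-encoded Chinese inside a JSON-ish wrapper, a plausible culprit when UTF-8 decoding chokes on Chinese text); in practice you would read one of your real files instead:

```python
# Stand-in for one of the problem files: Chinese text encoded as GB18030,
# a common non-UTF-8 encoding for Chinese, inside a JSON-like structure.
with open("sample.json", "wb") as f:
    f.write(b'{"abc": "' + "\u4e2d\u6587".encode("gb18030") + b'"}')

data = open("sample.json", "rb").read()

# Hex dump: see the actual bytes that would be ingested
print(data.hex(" "))

# Try interpreting the bytes as a few candidate encodings
for enc in ("utf-8", "gb18030", "big5"):
    try:
        print(enc, "->", data.decode(enc))
    except UnicodeDecodeError as e:
        print(enc, "-> failed:", e)
```

Whichever encoding decodes cleanly into sensible Chinese text is your likely answer, and that is the value CHARSET should be set to (or the file should be transcoded to UTF-8 upstream).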