topic Hidden Characters in a .csv datasource in Splunk Search

Hidden Characters in a .csv datasource

raysonjoberts — Mon, 04 Mar 2024 18:11:26 GMT

I am having a random issue where it seems characters are present in a field which cannot be seen.
If you look in the results below, even though the results appear to match each other, Splunk does see these as 2 distinct values.
If I download and open the results, one of the two names has characters in it that are not seen when looking at the results in the Search App. If I open the file in my text editor, one of the two names is in quotes, if I open the file in Excel, one of the two names is preceded by ‚Äã.

It feels like a problem with the underlying lookup files (.csv), however this problem is not consistent, only a very small percentage of results has this incorrect format (<.005%). Trying to use regex or replace to remove non-alphanumeric values in a field does not seem to work, I am at a loss with it. Any idea how to remove "non-visible" characters or correct this formatting?

Re: Hidden Characters in a .csv datasource

bowesmana — Mon, 04 Mar 2024 22:53:07 GMT

Without knowing what the characters actually are, I can suggest this eval logic that may help you clean up the data

| eval tmpVM=split(VM, "") | eval newVM=mvjoin(mvmap(tmpVM, if(tmpVM>=" " AND tmpVM<="z", tmpVM, null())), "")

which will break the string up into the individual characters and then the mvmap will check that each character is between space and lower case z (which will cover most of the printable ASCII chars) and join it back together again.

If the goal is to fix up the csv, then this should work and you can rewrite the csv, but if this is a general problem with the CSV being written regularly, then you should try to see if you can understand the data that's getting in.

It sounds like it could be an encoding issue and there may be some spurious UTF (8 or 16) characters in there.

Those "‚Äã" character are all valid characters but they would show up as such in Splunk, so Excel is just doing its best.

You could also add in x=len(VM) to see how many additional characters are there, but you will also see the tmpVM variable in the above eval snippet shows you what Splunk thinks of the data.

Re: Hidden Characters in a .csv datasource

raysonjoberts — Tue, 05 Mar 2024 00:15:26 GMT

Thank you @bowesmana . I got sick of beating my head against a wall and put in a workaround of sorts. The .csv in question came from an outputlookup I ran against some indexed data. For some reason, I could not filter out non-alphanumeric characters from the .csv itself, but I could with the indexed data.. So I filtered it out with a rex statement, then re-ran my outputlookup to create a new .csv.

Thank you for taking the time to reply!