I'm having som issues with how Splunk is handling event breaking for CSV files.
A sample of the CSV file in question:
USER;SYSTEMINFO;DATE;RT;VT;MESSAGE
USER1;Windows 10 Enterprise;20201128 06:05:27,862;5;0;"File not accessible ("90s"). Error: "Try again"."
USER2;Windows 10 Enterprise;20201128 06:05:29,288;15.9;5;""
USER3;Windows 10 Enterprise;20201128 06:15:25,463;5;0;"File not accessible ("90s"). Error: "Try again"."
USER4;Windows 10 Enterprise;20201128 06:15:26,830;21.3;0;""
In props.conf i have tried the following (this is a single instance installation of splunk with no additional forwarder and the input is local in this instance of Splunk):
[MyCSV]
FIELD_DELIMITER = ;
FIELD_QUOTE = "
HEADER_FIELD_DELIMITER = ;
HEADER_FIELD_QUOTE = None
SHOULD_LINEMERGE = false
KV_MODE = none
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
category = Structured
description = CSV input
disabled = false
pulldown_type = true
This works perfect in the cases where MESSAGE contains two double quotes. in the cases (like the example i provided) where the MESSAGE field contains multiple double quotes Splunk can't seem to break the event properly.
One event would end up like this:
USER1;Windows 10 Enterprise;20201128 06:05:27,862;5;0;"File not accessible ("90s"). Error: "Try again"."
USER2;Windows 10 Enterprise;20201128 06:05:29,288;15.9;5;""
USER3;Windows 10 Enterprise;20201128 06:15:25,463;5;0;"File not accessible ("90s"). Error: "Try again"."
But i would expect that to be broken into three separate events. From what i can gather Splunk has issues with the multiple double quotation marks and completely ignores the line_breaker regex.
If i change FIELD_QUOTE = " and instead put FIELD_QUOTE = None it does seem to break the events like it should, but the auto-extracted MESSAGE field will then contain quotation marks. With FIELD_QUOTE=" Splunk removes the double quotes around the event but fails to break the event properly.
I have also tried to change the double quotes inside the outer double quotes with SEDCMD like this:
SEDCMD-removeinnerquotes = s/(?<!;)"(?![\r\n]|$)/'/g
This is working so the indexed MESSAGE field is changed from:
"File not accessible ("90s"). Error: "Try again"."
to:
"File not accessible ('90s'). Error: 'Try again'."
However: Splunk is still auto extracting the MESSAGE field BEFORE my SEDCMD kicks in. So one event will now look like this:
USER1;Windows 10 Enterprise;20201128 06:05:27,862;5;0;"File not accessible ('90s'). Error: 'Try again'."
USER2;Windows 10 Enterprise;20201128 06:05:29,288;15.9;5;""
USER3;Windows 10 Enterprise;20201128 06:15:25,463;5;0;"File not accessible ('90s'). Error: 'Try again'."
And the MESSAGE field in the left side meny under "Interesting fields" will have the value like this:
File not accessible ("90s"). Error: "Try again".
So because this is not working i tried to remove all double quotes with SEDCMD like this:
SEDCMD-removequotes = s/(?<!;)"(?![\r\n]|$)/'/g s/"//g
FIELD_QUOTE=None
Causing event breaking to be done correctly (because of FIELD_QUOTE = None i suspect) and one event will now be correctly:
USER1;Windows 10 Enterprise;20201128 06:05:27,862;5;0;File not accessible ('90s'). Error: 'Try again'.
HOWEVER:
The MESSAGE field on the left hand side under "Interesting fields" still contains double quotes. Despite the event above where all double quotes is either changed or removed, the MESSAGE field will have the value like this:
"File not accessible ("90s"). Error: "Try again"."
- This seem to indicate that whatever i try Splunk will extract the field before any change i do and therefore make it near impossible to fix this issue. I have not set INDEXED_EXTRACTION = CSV and have left KV_MODE = none to try to keep Splunk from extracting fields at both indexing and search time, but it still extracts the fields. My guess is that this has to do with the header. As long as Splunk is looking for and using the header in the CSV file it will also auto-extract the fields. I have not found a way to change this behavior.
In my desperation i also tried to write a transforms stanza to remove the header, then remove the HEADER_FIELD_DELIMITER and HEADER_FIELD_QUOTE settings, but STILL Splunk is extracting the header and the fields that goes with them.
There is the HEADER_FIELD_LINE_NUMBER setting, but this can't be set to "none" and defaults to 0. Even if i set this to HEADER_FIELD_LINE_NUMBER = 1 and have the transforms to remove the header, Splunk extracts the HEADER fields automatically.
Does anyone have any tips for fixing this issue? What i want is the events to be broken down correctly with 1 event per line and to have the MESSAGE field values without quotation marks.
The obvious and best solution would be for the system to change it's logger and never include multiple double quotation marks in the MESSAGE text, but this is not possible i'm afraid, so i'm stuck trying to find a solution in Splunk.
Note again that this is a single instance Splunk Enterprise installation with no forwarder (local input).
In the hopes that it might help someone else this is the "solution" i've come up with:
I noticed that as long as i use any of the settings realted to structured data, like these:
FIELD_DELIMITER =
FIELD_QUOTE =
HEADER_FIELD_DELIMITER =
HEADER_FIELD_QUOTE =
Splunk seems to index csv fields no matter what i try to do.
When i remove all these settings and have a sourcetype that looks like this:
[MyCSV]
SHOULD_LINEMERGE = false
KV_MODE = none
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
disabled = false
splunk does NOT index fields and i can use props and transforms to "fix" the data (remove header line and extract fields at search time). This solved my problem.