Hi all - bit of a weird one! I've run out of ideas. Help please!
I'm trying to index some CSV files. However, the first line doesn't contain column headers; instead it holds metadata about the file itself (such as the start/end dates of the data it covers, the original sending organisation, and the type of data it relates to), and it has a different number of "columns" than the rows that follow.
I would like to parse this header row to extract these values, and append them as fields to each event contained within that file.
The Header & Detail rows are easy to locate - the Header is the first row and starts with HDR, and every remaining row is a Detail row starting with DET.
Sample data here would look like this:
HDR,ICPHHAB,10,XXX,XXX,YYYY,7/03/2019,15:04:34,20190307,1344,201902,E,I
DET,############,ZZZZZZ,F,1/02/2019,1,44.597,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,2,44.972,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,3,44.972,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,4,44.947,0,,X,,,
What I'm wanting to achieve is to exclude the HDR rows from the search results, and instead pick up those bits of data and append them as fields to each of the DET rows that follow.
Because these are files we are receiving, and they need to follow a certain industry standard, I would like to avoid doing any Python-style pre-processing (which I agree would be the easiest route) in order to leave an audit trail of the original files.
I've tried various things in props.conf, like
PREAMBLE_REGEX = ^HDR\,(.*)
and transforms.conf
REGEX = (?=[^H]*(?:HDR|H.*HDR))^\w+,(?P<field1>\w+),(?P<field2>\d+\.\d+),(?P<field3>\w+), and so on for the remaining fields
But the HDR rows are still being returned in my searches, the fields from the HDR row only appear on those events (they aren't appended to the DET ones), and the field extractions I'm doing in props.conf for the DET rows are being polluted by the HDR row (e.g. col3 contains every value from the third column, regardless of whether it came from a DET or HDR row).
You should write a scripted input that parses the file, does the needed transformations, and writes the results to stdout for Splunk to index.
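Something along these lines is all it needs - a rough sketch only, untested, and the incoming directory plus the choice to append every HDR column onto every DET row are assumptions you'd adjust to your spec:

#!/usr/bin/env python
# Scripted input sketch: for each file waiting in the incoming directory,
# read the HDR values off the first row, then re-emit every DET row with
# those values appended as extra columns so Splunk indexes them per event.
import csv
import glob
import os
import sys

INCOMING_DIR = "/data/incoming"   # assumption - wherever the files arrive

def emit_file(path, out):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or not rows[0] or rows[0][0] != "HDR":
        return                        # not one of these feed files, skip it
    hdr_values = rows[0][1:]          # everything after the literal "HDR"
    writer = csv.writer(out, lineterminator="\n")
    for row in rows[1:]:
        if row and row[0] == "DET":
            writer.writerow(row + hdr_values)

if __name__ == "__main__":
    for path in sorted(glob.glob(os.path.join(INCOMING_DIR, "*.csv"))):
        emit_file(path, sys.stdout)

Run it from an inputs.conf [script://...] stanza on whatever interval suits, give the output its own sourcetype, and set your field extractions up against the wider rows (the original DET columns plus the appended HDR ones).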
As per original post:
Because these are files we are receiving, and they need to follow a certain industry standard, I would like to avoid doing any Python-style pre-processing (which I agree would be the easiest route) in order to leave an audit trail of the original files.
Is there no way to achieve this with REX / REGEX / props.conf / transforms.conf?
If you change the data using transforms, then you haven't left an audit trail of the original file. Your scripted input can always copy the data to an archive location.
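Concretely, swapping the main loop of the sketch above for something like this (the archive path is again an assumption) keeps a byte-for-byte copy of every file exactly as it was received, and because the file is moved out of the incoming directory it also won't be re-emitted on the next run:

import shutil

ARCHIVE_DIR = "/data/archive"     # assumption - untouched originals live here

if __name__ == "__main__":
    for path in sorted(glob.glob(os.path.join(INCOMING_DIR, "*.csv"))):
        dest = os.path.join(ARCHIVE_DIR, os.path.basename(path))
        shutil.move(path, dest)       # the untouched original is your audit trail
        emit_file(dest, sys.stdout)   # emit the enriched events from the archived copy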