Hi all - bit of a weird one! I've run out of ideas. Help please!
I'm trying to index some CSV files. However, the first line doesn't contain column headers; instead it holds metadata about the file itself (such as the start/end dates of the data it covers, the original sending organisation, and the type of data it relates to), and it has a different number of "columns" than the rows that follow.
I would like to parse this header row to extract these values, and append them as fields to each event contained within that file.
The Header & Detail rows are easy to locate - the Header is the first row and starts with HDR, and every remaining row is a Detail row starting with DET.
Sample data here would look like this:
HDR,ICPHHAB,10,XXX,XXX,YYYY,7/03/2019,15:04:34,20190307,1344,201902,E,I
DET,############,ZZZZZZ,F,1/02/2019,1,44.597,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,2,44.972,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,3,44.972,0,,X,,,
DET,############,ZZZZZZ,F,1/02/2019,4,44.947,0,,X,,,
What I'm wanting to achieve is to exclude the HDR rows from the search results, and instead pick up those bits of data and append them as fields to each of the DET rows that follow.
Because these are files we are receiving, and they need to follow a certain industry standard, I would like to avoid doing any Python-style pre-processing (which I agree would be the easiest route) in order to leave an audit trail of the original files.
I've tried various things in props.conf, like
PREAMBLE_REGEX = ^HDR\,(.*)
and transforms.conf
REGEX = (?=[^H]*(?:HDR|H.*HDR))^\w+,(?P<field1>\w+),(?P<field2>\d+\.\d+),(?P<field3>\w+), and so on for the remaining fields
But the HDR rows are still being returned in my searches, the fields from the HDR row only appear on those events (they aren't appended to the DET ones), and the field extractions I'm doing in props.conf for the DET rows are being polluted by the HDR row (e.g. col3 contains every value from the third column, regardless of whether it came from a DET or HDR row).
You should write a scripted input that parses the file, does the needed transformations, and writes the results to stdout for Splunk to index.
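Something along these lines is all it needs - a rough sketch only, untested, and the incoming directory plus the choice to append every HDR column onto every DET row are assumptions you'd adjust to your spec:

#!/usr/bin/env python
# Scripted input sketch: for each file waiting in the incoming directory,
# read the HDR values off the first row, then re-emit every DET row with
# those values appended as extra columns so Splunk indexes them per event.
import csv
import glob
import os
import sys

INCOMING_DIR = "/data/incoming"   # assumption - wherever the files arrive

def emit_file(path, out):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or not rows[0] or rows[0][0] != "HDR":
        return                        # not one of these feed files, skip it
    hdr_values = rows[0][1:]          # everything after the literal "HDR"
    writer = csv.writer(out, lineterminator="\n")
    for row in rows[1:]:
        if row and row[0] == "DET":
            writer.writerow(row + hdr_values)

if __name__ == "__main__":
    for path in sorted(glob.glob(os.path.join(INCOMING_DIR, "*.csv"))):
        emit_file(path, sys.stdout)

Run it from an inputs.conf [script://...] stanza on whatever interval suits, give the output its own sourcetype, and set your field extractions up against the wider rows (the original DET columns plus the appended HDR ones).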
As per original post:
Because these are files we are receiving, and they need to follow a certain industry standard, I would like to avoid doing any Python-style pre-processing (which I agree would be the easiest route) in order to leave an audit trail of the original files.
Is there no way to achieve this with REX / REGEX / props.conf / transforms.conf?
If you change the data using transforms, then you haven't left an audit trail of the original file. Your scripted input can always copy the data to an archive location.
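Concretely, swapping the main loop of the sketch above for something like this (the archive path is again an assumption) keeps a byte-for-byte copy of every file exactly as it was received, and because the file is moved out of the incoming directory it also won't be re-emitted on the next run:

import shutil

ARCHIVE_DIR = "/data/archive"     # assumption - untouched originals live here

if __name__ == "__main__":
    for path in sorted(glob.glob(os.path.join(INCOMING_DIR, "*.csv"))):
        dest = os.path.join(ARCHIVE_DIR, os.path.basename(path))
        shutil.move(path, dest)       # the untouched original is your audit trail
        emit_file(dest, sys.stdout)   # emit the enriched events from the archived copy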