Getting Data In

How do I avoid indexing the first X lines in a CSV file?

joshd
Builder

Hello,

I am trying to index a CSV file that has data arranged like so:

PHY_Short_CSW_CMA.csv
Serial number : 36358(RAID600)
From : 2011/02/08 15:36
To   : 2011/02/08 16:36
sampling rate : 1

"No.","time","CSW-1D.CACHE-1CA","CSW-2J.CACHE-1CA","CSW-1D.CACHE-2CC","CSW-2J.CACHE-2CC"
"1","2011/02/08 15:37",6,6,6,6
"2","2011/02/08 15:38",6,6,6,6

Now as you can see, there is a header in the file just before the data ("No.","time"...), and that's what I want to use to separate the values when indexing the files. However, due to the preceding 5 lines (filename, Serial number, From, To, sampling rate), it does not work as expected.

I've tried defining a transform to capture the first 5 lines and send them to the nullQueue, then implemented this transform within props.conf for the specific source:: stanza, but it does not seem to catch; the data still gets indexed and everything looks wonky.

Can anyone advise a good way to index files that have data arranged in this manner, or tell me how to ignore the first 5 lines (there are always 5 lines at the beginning of each file with the same arrangement) and start indexing only after the header has been processed?

Hopefully this makes sense. Any help would be appreciated.

Thanks.

1 Solution

Ron_Naken
Splunk Employee

I don't believe that using SEDCMD or another transform will allow you to detect the fields from the header, since the header is on the 6th line when the detection takes place. If this is the issue, then you could use CHECK_FOR_HEADER = true with the following:

If the files will only be updated every hour, rather than continuously, you can use a scripted input or any scheduler (1x/hr) to place a cropped version of the files in another location. The new location will be the "monitored files" location.

MANGLE.SH:

#!/bin/bash
# Crop the first 5 preamble lines from each CSV into the monitored location.
for i in *.csv; do sed '1,5d' "$i" > "/monitoredpath/$i"; done

This simple script will take all .csv files in a folder and place a corresponding cropped version in /monitoredpath. Use /monitoredpath as your monitored file location.
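A minimal sketch of that wiring (the sourcetype name here is just a placeholder, adjust for your environment):

inputs.conf:

[monitor:///monitoredpath]
sourcetype = phy_csv

props.conf:

[phy_csv]
CHECK_FOR_HEADER = true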

This isn't ideal, but it will even support the case where a new log file of a previously unseen type is generated. And Splunk shouldn't have any issues detecting the changes in the files, even though they are being overwritten every hour -- since new events still appear in the proper place at the end of each file.

Here is another thought, though I haven't tested this -- it seems ideal (in theory) and is easy to implement, if it would work:

Use a transform to make part of the filename be the sourcetype. Do a one-time index of a location containing just enough data for header detection for each sourcetype -- use a script to generate the header files. Then when the real files are indexed, field detection is already configured based on the sourcetype, so you could use SEDCMD or another transform to crop the top prior to indexing. This method would only support known sourcetypes that we pre-indexed for field recognition.
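Untested, but the filename-to-sourcetype piece might look something like this (the stanza names, path, and regex are all placeholders):

props.conf:

[source::/data/csv/*.csv]
TRANSFORMS-settype = set_sourcetype_from_filename

transforms.conf:

[set_sourcetype_from_filename]
SOURCE_KEY = MetaData:Source
REGEX = ([^/]+)\.csv$
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::$1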


securediversity
Explorer

Well, even though this is an old question, I struggled with it today.

In my case I found a better way of doing this, because it is easier to maintain and implement in my eyes (I didn't want to use sed or run other scripts to strip the comment lines and so on; besides, every external command or script needs to be maintained and executed reliably, which means time/overhead).

1) This is my CSV file:

earliest,latest
c01 first comment line
c02 second comment line
02/05/2014:15:5:0,02/05/2014:16:45:0

2) My search query:

| inputlookup start=2 myabove.csv | return earliest latest

The caveat is that the first line has to contain the column headers (the field names in Splunk), and you need to adjust start=X when you need more comment lines. In my case I have made 20 comment lines and am using 5 atm, so I have some in reserve 😉

Hint:

"start=2" is correct although the third line of the csv contains the time in the above example.
"start" will count only for events therefore the first line which contains the field names/header will not count. That said the c01 line would be the first, the c02 the second. And then you need to know that Splunk will always use the next line specified in "start=X". So if you would have 20 comment lines and 1 header "start" will be "start=20". (http://docs.splunk.com/Documentation/Splunk/6.0.1/SearchReference/Inputlookup)

Maybe that helps others (or me when I struggle again sometime).

Best regards
Thomas



joshd
Builder

Hey Ron, thanks for the reply. This was actually the exact solution I ended up coming up with to meet my needs. I am going to give your other idea a try on a test system here when I find some time. I appreciate your ideas!


joshd
Builder

gkanapathy - the values from the file's header are not being appropriately mapped to the data in the given column. So I do not see "No." mapping to 1 or 2, I do not see "CSW-1D.CACHE-1CA" mapping to 6, etc... Maybe I'm missing something about how exactly this is supposed to work? I thought I'd read what I could on the subject, but feel free to point me to anything you feel is suitable. Thanks!


joshd
Builder

Ron - these will be monitored files that are generated each hour. The header never changes, but the data obviously does.


gkanapathy
Splunk Employee

In what way is it not working as expected? The extra lines will be indexed, yes, but you should be able to run reports and queries that only look at the "valid" data pretty easily.
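For example, something like this (a rough sketch; it keeps only rows that start with a quoted row number, as in the sample data):

source="*PHY_Short_CSW_CMA.csv" | regex _raw="^\"\d+\","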


bwooden
Splunk Employee

How were you dropping the first five lines? Have you tried something like this:

props.conf

[source::.../PHY_Short_CSW_CMA.csv]
TRANSFORMS-nullq = nullQ
REPORT-csv = phy_csv

transforms.conf

[nullQ]
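# Route any line containing no comma (the five preamble lines) to the nullQueue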
REGEX = ^[^,]+$
DEST_KEY = queue
FORMAT = nullQueue

[phy_csv]
DELIMS = ","
FIELDS = "No.","time","CSW-1D.CACHE-1CA","CSW-2J.CACHE-1CA","CSW-1D.CACHE-2CC","CSW-2J.CACHE-2CC"

joshd
Builder

I was thinking of doing this, but there is more than one CSV file that I'm trying to index (517 to be exact), and only about a quarter of them have the same header, so it would be too exhausting to write each one individually. I appreciate the help though.

I've also now dropped the first 6 lines by using sed before moving the files to the directory that Splunk is monitoring.
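For reference, that sed step was something along these lines (paths are placeholders):

sed '1,6d' input.csv > /monitoredpath/input.csv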

Ron_Naken
Splunk Employee

Will these be monitored files or files that will be indexed one time and don't change?
