Solved: How to not index first X lines in CSV file ??

joshd · ‎02-08-2011

Hello,

I am trying to index a CSV file that has data arranged like so:

PHY_Short_CSW_CMA.csv
Serial number : 36358(RAID600)
From : 2011/02/08 15:36
To   : 2011/02/08 16:36
sampling rate : 1

"No.","time","CSW-1D.CACHE-1CA","CSW-2J.CACHE-1CA","CSW-1D.CACHE-2CC","CSW-2J.CACHE-2CC"
"1","2011/02/08 15:37",6,6,6,6
"2","2011/02/08 15:38",6,6,6,6

Now as you can see there is a header in the file just before the data ("No.","time"...) so that's what I want to use to separate the values when indexing the files. However due to the preceding 5 lines (filename, Serial number, From, To, sampling rate) it does not work as expected.

I've tried defining a transform to capture the first 5 lines and send it to the nullQueue then implemented this transform within the props.conf for the specific source:: but it does not seem to catch and the data still gets indexed and everything looks wonky.

Can anyone advise a good way to index files that have data arranged in this manner, or how can I ignore the first 5 lines (there's always 5 lines at the beginning of each file with the same arrangement) and just start indexing after it's processed the header?

Hopefully this make sense. Any help would be appreciated.

Thanks.

Ron_Naken · ‎02-13-2011

I don't believe that using SEDCMD or another transform will allow you to detect the fields from the header, since the header is on the 6th line when the detection takes place. If this is the issue, then you could use CHECK_FOR_HEADER = true with the following:

If the files will only be updated every hour, rather than continuously, you can use a scripted input or any scheduler (1x/hr) to place a cropped version of the files in another location. The new location will be the "monitored files" location.

MANGLE.SH:

#!/bin/bash
for i in *.csv; do sed "1,5d" $i >"/monitoredpath/$i"; done

This simple script will take all .csv files in a folder and place a corresponding cropped version in /monitoredpath. Use /monitoredpath as your monitored file location.

This isn't ideal, but it will even support the case where a new log file is generated of a type not previously seen. And Splunk shouldn't have any issues detecting the changes in the files, even though they are being overwritten every hour -- since new events still appear in the proper place at the end of each file.

Here is another thought, though I haven't tested this -- it seems ideal (in theory) and is easy to implement, if it would work:

Use a transform to make part of the filename be the sourcetype. Do a one-time index of a location containing just enough data for header detection for each sourcetype -- use a script to generate the header files. Then when the real files are indexed, field detection is already configured based on the sourcetype, so you could use SEDCMD or another transform to crop the top prior to indexing. This method would only support known sourcetypes that we pre-indexed for field recognition.

View solution in original post

securediversity · ‎04-29-2014

Well even when this is an old question I struggled about it today.

In my case I found a better way on doing this because it is easier to maintain and to implement in my eyes (I don't wanted to use SED or executing other scripts which would remove comment lines and so on - and btw each external other command or script needs to be maintained and well executed which means time/overhead).

1) this is my csv file:

earliest,latest
c01 first comment line
c02 second comment line
02/05/2014:15:5:0,02/05/2014:16:45:0

2) my search query:

|inputlookup start=2 myabove.csv  | return earliest latest

The caveat is that the first line have to contain the column headers or field names in splunk and you need to adjust the start=X when you need more comment lines. In my case I have made 20 comment lines and using 5 atm so I have some reserved 😉

Hint:

"start=2" is correct although the third line of the csv contains the time in the above example.
"start" will count only for events therefore the first line which contains the field names/header will not count. That said the c01 line would be the first, the c02 the second. And then you need to know that Splunk will always use the next line specified in "start=X". So if you would have 20 comment lines and 1 header "start" will be "start=20". (http://docs.splunk.com/Documentation/Splunk/6.0.1/SearchReference/Inputlookup)

Maybe that helps others (or me when I struggle again sometime).

Best regards
Thomas

Ron_Naken · ‎02-13-2011

I don't believe that using SEDCMD or another transform will allow you to detect the fields from the header, since the header is on the 6th line when the detection takes place. If this is the issue, then you could use CHECK_FOR_HEADER = true with the following:

If the files will only be updated every hour, rather than continuously, you can use a scripted input or any scheduler (1x/hr) to place a cropped version of the files in another location. The new location will be the "monitored files" location.

MANGLE.SH:

#!/bin/bash
for i in *.csv; do sed "1,5d" $i >"/monitoredpath/$i"; done

This simple script will take all .csv files in a folder and place a corresponding cropped version in /monitoredpath. Use /monitoredpath as your monitored file location.

This isn't ideal, but it will even support the case where a new log file is generated of a type not previously seen. And Splunk shouldn't have any issues detecting the changes in the files, even though they are being overwritten every hour -- since new events still appear in the proper place at the end of each file.

Here is another thought, though I haven't tested this -- it seems ideal (in theory) and is easy to implement, if it would work:

Use a transform to make part of the filename be the sourcetype. Do a one-time index of a location containing just enough data for header detection for each sourcetype -- use a script to generate the header files. Then when the real files are indexed, field detection is already configured based on the sourcetype, so you could use SEDCMD or another transform to crop the top prior to indexing. This method would only support known sourcetypes that we pre-indexed for field recognition.

joshd · ‎02-23-2011

Hey ron, thanks for the reply. This was actually the exact solution I ended up coming up with to meet my need. I am going to give your other idea a try on a test system here when I find some time. I appreciate your ideas!

joshd · ‎02-10-2011

gkanapathy - the values from the files header are not being appropriately mapped to the data in the given column. So I do not see "No." mapping to 1 or 2, I do not see "CSW-1D.CACHE-1CA" mapping to 6, etc... Maybe I'm missing something or how exactly this is suppose to work? I thought I've read what I can on the subject to understand, but feel free to point me to anything you may feel suiting. Thanks!

joshd · ‎02-10-2011

ron - These will be monitored files that are generated each hour. The header never changes but the data obviously does

gkanapathy · ‎02-10-2011

In what way it is not working as expected? The extra lines will be indexed, yes, but you should be able to run reports and queries that only look at the "valid" data pretty easily.

bwooden · ‎02-10-2011

How were you dropping the first five lines? Have you tried something like this:

props.conf

[source::.../PHY_Short_CSW_CMA.csv]
TRANSFORMS-nullq = nullQ
REPORT-csv = phy_csv

transforms.conf

[nullQ]
REGEX = ^[^,]+$
DEST_KEY = queue
FORMAT = nullQueue

[phy_csv]
DELIMS = ","
FIELDS = "No.","time","CSW-1D.CACHE-1CA","CSW-2J.CACHE-1CA","CSW-1D.CACHE-2CC","CSW-2J.CACHE-2CC"

joshd · ‎02-10-2011

I was thinking of doing this but there is more than one CSV file that I'm trying to index (517 to be exact) and only about a quarter of them have the same header, so it would be too exhausting to write each individually. Appreciate the help though.

I've also now dropped the first 6 lines by using sed before moving the files to the directory that splunk is monitoring.

Ron_Naken · ‎02-09-2011

Will these be monitored files or files that will be indexed one time and don't change?

How to not index first X lines in CSV file ??

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Splunk App Dev Quarterly Roundup: AI, Agents, and Innovation!

Federated Search for Dynamic Data Self Storage Is Now Generally Available on Splunk ...

Index This | What has many keys but can’t unlock a door?

Join the Conversation