Getting Data In

Why are my headers being indexed from my csv file?

lquinn
Contributor

I am trying to do a simple monitor data input of a csv file with the following format:

 Id,User,Action,_time,Comment
 6783493,Laura,Purchase,1426503622.15467,Some Comment

I have tried several different configurations but each time the headers get indexed! The csv file changes when a saved search runs and outputs the csv. The headers never change. Can anyone tell me what I'm doing wrong? Surely just the standard csv sourcetype should do the trick? Thanks!

1 Solution

lquinn
Contributor

The search that I was using to populate my csv extracted the header fields using the rex command. For some reason, when I then wrote over the monitored csv, this caused Splunk to index the headers. I changed my setup so that the header fields were extracted in props.conf rather than in the search string, and the headers stopped being indexed! Not sure exactly why this was, but there you go!
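
For anyone wanting to try the same change, a search-time extraction can live in props.conf on the search head as an EXTRACT- setting instead of an inline rex. The sketch below is only an illustration: the sourcetype name, the class name and the regex are all hypothetical, since the actual search and raw event format were not posted in this thread.

```ini
# props.conf on the search head
# Hypothetical sourcetype of the events that the saved search reads.
[my_source_events]
# Search-time equivalent of an inline
#   | rex "user=(?<User>\w+)\s+action=(?<Action>\w+)"
# The regex is an assumption; replace it with the pattern from your own rex.
EXTRACT-user_action = user=(?<User>\w+)\s+action=(?<Action>\w+)
```

With the extraction defined this way, the saved search no longer needs the rex step before it writes out the csv.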

0 Karma

proletariat99
Communicator

Sooo... I've been battling this same thing off and on for the last couple of years. I've learned a few things that might help. First, you have to decide whether you're indexing all extracted fields (not recommended) or doing search-time field extractions. What happens to me is that I always test my extractions on a standalone box, where they work like a champ, and then everything breaks down in our distributed prod / uat environment. Regardless, this might help:

For search-time extractions, most of the relevant props.conf entries will be on the search head. The indexer will only have settings associated with index-timey things (like timestamp, linemerge, line breaker, host, sourcetype, etc -- all the lightweight schema stuff). On the SH, though, you can use a combination of these settings to do the extractions from the header:

CHECK_FOR_HEADER = TRUE
HEADER_FIELD_LINE_NUMBER = <NUMBER> (this one is cranky and unreliable, but sometimes works)
KV_MODE = <json, xml, auto, etc>  (this one is also cranky and unreliable, especially with xml)

and if you want to be explicit (recommended in a lot of cases), you can use REPORT

REPORT-name-of-report = name_of_transforms.conf_stanza

Then transforms.conf on the search head will look something like this:

[name_of_transforms.conf_stanza]
DELIMS = ","
FIELDS = field1, field2, field3, etc... (these values match the values in the header IDENTICALLY)
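
Putting those two pieces together, here is a minimal sketch for the two-column os,range file posted further down this thread. The sourcetype name and the transforms stanza name are made up for illustration; only the column names come from that sample file.

```ini
# props.conf on the search head (sourcetype name is hypothetical)
[os_range_csv]
REPORT-os_range_fields = os_range_fields

# transforms.conf on the search head (stanza name is hypothetical)
[os_range_fields]
# Split each event on commas and name the pieces in order.
DELIMS = ","
FIELDS = "os", "range"
```

As far as I know, these settings only name the fields at search time. They do not by themselves stop the header row from being indexed as an event.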

Soooo... while this is pretty recommended for a large-scale distributed environment, it doesn't work well a lot of the time because of the relationship between line breakers and timestamp extractions on the indexers and the search head .conf files. Essentially, you set it all up, you think it should work and then it doesn't (but it did on a standalone)... then troubleshooting sucks.

For index-time extractions, you can use a combination of the following settings:

INDEXED_EXTRACTIONS = <blah>
PREAMBLE_REGEX = <match some pattern in the header>  (this actually ignores the first line, but uses it for the field names.)  

So if you use PREAMBLE_REGEX but want search-time extractions, you can't, because that line is ignored by the time the search head sees it.

Another method of troubleshooting, even if you don't plan on indexed extractions, is to turn on

INDEXED_EXTRACTIONS = csv

To see if it's your extractions on the SH or something else that's causing the problem.
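
For the csv in the original question, a test stanza along these lines might be a reasonable starting point. The sourcetype name is hypothetical, and pointing TIMESTAMP_FIELDS at the _time column is an assumption based on the sample row. My understanding is that structured-data settings like these need to sit on the instance that first reads the file (for example the forwarder), so treat the placement as something to verify in your own environment.

```ini
# props.conf on the instance that reads the file
# (sourcetype name is hypothetical)
[saved_search_csv]
INDEXED_EXTRACTIONS = csv
HEADER_FIELD_LINE_NUMBER = 1
FIELD_DELIMITER = ,
# Assumption: take the event time from the _time column, which holds
# epoch seconds in the sample row from the question.
TIMESTAMP_FIELDS = _time
```

If the header still shows up as an event with this in place, the extractions on the search head are probably not the culprit.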

And then there's the fishbucket :)... but that's another story... the hits just keep on coming...

esix_splunk
Splunk Employee

Did you try to use

INDEXED_EXTRACTIONS = csv and
HEADER_FIELD_LINE_NUMBER = 1 for this source type?

http://docs.splunk.com/Documentation/Splunk/latest/Data/Extractfieldsfromfileheadersatindextime#Prop...

0 Karma

lquinn
Contributor

Yep, I've tried that, and headers are still being indexed.

0 Karma

proletariat99
Communicator

same here. bug?

0 Karma

lquinn
Contributor

The strange thing is that the headers are being extracted as field names but also as values. So I have User as a field value for User!

0 Karma

esix_splunk
Splunk Employee

I've just created a brand new csv and indexed it with the following:

props.conf
[indexed_extractions_test]
HEADER_FIELD_LINE_NUMBER=1
FIELD_DELIMITER=,
INDEXED_EXTRACTIONS = csv

csv-
os,range
AIX:Version,aix
FreeBSD:Version,freebsd
HPUX:Version,hpux
Linux:Version,linux
OSX:Version,osx
Solaris:Version,solaris
Unix:Version,unix

$SPLUNK_HOME/bin/splunk add oneshot <path_to_csv> -index main -sourcetype indexed_extractions_test

The results are accurate. CSV is indexed without the header, and I have KV pairs for os=*:Version and range=solaris etc.

Make sure you are deleting the old indexed data before rerunning it.

0 Karma

lquinn
Contributor

I've worked out that it probably isn't the configuration at all. I can index a csv and it works fine, but when I overwrite it with my search, that is when it starts indexing the headers.

0 Karma

esix_splunk
Splunk Employee

What are you doing in your search? Also note that _time is a reserved field. So using this fieldname could create problems.

0 Karma

lquinn
Contributor

I've sorted it, thanks very much for your help! Sometimes you just need a bit of inspiration!

0 Karma

esix_splunk
Splunk Employee

Please let us know what you encountered, might help others down the road!

0 Karma

lquinn
Contributor

That's what I'm thinking too. I'm going to try a few things and I will let you know!

0 Karma