Getting Data In

Indexing CSV files that contain a second header mid-file with its own values

SplunkTrust

Hi!

I am currently working on a fairly complex application in which I index many CSV files contained within ZIP files.

This data has the following tabular format:

timestamp,device1,device2,device3...
timestamp,value1,value2,value3...

And so on, up to 128 columns.

Everything was working perfectly with the following configuration:

props.conf

[hds_perf]

# your settings
INDEXED_EXTRACTIONS=csv
NO_BINARY_CHECK=1
SHOULD_LINEMERGE=false

# set by detected source type
KV_MODE=none
pulldown_type=true

# Time zone of HDS data is UTC/GMT
TZ=UTC

In limits.conf, I had to raise the kv limits to allow more than 50 columns to be indexed:

[kv]
# when non-zero, the point at which kv should stop creating new columns
maxcols  = 512
# maximum number of keys auto kv can generate
limit    = 256
# truncate _raw to this size and then do auto KV
maxchars = 10240

BUT... I recently discovered that the manufacturer's extraction tool (this is big data coming from a storage array) splits some CSV files (mostly for some kinds of devices) into 2 parts within the same file.

At exactly line 1448 of every affected file, a new header is written containing the remaining devices, from 129 to 256 (256 is the technical maximum number of devices per unit).

Splunk can't natively work with that, as mentioned in the docs:

http://docs.splunk.com/Documentation/Splunk/6.1.1/Data/Extractfieldsfromfileheadersatindextime

And especially:

Splunk Enterprise does not support renaming of header fields mid-file

Some software, such as Internet Information Server, supports the renaming of header fields in the middle of the file. Splunk does not recognize changes such as this. If you attempt to index a file which has header fields renamed within the file, Splunk does not index the renamed header field.

Of course, I understand, and the message is clear enough, but I still hope that some advanced technique might work, such as routing some parts of the file to the null queue and not others, or somehow simulating 2 source types for the same file.

Or perhaps some regex trick, I don't know yet...

If anyone has an idea of how this could be managed, I'm sure this would be an interesting case for others 🙂

Thanks in advance for any help and answer!

1 Solution

SplunkTrust

This cannot be natively managed by Splunk; it requires a third-party script to pre-process the data.
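
As a rough illustration of what such a pre-processing step could look like, here is a minimal Python sketch that cuts a file into one section per header, so that each section could be written to its own file before Splunk reads it. It is not the script actually used in this thread. The function names, the sample device names, and the rule "a header line starts with `"No."`" are all assumptions made for the example.

```python
import csv
import io


def split_csv_sections(text, header_prefix='"No."'):
    """Split CSV text into sections, opening a new section at each header line.

    header_prefix is an assumption: any line starting with it is treated
    as a header.
    """
    sections = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(header_prefix) or not sections:
            sections.append([])  # a header (or the very first line) opens a section
        sections[-1].append(line)
    return ["\n".join(section) + "\n" for section in sections]


def parse_section(section_text):
    """Parse one single-header section into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(section_text)))


# Hypothetical sample: two header blocks inside the same file.
sample = (
    '"No.","time","dev1","dev2"\n'
    '"1","2014/06/01 00:00","10","20"\n'
    '"2","2014/06/01 00:01","11","21"\n'
    '"No.","time","dev129","dev130"\n'
    '"1","2014/06/01 00:00","30","40"\n'
    '"2","2014/06/01 00:01","31","41"\n'
)

sections = split_csv_sections(sample)
rows = [parse_section(section) for section in sections]
```

Each element of `sections` is a self-contained CSV with a single header, which is the shape that header-based extraction expects.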

SplunkTrust

You can use a LINE_BREAKER to break the events, like this:

props.conf
[sourcetypeName]
LINE_BREAKER=([\n\r]+)regexThatMatches2ndHeaderHere
TRANSFORMS-aaa=transform1,transform2

transforms.conf
[transform1]
REGEX=regexToExtract128FieldsinData1

[transform2]
REGEX=regexToExtractFieldsInData2
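
To get a feel for what a breaker of this shape does to the file, here is a small Python approximation. It assumes the documented LINE_BREAKER behaviour, where the text matched by the first capture group is discarded and marks the boundary between two events. It is only a simulation and not Splunk's actual implementation. The helper name `break_events` and the sample data are made up for the example.

```python
import re


def break_events(raw, line_breaker):
    """Approximate LINE_BREAKER-style splitting.

    The text matched by capture group 1 is dropped; what precedes it ends
    one event and what follows it starts the next.
    """
    events = []
    start = 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return [event for event in events if event]


# Hypothetical sample with a second header in the middle of the file.
raw = (
    '"No.","time","dev1"\n'
    '"1","00:00","10"\n'
    '"No.","time","dev129"\n'
    '"1","00:00","30"\n'
)

# Break only where a newline is followed by the header marker.
events = break_events(raw, r'([\r\n]+)"No\."')
```

With this pattern the file is cut into two large events, one per header block, and the field extraction for each block is left to the two transforms.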

SplunkTrust

Found this answer while looking for something else, and I disagree that this can't be handled by Splunk. See my answer for more details.

Just note that with large CSV files you may also have to tweak the limits.conf [kv] stanza values to get all the fields to display in search.

SplunkTrust

My raw data header is as follows:

"No.","time",...

SplunkTrust

Just found this post:

http://answers.splunk.com/answers/107021/indexing-data-with-multiple-headers

It seems a line breaker could split my CSV file, as I have a new header like:

No. time Device1 Device2 ...

Tried adding this in data preview:

LINE_BREAKER = ([\r\n]+)"No."

No success yet...
