Basic problem: in smart mode my fields are not being extracted. Everything works in verbose mode. Time-range searching also works, so I know the way I specify the time field is correct.
Search that fails: index=foo | stats count by hii (or any field that isn't partitioned)
I have looked at the previous questions on Hunk extractions and smart mode (e.g. https://answers.splunk.com/answers/147879/why-hunks-field-extractor-behaves-differently-in-smart-mod...) but I cannot get mine to work.
indexes.conf
vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter.hive.columnnames = cqtq, ttms, chi, crc, pssc, psql, cqhm, cquc, caun, phr, psct, cquuc, cqtr, cqssl, cqssr, pitag, sstc, psqql, ttsfb,ttrq, cqbl, pttsfb, tfstoc, sscl, UA, tsso, sscc, phi, chp, Carpcqh, sssc, cqssv, cqssc, hii
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string:string:string,string,int:int:int:string:int:int:int:int:bigint:int:bigint:bigint:string,int:int:string:int:string:string:string:string:string
vix.input.1.required_fields = cqtq,ttms,UA,hii
# Completely made up values to satisfy Splunk
vix.input.1.splitter.hive.tablename = transfered
vix.input.1.splitter.hive.dbname = default
vix.splunk.search.splitter = HiveSplitGenerator
props.conf
[source::/projects/flickr/flopsa/ycpi_spark/orc/...]
priority = 202
sourcetype = foo
NO_BINARY_CHECK = true
[foo]
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = false
TIME_PREFIX = cqtq\":
TIME_FORMAT = %s.%3N
(Note: I also tried the following two settings, each of which also makes the time search work, but fields still are not extracted.)
eval-_time=strptime('cqtq',"%s.%3N")
EXTRACT-_time=strptime('cqtq',"%s.%3N")
I got help from Splunk (thanks, Raanan!) and am posting my solution so others will know.
My indexes.conf
In my long columntypes list I had some commas where there should have been colons as separators, so the columnnames did not align with the columntypes. Above, where I have string, int
etc., it should be string:int.
What I learned from Raanan was to pull out just the first few columns and, once that works, add in the other fields.
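A quick way to catch this kind of mismatch is to count the entries on both sides before pasting them into indexes.conf. A minimal sketch in Python (the field names are abbreviated from the config above; the script itself is just an illustration, not part of Splunk):

```python
# Sanity-check that a Hive splitter columnnames list (comma-separated)
# lines up with the columntypes list (colon-separated).
names = [n.strip() for n in "cqtq, ttms, chi, crc".split(",")]
types = "string:int:string:string".split(":")

assert len(names) == len(types), "columnnames and columntypes are misaligned"

# A stray comma in the types list (my original mistake) shows up immediately:
bad_types = "string:int:string,string".split(":")
print(len(names), len(bad_types))  # → 4 3
```

Run this whenever you extend the lists; a count mismatch means a wrong separator or a missing entry.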
I shouldn't need to specify a dummy database and table name. We are filing a bug report.
In the index stanza (not the provider) I specified the following. This way you don't need several different providers; you can reuse one:
vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter = HiveSplitGenerator
So, altogether, this is what worked (I shortened the columnnames and the list to make things clearer):
[foo]
vix.provider = bt
vix.input.1.path = /my/path/...
vix.input.1.accept = \.orc$
vix.input.1.ignore = .+SUCCESS
vix.input.1.et.regex = /my/path/regex...
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /my/path/regex...
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600
vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter = HiveSplitGenerator
vix.input.1.required_fields = cqtq,b
vix.input.1.splitter.hive.columnnames = cqtq,b,c,d,e,f,g,h,i
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string.. etc
# Completely made up values to satisfy Splunk bug
vix.input.1.splitter.hive.tablename = default
vix.input.1.splitter.hive.dbname = default
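As a side note on the et/lt settings above: the earliest/latest-time regexes pull a timestamp out of each file's path, et.format/lt.format tell Hunk how to parse it (Java-style patterns, so yyyyMMddHH), and the offsets widen the time window. A sketch of that logic in Python, using a hypothetical hour-partitioned path layout (the real regex paths are elided above):

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical layout: one directory per hour, named yyyyMMddHH.
path = "/my/path/2014111118/part-00000.orc"

m = re.search(r"/(\d{10})/", path)  # plays the role of et.regex / lt.regex
# Java's yyyyMMddHH corresponds to Python's %Y%m%d%H.
stamp = datetime.strptime(m.group(1), "%Y%m%d%H").replace(tzinfo=timezone.utc)

et = stamp + timedelta(seconds=0)      # et.offset = 0
lt = stamp + timedelta(seconds=3600)   # lt.offset = 3600: one hour later
print(et.isoformat(), "->", lt.isoformat())
# → 2014-11-11T18:00:00+00:00 -> 2014-11-11T19:00:00+00:00
```

With this window, Hunk can skip files whose path timestamp falls entirely outside the search's time range.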
props.conf
To get search-by-time working properly, I used the following in my props.conf. My time field is called cqtq: a 10-digit Unix timestamp, followed by a period and 3 digits of milliseconds, at the beginning of each record.
eval-_time = strptime('cqtq',"%s.%3N")
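For reference, %s.%3N means epoch seconds plus three fractional digits. A hedged sketch of the equivalent parsing in plain Python (Python's strptime has no %s/%3N, so the value is split manually; the timestamp is a made-up example):

```python
from datetime import datetime, timezone

# A cqtq-style value: 10-digit epoch seconds, a period, then 3 millisecond digits.
raw = "1415731200.123"  # hypothetical example value

secs, _, millis = raw.partition(".")
dt = datetime.fromtimestamp(int(secs), tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"), f"+ {millis} ms")
# → 2014-11-11 18:40:00 + 123 ms
```

This is also a handy way to spot-check that the values landing in cqtq really are what TIME_FORMAT claims before blaming the extraction config.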