
Splunk/Hunk snappy orc files: no field extraction in Fast Mode

burwell
SplunkTrust

Basic problem: in Smart Mode my fields are not extracted; everything works in Verbose Mode. Time-based searching also works, so I know the way I specify the time field is correct.

Search that fails: index=foo | stats count by hii (or any field that isn't partitioned)

I have looked at the previous questions on Hunk extractions and smart mode (e.g. https://answers.splunk.com/answers/147879/why-hunks-field-extractor-behaves-differently-in-smart-mod...) but I cannot get mine to work.

  • We are using log files generated by Spark; they are Snappy-compressed ORC files named ...snappy.orc
  • There is no metastore, so I provide a fake database and table name to make Splunk happy
  • I specify the exact fields and their types
  • I tried making all (or some) of the fields required, per Leon B's posts, but that didn't help
  • I have the Snappy jar in THIRD_PARTY_JARS, and Splunk is able to decompress the ORC files

indexes.conf

vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter.hive.columnnames  = cqtq, ttms, chi, crc, pssc, psql, cqhm, cquc, caun, phr, psct, cquuc, cqtr, cqssl, cqssr, pitag, sstc, psqql, ttsfb,ttrq, cqbl, pttsfb, tfstoc, sscl, UA, tsso, sscc, phi, chp, Carpcqh, sssc, cqssv, cqssc, hii
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string:string:string,string,int:int:int:string:int:int:int:int:bigint:int:bigint:bigint:string,int:int:string:int:string:string:string:string:string
vix.input.1.required_fields           = cqtq,ttms,UA,hii
# Completely made up values to satisfy Splunk                                                                                                                      
vix.input.1.splitter.hive.tablename  = transfered
vix.input.1.splitter.hive.dbname     = default
  • In my provider I have vix.splunk.search.splitter = HiveSplitGenerator
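The two column lists above are easy to get out of sync, because columnnames is comma-separated while columntypes is colon-separated. A quick sanity check (a standalone sketch with hypothetical shortened lists, not Splunk code) can catch a stray comma before Splunk silently drops the extractions:

```python
# Minimal sketch: verify that the comma-separated columnnames list and
# the colon-separated columntypes list have the same number of entries.
names_raw = "cqtq, ttms, chi, hii"          # hypothetical shortened list
types_raw = "string:int:string,string"      # contains a stray comma (a bug)

names = [n.strip() for n in names_raw.split(",")]
types = types_raw.split(":")                # commas are NOT separators here

bad = [t for t in types if "," in t]        # tokens hiding a comma typo
if len(names) != len(types):
    print(f"mismatch: {len(names)} names vs {len(types)} types")
    print("suspect type tokens (contain commas):", bad)
```

Run against the full lists from indexes.conf, this flags exactly the misalignment described in the accepted answer below.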

props.conf

[source::/projects/flickr/flopsa/ycpi_spark/orc/...]
priority          = 202
sourcetype        = foo                                                                                                                                     
NO_BINARY_CHECK   = true

[foo]
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE  = false
TIME_PREFIX       = cqtq\":
TIME_FORMAT       = %s.%3N

(Note: I also tried these two settings, which also make the time search work, but still no fields.)

eval-_time=strptime('cqtq',"%s.%3N")
EXTRACT-_time=strptime('cqtq',"%s.%3N")
1 Solution

burwell
SplunkTrust

I got help from Splunk (thanks, Raanan!), and here is my solution so others can benefit.

My indexes.conf

  1. In my long columntypes list I had some commas instead of colons as separators, so the columnnames no longer aligned with the columntypes. Above, where I wrote "string, int", it should be "string:int". What I learned from Raanan was to start with just the first few columns and, once that works, add in the other fields.

  2. I shouldn't need to specify a dummy database and table name; we are filing a bug report.

  3. I specified the following in the index (not the provider). This way you don't need several different providers; you can reuse one:

    vix.input.1.splitter.hive.fileformat = orc
    vix.input.1.splitter = HiveSplitGenerator

So, altogether, this is what worked (I renamed the columns and shortened the lists to make things clearer):

[foo]
vix.provider                      = bt
vix.input.1.path                  = /my/path/...
vix.input.1.accept                = \.orc$
vix.input.1.ignore                = .+SUCCESS
vix.input.1.et.regex              = /my/path/regex...
vix.input.1.et.format             = yyyyMMddHH
vix.input.1.et.offset             = 0
vix.input.1.lt.regex              = /my/path/regex...
vix.input.1.lt.format             = yyyyMMddHH
vix.input.1.lt.offset             = 3600
vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter                  = HiveSplitGenerator
vix.input.1.required_fields           = cqtq,b
vix.input.1.splitter.hive.columnnames = cqtq,b,c,d,e,f,g,h,i 
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string.. etc
# Completely made up values to satisfy Splunk bug                                                                                       
vix.input.1.splitter.hive.tablename  = default
vix.input.1.splitter.hive.dbname     = default
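For reference, the et/lt settings above tell Hunk how to derive each file's time range from its path: the regex capture is parsed per et.format/lt.format (yyyyMMddHH), then the offsets (in seconds) are added. A minimal sketch of that logic, assuming a hypothetical directory layout with a yyyyMMddHH path component (the real paths are elided above):

```python
# Sketch of the et/lt time-range derivation, not Splunk's implementation.
import re
from datetime import datetime, timedelta, timezone

path = "/my/path/2016062110/part-00000.snappy.orc"   # hypothetical layout
m = re.search(r"/(\d{10})/", path)                   # stands in for et.regex
stamp = datetime.strptime(m.group(1), "%Y%m%d%H").replace(tzinfo=timezone.utc)

et = stamp + timedelta(seconds=0)      # vix.input.1.et.offset = 0
lt = stamp + timedelta(seconds=3600)   # vix.input.1.lt.offset = 3600 (one hour)

print(et.isoformat(), "->", lt.isoformat())
```

With hourly directories, the 3600-second lt offset makes each file's window cover the full hour it was written in, so time-bounded searches don't skip files.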

Props.conf
To get search by time working properly, I used the following in my props.conf. My time field is called cqtq: a 10-digit Unix timestamp followed by a period and 3 digits, at the beginning of each record.

eval-_time                = strptime('cqtq',"%s.%3N")
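As a sanity check on that format, here is how a cqtq value maps to an epoch time, mirroring the %s.%3N conversion (a standalone Python sketch with a made-up sample value, not Splunk's parser):

```python
# Sketch: a cqtq value is epoch seconds plus milliseconds, so %s.%3N
# amounts to parsing the whole string as a fractional epoch timestamp.
from datetime import datetime, timezone

cqtq = "1466503845.123"                      # hypothetical sample value
epoch = float(cqtq)                          # %s.%3N == fractional epoch
ts = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(ts.isoformat(timespec="milliseconds"))
```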
