Getting Data In

What's the trick to get unarchive_cmd to work for a custom archive format?

Lowell
Super Champion

I've been attempting to get the unarchive_cmd to work with no success. I decided to abandon my ultimate goal for the purpose of troubleshooting and what I'm finding is that I can't get even the most simple use-case to work.

My simplified use case is simple adding support for an "imaginary" gzip-2 format. (Which is really just gzip, with ".gz2" as the extension instead of ".gz"). So I copied Splunk's default props.conf settings for handling "gzip" files and just basically add a "2" to the end of everything. My configuratoin looks like so:

props.conf:

[preprocess-gzip2]
invalid_cause = archive
is_valid = False
LEARN_MODEL = false

[source::....gz2(.\d+)?]
unarchive_cmd = gzip -cd -
sourcetype = preprocess-gzip2
NO_BINARY_CHECK = true

inputs.conf:

[monitor:///tmp/indexme] 

I've tried adding unarchive_cmd to the [preprocess-gzip2] with no luck. I also added priority = 90 to [source::....gz2(.\d+)?]. Again no luck. I also tried adding sourcetype=preprocess-gzip2 to the monitor stanza -- no change.

The behavior I'm seeing is this:

  1. Any normal gzip file ending with .gz is indexed successfully.
  2. Any gzip file ending with .gz2 fails to be indexed at all. No meaningful error (or debug) messages explaining why.

I cranked up a few log sources for debugging purposes. Perhaps someone can shed some light on what I'm doing wrong. Again, my long-term goal is to implement some pre-processing command other than gzip -cd - but I'm just trying to boil this down to the most simple case to get things started.

Splunkd log - First attempt - Loading a .gz2 file [FAILS]

07-04-2014 15:34:21.631 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test1-custom-handler-gz2.gz2' updated less than 10000ms ago, will not read it until it stops changing.
07-04-2014 15:34:31.632 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test1-custom-handler-gz2.gz2' has stopped changing, will read it now.
07-04-2014 15:34:31.632 +0000 INFO  ArchiveProcessor - handling file=/tmp/indexme/test1-custom-handler-gz2.gz2
07-04-2014 15:34:31.633 +0000 INFO  ArchiveProcessor - reading path=/tmp/indexme/test1-custom-handler-gz2.gz2 (seek=0 len=1337)
07-04-2014 15:34:31.633 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.719 +0000 DEBUG ArchiveProcessor -   Attempting to load indexed extractions config from conf=/tmp/indexme/test1-custom-handler-gz2.gz2||preprocess-gzip2| ...
07-04-2014 15:34:31.720 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   Basing small file initCrc on path=/tmp/indexme/test1-custom-handler-gz2.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   This archive stream's initcrc=0x2e6fa8b616493a17.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   Found no initcrc match for this stream, will re-read entire file.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveContext - Early termination because write has false notification
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.747 +0000 DEBUG ArchiveProcessor - fname=/tmp/indexme/test1-custom-handler-gz2.gz2 fishstate=key=0xb5635587184bcff4 sptr=1337 scrc=0x65e7d9f9bdd1b805 fnamecrc=0xba14100c169eff7e modtime=0
07-04-2014 15:34:31.747 +0000 INFO  ArchiveProcessor - Finished processing file '/tmp/indexme/test1-custom-handler-gz2.gz2', removing from stats

Splunkd log - Second attempt - Loading a typical .gz file [WORKS]

07-04-2014 15:36:18.722 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test2-standard-gz-file.gz' updated less than 10000ms ago, will not read it until it stops changing.
07-04-2014 15:36:28.724 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test2-standard-gz-file.gz' has stopped changing, will read it now.
07-04-2014 15:36:28.725 +0000 INFO  ArchiveProcessor - handling file=/tmp/indexme/test2-standard-gz-file.gz
07-04-2014 15:36:28.725 +0000 INFO  ArchiveProcessor - reading path=/tmp/indexme/test2-standard-gz-file.gz (seek=0 len=1344)
07-04-2014 15:36:28.725 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.858 +0000 DEBUG ArchiveProcessor -   Attempting to load indexed extractions config from conf=/tmp/indexme/test2-standard-gz-file.gz||test2-standard-gz-file-too_small| ...
07-04-2014 15:36:28.858 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   Basing small file initCrc on path=/tmp/indexme/test2-standard-gz-file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   This archive stream's initcrc=0xcea312efbb705310.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   Found no initcrc match for this stream, will re-read entire file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveContext - Early termination because write has false notification
07-04-2014 15:36:28.867 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.878 +0000 INFO  databasePartitionPolicy - idx=main Creating hot bucket=hot_v1_3, given event timestamped=1404488177
07-04-2014 15:36:28.879 +0000 INFO  DatabaseDirectoryManager - Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/defaultdb/db'.  Reason='Adding bucket, bid=main~3~BCB6BEA7-CE1C-486C-BA49-07E40BAB59B1'
07-04-2014 15:36:28.880 +0000 DEBUG ArchiveProcessor - fname=/tmp/indexme/test2-standard-gz-file.gz fishstate=key=0x72a9c05c9c36e767 sptr=1344 scrc=0xb4910955c707a2a2 fnamecrc=0x84c51a934e8fdaab modtime=0
07-04-2014 15:36:28.880 +0000 INFO  ArchiveProcessor - Finished processing file '/tmp/indexme/test2-standard-gz-file.gz', removing from stats

Final thoughts

It appears as though the ArchiveContext <FILE> is NOT an archive file is bogus message since it appears for all archive files; the ones that work and the ones that don't.

The content of the two test files were created using a simple shell command for testing only.

( echo "`date` This is a test event ... ;  dd if=/dev/urandom of=/dev/stdout count=1 | xxd ) | gzip -c - > /tmp/indexme/<SOME_FILE_NAME_ENDING_IN_GZ_OR_GZ2>

I'm testing this on Splunk 6.1.1 (splunk-6.1.1-207789.x86_64 rpm) running on CentOS 6.4 using an AWS t1.micro instance.

Update:

I changed my extension from .gz2 to .xy (randomly chosen) just to make sure I wasn't some hitting the wrong [source:*] stanza. No change.

I added the following to my local props.conf just to prove to myself that Splunk is actually calling the unarchive_cmd


[source::....(?>!tar.)gz(.\d+)?]
unarchive_cmd = gzip -cd - | sed -re 's/secret/HIDDEN/g'
And this does work. I also tried this with an invalid command, and a typo in the sed command and all of those issues were logged in splunkd.log.
07-04-2014 16:26:15.102 +0000 ERROR ArchiveContext - From archive='/tmp/indexme/test5-standard-gz-file-with-wacky-archivecommandoverride.gz':  /bin/sh: NOT_A_VALID_COMMAND: command not found
07-04-2014 16:27:11.112 +0000 ERROR ArchiveContext - From archive='/tmp/indexme/test5-standard-gz-file-with-wacky-archivecommandoverride.gz':  sed: -e expression #1, char 9: unknown command: `*'

I feel like this should work without me having arbitrarily add a .gz to the end of my file names.

rkirkw
Path Finder

I was able to make it work with these files:

I am converting binary data to ascii data with a perl script as the unarchive_cmd

inputs.conf

[monitor:///var/log/huawei/*.cdr]
disabled = false
followTail = 0
host = huawei_switch
index = huawei
sourcetype = huawei_cdr

props.conf

[source::/var/log/huawei/*.cdr]
invalid_cause = archive
unarchive_cmd = /usr/local/bin/huawei2text.pl
sourcetype = huawei_cdr
NO_BINARY_CHECK=true

Caveat 1:

The props.conf file would not work if I used sourcetype [huawei_cdr] instead of the [source::/var/log/huawei/*.cdr] listed above.

Caveat 2:

It would not work with a "known" file type such as .dat or .txt
I found this out by running splunk in debug mode and it set the priority for it's understanding of a .dat file higher than what I was telling it in my props and inputs files. I changed the extention to a .cdr file and it was able to execute the unarchive_cmd and index the binary files properly.

Hope this helps!