Getting Data In

Duplicate events: unarchive_cmd gets passed the whole file, not just the delta since the last change


The docs make it look like CHECK_METHOD = endpoint_md5 in props.conf should tell Splunk to send only deltas. But any time the source file changes, Splunk gets the whole file and creates duplicates of previously indexed data.

Is this related to priority? If I don't set priority in props.conf (see below), my unarchive_cmd doesn't get run. If I DO set it as below, it runs, but I get duplicates.

Here are my inputs.conf and props.conf settings (stanza headers omitted):

inputs.conf:

sourcetype = txn_summaries_st

props.conf:

invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/
sourcetype = txn_summaries_st
priority = 10
CHECK_METHOD = endpoint_md5
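For reference, those props.conf settings normally sit under a stanza that scopes them to a source; a hypothetical example (the [source::...] path is illustrative, not from my actual config):

```ini
# Hypothetical stanza header; the monitored path is illustrative only
[source::/var/log/app/txn_summaries*.log]
invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/
sourcetype = txn_summaries_st
priority = 10
CHECK_METHOD = endpoint_md5
```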

unarchive_cmd is otherwise doing what I'm looking for. I'm taking a single event containing many batched/compressed transaction timing records and breaking each record into its own event. Splunk is correctly seeing the events I'm sending. However, whenever the source file changes, I get duplicates of previously indexed events.

Hoping this isn't a limitation of unarchive_cmd.

Any ideas?

Thanks in advance


Splunk Employee

I'm not really sure what you're doing here with unarchive_cmd, but it was built to handle formats like gzip files, which cannot be processed without reading the entire file.

Whether Splunk can track content and acquire only the new records from the output of custom unarchive commands, I don't know. If it's possible, it's certainly not documented functionality.

The usual tools to build an input like this are scripted or modular inputs, but that puts the bookmark-tracking problem squarely on the input script.

The easiest solution is to process your compressed data into uncompressed logs ahead of time.



I agree, it would be far easier just to emit the logs with all contents uncompressed, but for performance reasons that's not a great option in our production environment. A background thread periodically emits a single log event with a field containing a gzip'd, base64-encoded collection of JSON records. The output looks something like this:

2014-10-04 11:37:27 [pool-2-thread-1] INFO :  TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[H4sIAAAAAAAAAOWYS09jRxBG/wq6a9eon9XV3hGwNBPNAGKMFCkaoeruaoQCJjImioT47ynCLHPhLry5zs6Pbln36NTj8/Ow+3vz+HR/z9tbeRyWvz8P......yG+8XHwAA]]] 
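For context, the writer side is roughly this shape (a simplified sketch, not our actual code; format_batch is a hypothetical name): gzip a batch of JSON timing records, base64 the result, and log it as one event.

```python
# Simplified sketch of the writer side (format_batch is hypothetical):
# gzip a batch of JSON timing records and base64-encode the result so
# the whole batch fits in a single log event.
import base64
import gzip
import json

def format_batch(records):
    """Return the compressed, encoded payload for one log event."""
    payload = gzip.compress(json.dumps(records).encode("utf-8"))
    return base64.b64encode(payload).decode("ascii")

# The logging call then looks something like:
#   log.info("TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[%s]]]"
#            % format_batch(batch))
```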

As these events are written to the log, we'd like to pick up the deltas and unpack the contents, creating one event per txn to send to Splunk. I have a simple Python script that does this; I just can't figure out how to plug it into Splunk's ingest pipeline so it runs only on the deltas.
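The script is roughly this shape (a simplified sketch; unpack_line is a hypothetical name, and it assumes the [[[...]]] payload gunzips to a JSON array of records):

```python
# Simplified sketch of the unpacking script (unpack_line is a
# hypothetical name; assumes the [[[...]]] payload gunzips to a JSON
# array of transaction records).
import base64
import gzip
import json
import re

PAYLOAD_RE = re.compile(r"\[\[\[([A-Za-z0-9+/=]+)\]\]\]")

def unpack_line(line):
    """Yield one JSON event per transaction record found in a log line."""
    match = PAYLOAD_RE.search(line)
    if not match:
        return
    records = json.loads(gzip.decompress(base64.b64decode(match.group(1))))
    for record in records:
        yield json.dumps(record)
```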

Scripted inputs would work, but I don't want to have to identify deltas in my own script; Splunk's great at that. And waiting until the log rolls over means an unwanted delay in getting the info into Splunk.
