Getting Data In

Duplicate events: unarchive_cmd gets passed the whole file, not just the delta since the last change


The docs make it look like CHECK_METHOD = endpoint_md5 in props.conf should tell Splunk to send only deltas. But any time the source file changes, Splunk gets the whole file and creates duplicates of previously indexed data.

Is this related to priority? If I don't set priority in props.conf (see below), my unarchive_cmd doesn't get run. If I DO set it as below, it runs, but I get duplicates.

Here are my inputs.conf and props.conf settings (stanza headers omitted):

inputs.conf:

sourcetype = txn_summaries_st

props.conf:

invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/
sourcetype = txn_summaries_st
priority = 10
CHECK_METHOD = endpoint_md5
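For reference, those props.conf settings normally sit under a stanza that scopes them to a source; a hypothetical example (the [source::...] path is illustrative, not from my actual config):

```ini
# Hypothetical stanza header; the monitored path is illustrative only
[source::/var/log/app/txn_summaries*.log]
invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/
sourcetype = txn_summaries_st
priority = 10
CHECK_METHOD = endpoint_md5
```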

unarchive_cmd is otherwise doing what I'm looking for. I'm taking a single event containing many batched/compressed transaction timing records and breaking each record into its own event. Splunk is correctly seeing the events I'm sending. However, whenever the source file changes, I get duplicates of previously indexed events.

Hoping this isn't a limitation of unarchive_cmd.

Any ideas?

Thanks in advance


Splunk Employee

I'm not really sure what you're doing here with unarchive_cmd, but it was built to handle formats like gzip files, which cannot be processed without reading the entire file.

Whether Splunk can track content and acquire only the new records from the output of custom unarchive commands, I don't know. If it's possible, it's certainly not documented functionality.

The usual tools to build an input like this are scripted or modular inputs, but that puts the bookmark-tracking problem squarely on the input script.

The easiest solution is to process your compressed data into uncompressed logs ahead of time.



I agree, it would be far easier just to emit the logs with all contents uncompressed, but for performance reasons that's not a great option in our production environment. A background thread periodically emits a single log event with a field containing a gzip'd, base64-encoded collection of JSON records. The output looks something like this:

2014-10-04 11:37:27 [pool-2-thread-1] INFO :  TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[H4sIAAAAAAAAAOWYS09jRxBG/wq6a9eon9XV3hGwNBPNAGKMFCkaoeruaoQCJjImioT47ynCLHPhLry5zs6Pbln36NTj8/Ow+3vz+HR/z9tbeRyWvz8P......yG+8XHwAA]]] 
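For context, the writer side is roughly this shape (a simplified sketch, not our actual code; format_batch is a hypothetical name): gzip a batch of JSON timing records, base64 the result, and log it as one event.

```python
# Simplified sketch of the writer side (format_batch is hypothetical):
# gzip a batch of JSON timing records and base64-encode the result so
# the whole batch fits in a single log event.
import base64
import gzip
import json

def format_batch(records):
    """Return the compressed, encoded payload for one log event."""
    payload = gzip.compress(json.dumps(records).encode("utf-8"))
    return base64.b64encode(payload).decode("ascii")

# The logging call then looks something like:
#   log.info("TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[%s]]]"
#            % format_batch(batch))
```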

As these events are written to the log, we'd like to pick up the deltas and unpack the contents, creating one event per txn to send to Splunk. I have a simple Python script that does this; I just can't figure out how to plug it into Splunk's ingest pipeline so it runs only on the deltas.
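The script is roughly this shape (a simplified sketch; unpack_line is a hypothetical name, and it assumes the [[[...]]] payload gunzips to a JSON array of records):

```python
# Simplified sketch of the unpacking script (unpack_line is a
# hypothetical name; assumes the [[[...]]] payload gunzips to a JSON
# array of transaction records).
import base64
import gzip
import json
import re

PAYLOAD_RE = re.compile(r"\[\[\[([A-Za-z0-9+/=]+)\]\]\]")

def unpack_line(line):
    """Yield one JSON event per transaction record found in a log line."""
    match = PAYLOAD_RE.search(line)
    if not match:
        return
    records = json.loads(gzip.decompress(base64.b64decode(match.group(1))))
    for record in records:
        yield json.dumps(record)
```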

Scripted inputs would work, but I don't want to have to identify deltas in my own script; Splunk's great at that. And waiting until the log rolls over means an unwanted delay in getting the info into Splunk.
