Getting Data In

duplicate events: unarchive_cmd gets passed whole file, not just the delta since the last change

stu2
Explorer

The docs make it look like CHECK_METHOD = endpoint_md5 in props.conf should tell Splunk to only send deltas. But any time the source file changes, the whole file gets passed, and Splunk creates duplicates of previously indexed data.

Is this related to priority? If I don't set priority in props.conf (see below) my unarchive_cmd doesn't get run. If I DO set it as below, it runs BUT I get duplicates.

Here's my inputs.conf:

[monitor:///Users/stu/projects/splunk/compressed/txnsummaries.txt]
sourcetype = txn_summaries_st

And my props.conf:

[source::/Users/stu/projects/splunk/compressed/txnsummaries.txt]
invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/timings_filter.py
sourcetype = txn_summaries_st
NO_BINARY_CHECK = true
priority = 10
CHECK_METHOD = endpoint_md5

unarchive_cmd is otherwise doing what I'm looking for. I'm taking a single event containing many batched/compressed transaction timing records and breaking each record up into its own event. Splunk is correctly seeing the events I'm sending. However, whenever the source file changes, I get duplicates of previously indexed events.
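For context, unarchive_cmd behaves like a stream filter: Splunk pipes the whole source file to the command's stdin and indexes whatever the command writes to stdout, so the command itself never sees only the new portion. A minimal pass-through sketch of that contract (the actual splitting logic in timings_filter.py isn't shown in this thread):

# Minimal sketch of the unarchive_cmd contract: the entire file arrives
# on stdin, and every line written to stdout becomes indexable data.
import sys

for line in sys.stdin:
    # Real record-splitting logic (one event per transaction) would go here.
    sys.stdout.write(line)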

Hoping this isn't a limitation of unarchive_cmd.

Any ideas?

Thanks in advance


jrodman
Splunk Employee

I'm not really sure what you're doing here with unarchive_cmd, but it was built to handle things like gzip files. Those cannot work without getting the entire file.

Whether Splunk can do content tracking and pick up only the new records from the output of custom unarchive commands is a question I don't know the answer to. It's certainly not documented functionality, if it's possible at all.

The usual tools to build an input like this are scripted or modular inputs, but that puts the bookmark-tracking problem squarely on the input script.
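To illustrate what that bookmark tracking would look like, here is a rough sketch of a scripted input that checkpoints the byte offset it has already read. The checkpoint path and the rollover handling are assumptions for the example, not a Splunk-provided mechanism:

# Hypothetical scripted input: emit only the bytes appended to the log
# since the last run by checkpointing the last-read byte offset.
# LOG_PATH and CHECKPOINT are example paths, not Splunk-provided values.
import os
import sys

LOG_PATH = "/Users/stu/projects/splunk/compressed/txnsummaries.txt"
CHECKPOINT = "/Users/stu/projects/splunk/scripts/txnsummaries.offset"

def read_offset():
    try:
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    except (IOError, ValueError):
        return 0

def write_offset(offset):
    with open(CHECKPOINT, "w") as f:
        f.write(str(offset))

def main():
    offset = read_offset()
    if os.path.getsize(LOG_PATH) < offset:
        offset = 0  # file was truncated or rolled over; start again
    with open(LOG_PATH, "rb") as f:
        f.seek(offset)
        for line in f:
            # Anything a scripted input writes to stdout gets indexed.
            sys.stdout.write(line.decode("utf-8", errors="replace"))
            offset += len(line)
    write_offset(offset)

if __name__ == "__main__":
    main()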

The easiest solution is to just process your compressed data into uncompressed logs ahead of time.


stu2
Explorer

I agree, it would be far easier just to emit the logs with all contents uncompressed. But for performance reasons that's not a great option for our production environment. We periodically emit, from a background thread, a single log event with a field containing a gzip'd and base64-encoded collection of JSON records. The output looks something like this:

2014-10-04 11:37:27 [pool-2-thread-1] INFO :  TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[H4sIAAAAAAAAAOWYS09jRxBG/wq6a9eon9XV3hGwNBPNAGKMFCkaoeruaoQCJjImioT47ynCLHPhLry5zs6Pbln36NTj8/Ow+3vz+HR/z9tbeRyWvz8P......yG+8XHwAA]]] 

As these events are written to the log, we'd like to get the deltas and unpack the contents to create an event per txn that gets sent to Splunk. I have a simple Python script that does this; I just can't seem to figure out how to plug into Splunk's ingest pipeline so that it runs only on deltas.
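For reference, the unpacking step itself is simple. Below is a sketch of decoding one such line; the field marker and the [[[...]]] delimiters are inferred from the sample above, and the payload is assumed to decompress to a JSON array, so treat the details as guesses rather than the actual timings_filter.py:

# Sketch: unpack one JSON_GZIP_TXN_SUMMARIES line into one event per
# transaction record. Assumes the text between "[[[" and "]]]" is
# base64-encoded gzip of a JSON array -- inferred from the sample line,
# not confirmed in this thread.
import base64
import gzip
import json
import sys

MARKER = "JSON_GZIP_TXN_SUMMARIES"

def unpack(line):
    if MARKER not in line or "[[[" not in line:
        return
    payload = line.split("[[[", 1)[1].split("]]]", 1)[0]
    raw = gzip.decompress(base64.b64decode(payload))
    for record in json.loads(raw.decode("utf-8")):  # assumed: top-level JSON array
        yield json.dumps(record)                    # one event per txn record

if __name__ == "__main__":
    for line in sys.stdin:
        for event in unpack(line):
            print(event)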

Scripted inputs would work, but I don't want to have to worry about identifying deltas in my script - Splunk's great at that. And waiting until the log rolls over means an unwanted delay in getting the info into Splunk.
