I've created a custom command in Python that needs to view an entire set of events as a single batch, because it's comparing subsequent events. Unfortunately, Splunk is sending events to the custom command in chunks of <= 50,000 events. The commands.conf has streaming = false. Setting run_in_preview = false only changes the way the results are displayed, as expected.
In case it's relevant, the command is running on a search head which receives events from several distributed search nodes.
Here's the basic code -- run() is invoked by a minimal plugin "manager":
```python
class RemoteLogins( SplunkPlug ):
    def run( self, events, keywords, options ):
        out_events = []
        if not events:
            intersplunk.outputResults( out_events )
            return

        now = datetime.now()
        with open( "/opt/splunk/var/log/test.log", "a" ) as f:
            f.write( "Running at %s with %s events\n" % ( now, len( events ) ) )

        for related_events in self.related( events ):
            self.find_overlap( related_events, out_events )

        with open( "/opt/splunk/var/log/test.log", "a" ) as f:
            f.write( "Ending %s with %s results\n" % ( now, len( out_events ) ) )

        intersplunk.outputResults( out_events )
```
When invoked by a single splunk search, these results are generated:
```
Running at 2011-08-27 16:56:18.619245 with 25 events
Ending 2011-08-27 16:56:18.619245 with 0 results
Running at 2011-08-27 16:56:19.078111 with 2942 events
Ending 2011-08-27 16:56:19.078111 with 0 results
Running at 2011-08-27 16:56:20.900458 with 19980 events
Ending 2011-08-27 16:56:20.900458 with 1 results
Running at 2011-08-27 16:56:31.590848 with 50000 events
Ending 2011-08-27 16:56:31.590848 with 4 results
Running at 2011-08-27 16:56:55.376255 with 50000 events
Ending 2011-08-27 16:56:55.376255 with 3 results
```
Once the search is complete, only the 3 results from the last batch of events are shown.
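To illustrate why chunking is fatal for this kind of command (the chunk size and pairing rule below are illustrative only, not Splunk's actual behavior): any logic that compares *subsequent* events silently loses every pair that straddles a chunk boundary.

```python
def adjacent_pairs(events):
    """Return each event paired with the event that follows it."""
    return list(zip(events, events[1:]))

events = list(range(10))          # stand-in for 10 ordered events

# Processing the whole set at once sees every adjacent pair.
whole = adjacent_pairs(events)

# Processing in chunks of 4 (analogous to Splunk's <=50k-event chunks)
# loses the pairs that straddle chunk boundaries, e.g. (3, 4) and (7, 8).
chunked = []
for i in range(0, len(events), 4):
    chunked.extend(adjacent_pairs(events[i:i + 4]))

print(len(whole), len(chunked))   # 9 7
```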
For completeness, here's commands.conf:
```
[py]
type = python
filename = py.py
streaming = false
run_in_preview = false
maxinputs = 0
```
So, is there any way aside from the settings in commands.conf to really convince Splunk not to stream events into a custom command? Maybe an intermediate command I could insert into the pipeline?
Well, either someone else can spot what's missing or can confirm that it's a bug, but for the time being, an easy way to make sure no streamed chunks reach your command is to put a non-streaming command in front of it.
`<your search> | table * | py`
should do it.
Adding the non-streaming command does keep Splunk from sending multiple chunks of events to the custom script. Unfortunately, only the last 50k events are sent. Since I've asked for unlimited inputs (maxinputs = 0), this is almost certainly a bug.
It didn't work for me. I am using a dedup command in my search; the search scanned roughly 4,000 events, but the result set is only 11 events.
scanCount numbers below 5000 or 10000 can be pretty misleading. Splunk will pretty much always scan at least that deep into any search before potentially shutting down the stream, because that's the sort of "chunk" size that the search process uses when talking to the index. Or so I understand.
mute_dammit: yeah, once you're out of the streaming portion, I'm afraid 50,000 is the default in limits.conf. It can be changed, although that's to be filed under "do at your own risk"...
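For reference, the 50,000 cap referred to here is, as far as I know, the `maxresultrows` setting in limits.conf on the search head. Raising it would look something like the following (200000 is an arbitrary example value, and larger values mean more memory use per search):

```
[searchresults]
maxresultrows = 200000
```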
If you're writing custom search commands, the main update is that the Python SDK now offers significant support for doing so, which should let you work with the model without a lot of difficulty. The fundamental behavior of the interaction hasn't changed, to my knowledge.
Splunk did some work on long-running python processes a few releases ago, but I don't think we "leveraged" it for search commands.
That behavior doesn't seem right to me, but streaming=false was never intended to make Splunk deliver all the events to the search command regardless of event quantity. To my understanding, it is supposed to influence how the search machinery plans the search, and encourage it to give only one chunk to the search command.
Essentially, you could view this flag as "I'm only designed for small datasets".
In order to make your tool work over large datasets, you'll want to be streaming, and you'll want to be able to handle the data chunk by chunk.
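As a sketch of what "handle the data chunk by chunk" could look like for a subsequent-event comparison (plain Python; the `compare` callable is a hypothetical stand-in for the real overlap logic): carry the last event of each chunk forward so pairs that straddle a chunk boundary aren't lost.

```python
class PairwiseChunkProcessor:
    """Compare subsequent events across chunk boundaries by
    carrying the previous chunk's last event into the next call."""

    def __init__(self, compare):
        self.compare = compare   # hypothetical pairwise comparison function
        self._carry = None       # last event of the previous chunk

    def process_chunk(self, events):
        results = []
        prev = self._carry
        for event in events:
            if prev is not None:
                hit = self.compare(prev, event)
                if hit is not None:
                    results.append(hit)
            prev = event
        self._carry = prev       # remember state for the next chunk
        return results

# Example: flag consecutive events less than 60 seconds apart.
proc = PairwiseChunkProcessor(
    lambda a, b: (a, b) if b["_time"] - a["_time"] < 60 else None
)
out = []
out += proc.process_chunk([{"_time": 0}, {"_time": 100}])
out += proc.process_chunk([{"_time": 130}, {"_time": 500}])
print(len(out))  # 1 -- the (100, 130) pair straddles the chunk boundary
```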
For some problems that opens up an entire new topic: how can you store your state efficiently, is it valid to emit nothing until the last call, and how do you know when it's the last call?
Hi, any idea how we can determine which is the last call, so that we can emit nothing until then and collate all the results?
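I don't know of a documented flag for this, but one workaround is to spool each chunk to disk keyed by the search, and only emit once you believe you're on the final call. Everything below is an assumption for illustration: in a real command, `sid` and `is_final` would have to be derived from the settings your command receives (e.g. the header splunk.Intersplunk parses when enableheader = true), and `spool_dir` should be somewhere under the dispatch directory.

```python
import json
import os
import tempfile

def accumulate(sid, chunk, is_final, spool_dir):
    """Append a chunk of events to a per-search spool file. On the
    final call, return every event seen so far; otherwise return None."""
    path = os.path.join(spool_dir, "%s.spool" % sid)
    with open(path, "a") as f:
        for event in chunk:
            f.write(json.dumps(event) + "\n")
    if not is_final:
        return None                 # emit nothing on intermediate calls
    with open(path) as f:
        events = [json.loads(line) for line in f]
    os.remove(path)                 # clean up the spool file
    return events

# Simulate three invocations of the command for one search.
spool = tempfile.mkdtemp()
assert accumulate("sid1", [{"n": 1}], False, spool) is None
assert accumulate("sid1", [{"n": 2}], False, spool) is None
print(len(accumulate("sid1", [{"n": 3}], True, spool)))  # 3
```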