I have a custom search command which uses the streaming API to retrieve query results. Here's a snippet:
    results = csv.DictReader(sys.stdin)
    for r in results:
        resultsFile.write(r['_raw'] + '\n')
The problem is that I want to operate on the full set of results once streaming has completed (perform a POST on everything). But how can I, or the script, tell when Splunk is done streaming events?
Have a look at commands.conf; specifically, I imagine the streaming configuration parameter should be of interest to you.
    streaming = [true|false]
    * Specify whether the command is streamable.
    * Defaults to false.
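As a rough sketch, a stanza in commands.conf might look like the following (the command name `mycommand` and the script name are placeholders, not from your setup):

```
[mycommand]
filename = mycommand.py
streaming = true
```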
Is there a reason for not using the Intersplunk methods for getting results? I can honestly say I don't know the specifics of the different methods too well, but I imagine your search command would play more nicely with Splunk if it used the Intersplunk methods.
Nothing in the protocol for search commands tells the script when Splunk is done sending events.
A misconception about streaming commands is that if you define a command as streaming, Splunk will invoke it once, stream input events to that invocation, and asynchronously receive a stream of output from it. This is not the case. In current implementations, the script will be called multiple times, each time with some chunk of input events, and Splunk will expect each invocation to produce correct output for that chunk of input.
Specifying that a search command is streaming lets Splunk know that it's okay to do this: that the script will produce correct results if input is given to it incrementally, in sets of any size; that it will produce results corresponding to each increment; and that if input terminates at any point, the results produced up to that point are complete and accurate. It is therefore okay to call the script multiple times with incremental chunks of input. This boils down to saying that your command can work on a single event in isolation, without the context of prior or subsequent events.
Given this, Splunk does not expect to need to tell your script when it is done sending data, since for the purposes of collecting your script's output, it should not matter.
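To illustrate, here is a minimal sketch of a chunk-safe streaming command (the derived field `raw_len` is a made-up example, not anything from your command). Because every output row depends only on a single input event, the script produces correct results no matter how Splunk splits the input across invocations:

```python
# Sketch of a streaming-safe custom command. Splunk may invoke this
# script several times, each time with one chunk of events in CSV form
# on stdin; each invocation's output is correct for its chunk alone,
# because each event is transformed without context from other events.
import csv
import sys

def stream(infile, outfile):
    reader = csv.DictReader(infile)
    # Hypothetical derived field appended to the output header.
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ["raw_len"])
    writer.writeheader()
    for event in reader:
        # The new field depends only on this one event, never on its neighbors.
        event["raw_len"] = len(event["_raw"])
        writer.writerow(event)

# In the real command you would call: stream(sys.stdin, sys.stdout)
```

A command that needs to see all events before emitting anything (like your final POST) violates this contract, which is why it must be declared non-streaming.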
Unfortunately, non-streaming commands are limited to a single invocation of the script with a limit of 50,000 events. There are certainly cases where you'd want non-streaming commands that can handle more than 50k events, and certainly there are cases where you'd want your script to be able to receive the entire input as a single stream and produce results asynchronously (as some internal Splunk commands can do), but I believe the current custom search command interfaces don't allow this.
Thinking about it a bit more, a solution might be for you to make your command into a streaming "preop" command. Then, create a non-streaming command that requires and uses your streaming preop command, and have your final POST take place in the non-streaming command after all input has been consumed. You would then only invoke the non-streaming command, which would in turn call your preop.
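As a sketch of what that pairing might look like in commands.conf (the command and script names here are placeholders, and you should verify the exact setting names against your version's commands.conf spec):

```
[mystreamcmd]
filename = mystreamcmd.py
streaming = true

[mypostcmd]
filename = mypostcmd.py
streaming = false
streaming_preop = mystreamcmd
```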
Please be aware of the effect that non-streaming commands have on map-reduce/distributed queries. The non-streaming command and anything in the search pipeline after it is not map-reduced, but run only on the search head. (The streaming preop would be distributed, as long as everything before it is streaming.)
Thank you, Gerald. I thought to try the same thing this morning and will test and post results here. We are using the streaming command precisely because Intersplunk limits the amount of data returned.