I've built a custom search command for unescaping values from URL strings.
This is its commands.conf entry:
```
[unescape]
filename = unescape.py
streaming = true
```
However, I'm not convinced that this is actually 'streaming' per se. This is how I've built the script, per all the documentation I could find:
```python
import csv
import sys

import splunk.Intersplunk

results = []
for row in csv.DictReader(sys.stdin):
    # do some work on the row
    results.append(row)

splunk.Intersplunk.outputResults(results)
```
Obviously this isn't using a generator pattern: the loop over the rows blocks until it has consumed everything on stdin, and only when the pipe is closed and input to the script is finished does it output anything.
I understand that the reference to streaming really just means that the script doesn't have to operate over the whole data set - that it's essentially a map function. That's fine, but I'm seeing some serious performance issues that I think are related to my script blocking while a huge data set is copied into memory in the Python process and held there to do some trivial work; meanwhile, as far as I can tell, the rest of the search pipeline grinds to a halt.
Is it possible to output results as they arrive and are processed? Can I output the well-formed CSV data within my loop over the data set, so that I can use a generator pattern? Will Splunk see that data coming out and work with it as it comes, or will it still wait for my script to exit before moving to the next stage in the pipeline?
No, it is not streaming in the sense you mean. Rather, the script will be invoked multiple times, with Splunk synchronously waiting for the output of each invocation, and each invocation receiving up to 50k events. This is governed by maxinputs in commands.conf, so if you're having memory issues on a single invocation, try lowering maxinputs. (Note that maxinputs will never exceed maxresultrows in limits.conf.)
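For example, the command's commands.conf stanza could cap each invocation at a smaller batch (10000 here is just an illustrative value, not a recommendation):

```ini
[unescape]
filename = unescape.py
streaming = true
maxinputs = 10000
```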
However, Splunk does move data down the search pipeline after it gets results from each invocation, so you don't have to wait for every invocation to finish before you start seeing results from some.
I guess "streaming" more indicates to Splunk that your script accepts and produces "streamable" data, which basically means that it can operate and produce results on each event independently of other events. It does not mean that your script gets a stream of input and that it can produce an asynchronous stream of output.
It would be really awesome if you could use a generator though, I agree.
Yeah, I guess I was just confused by the use of the word 'streaming'.
It's not so much that I'm hitting discrete memory issues as an overall big performance hit. I'm sure some of this is due to the combined overhead of spawning the Python VM and having it parse up to 50k rows of results with a DictReader before it can output anything at all. Maybe I'm underestimating how much Python fundamentally slows things like this down. Given what I'm seeing, I don't see how anyone can use Python search commands in production. I wonder if I could somehow get it to use PyPy..
dbryan, one thing you could do is output each result as soon as it's processed, rather than waiting until you're done. So instead of looping over everything and then writing to stdout at the end, just output immediately in each iteration. That would make it more streaming in the usual sense.
In case anyone is interested, I've subsequently reimplemented the Intersplunk.py library with a truly 'streaming' interface. I don't know what kind of buffering behaviour Splunk uses when reading the script's stdout, so it may not cause data to propagate down the pipeline any more quickly, but it definitely improves the performance and memory profile of the script itself significantly. All you need to do is read the headers and the first line to get the fieldnames, then use a generator to yield one line at a time to a function that handles a single line, then write the result to standard out. Obviously this doesn't work with commands that need to look at the whole data set - but it does work with commands that reduce or increase the number of events.
This is the API for it, basically:
```python
import newSplunk

def handler(line):
    # apply business logic here
    return line

newSplunk.process_splunk(handler_fn=handler)
```
About a day's work, and the result is highly performant, easy custom commands with no need to mess around with manually outputting headers and all that business. This makes custom Python commands far more viable, and not significantly slower than internal / native search commands!