Are custom search commands truly 'streaming'?

Path Finder

I've built a custom search command for unescaping values from URL strings.

This is its commands.conf entry:

[unescape]
filename = unescape.py
streaming = true

However, I'm not convinced that this is actually 'streaming' per se. This is how I've built the script, per all the documentation I could find:

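import csv, sys
import splunk.Intersplunk

# Materialize the rows: a DictReader can only be iterated once
results = list(csv.DictReader(sys.stdin))
for row in results:
  pass  # do some work on the row
splunk.Intersplunk.outputResults(results)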

Obviously this isn't using a generator pattern - the loop over results will block until it's gone over everything in stdin, and only when the pipe is closed and input to the script is finished will it output anything.

I understand that the reference to streaming really just means that the script doesn't have to operate over the whole data set - that it's a map function, essentially. That's fine, but I'm seeing some serious performance issues that I think may be related to my script blocking while a huge data set is copied into memory in the Python process and held there for a while to do some trivial work. Meanwhile, the rest of the search pipeline has seemingly ground to a halt, as far as I can tell.

Is it possible to output results as they arrive and are processed? Can I output the well-formed CSV data within my loop over the data set, so that I can use a generator pattern? Will Splunk see that data coming out and work with it as it comes, or will it still wait for my script to exit before moving to the next stage in the pipeline?

1 Solution

Splunk Employee

No, they are not streaming in the sense you mean. Rather, the script is invoked multiple times, with Splunk synchronously waiting for the output of each invocation, and each invocation receives up to 50k events. This is governed by maxinputs in commands.conf, so if you're having memory issues on a single invocation, try lowering maxinputs. (Note that maxinputs won't ever exceed maxresultrows in limits.conf.)

However, Splunk will move data down the search pipeline after it gets results from each invocation, so you don't have to wait for all results from every invocation to start seeing results from some.

I guess "streaming" more indicates to Splunk that your script accepts and produces "streamable" data, which basically means that it can operate and produce results on each event independently of other events. It does not mean that your script gets a stream of input and that it can produce an asynchronous stream of output.
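To make the "streamable" idea concrete, here is a small sketch. The helper names unescape_row and run_in_chunks and the chunk size of 3 are made up for illustration; they only stand in for maxinputs and Splunk's repeated invocations. Because the work on each event is independent, processing the events in chunks should give the same results as processing them in one pass:

```python
from urllib.parse import unquote


def unescape_row(row):
    """Per-event work: the result depends only on the row passed in."""
    return {key: unquote(value) for key, value in row.items()}


def run_in_chunks(rows, chunk_size):
    """Mimic repeated invocations: each pass sees at most chunk_size rows."""
    out = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # Each chunk is processed with no knowledge of the other chunks.
        out.extend(unescape_row(row) for row in chunk)
    return out


rows = [{"url": "a%20b"}, {"url": "c%2Fd"}, {"url": "e%3Df"}, {"url": "plain"}]
whole = [unescape_row(row) for row in rows]
chunked = run_in_chunks(rows, chunk_size=3)
print(whole == chunked)
```

A command that needed to see all events at once (a sort or a count, say) would not have this property, which is why it could not be marked as streaming.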

It would be really awesome if you could use a generator though, I agree.

Path Finder

In case anyone is interested, I've subsequently reimplemented the Intersplunk.py library with a truly 'streaming' interface. I don't know what kind of buffering behaviour Splunk uses in reading the script's stdout, so it may not cause data to propagate further down the pipeline any more quickly, but it definitely improves the performance and memory profile of the script itself significantly. All you need to do is read the headers and the first line to get the fieldnames, then use a generator function to yield a line at a time to a function that handles a single line, then output the result to standard out. Obviously this doesn't work with commands that need to look at the whole data set - but it does work with commands that reduce or increase the number of events.

This is the API for it, basically:

import newSplunk

def handler(line):
  # apply business logic here
  return line

newSplunk.process_splunk(handler_fn = handler)

About a day's work and highly performant, easy custom commands with no need to mess around with manually outputting headers and all that business. This makes custom Python commands far more viable, and not significantly slower than internal / native search commands!
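The newSplunk module itself isn't posted in this thread, so the following is only a rough sketch of how a handler-based interface along those lines might look, not the actual implementation. The name process_stream and its infile/outfile parameters are invented for the example, and it assumes the input is plain CSV with no Splunk header block in front of it (for instance with enableheader = false in commands.conf). A handler may return None to drop an event or a list to emit several:

```python
import csv
import io
import sys


def process_stream(handler_fn, infile=sys.stdin, outfile=sys.stdout):
    """Read CSV rows one at a time, pass each to handler_fn, write results.

    handler_fn may return a dict (one event), a list of dicts (several
    events), or None (drop the event).
    """
    reader = csv.DictReader(infile)
    writer = None
    for row in reader:  # DictReader is lazy: one row is parsed per iteration
        result = handler_fn(row)
        if result is None:
            continue
        events = result if isinstance(result, list) else [result]
        for event in events:
            if writer is None:
                # The first emitted event decides the output field names.
                writer = csv.DictWriter(outfile, fieldnames=list(event),
                                        extrasaction="ignore", restval="")
                writer.writeheader()
            writer.writerow(event)
        outfile.flush()


def handler(line):
    if line["status"] == "200":
        return None  # drop successful requests
    line["is_error"] = "yes"
    return line


# Demo with in-memory streams instead of stdin and stdout.
source = io.StringIO("status,uri\n200,/a\n500,/b\n")
sink = io.StringIO()
process_stream(handler, source, sink)
print(sink.getvalue())
```

Only one input row is held in memory at a time. One consequence of this design is that fields first introduced by a later event are silently dropped, since the header has already been written by then.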

Super Champion

Is this something you'd be willing to post?

Path Finder

I would also really like to see that implementation... any chance you could open source this?

Path Finder

If you need to implement an external command using Python generators, look into the Splunk SDK for Python: http://dev.splunk.com/view/python-sdk/SP-CAAAEU2

Engager

@ineeman How would you do that? Just put splunk.Intersplunk.outputResults(results) in the for loop?

Splunk Employee

dbryan, one thing you could do is output each result right away rather than waiting until you are done. So rather than looping over everything and then outputting to stdout, you can just output immediately in each iteration. That should make it more streaming in the normal sense.
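As a rough sketch of what outputting in each iteration could look like, the example below uses the standard csv module rather than splunk.Intersplunk, writing the header once and then one row per iteration. The function names, the "url" field, and the sample data are made up for the example, and it again assumes plain CSV input with no Splunk header block. Whether Splunk picks up the rows any sooner is not something this thread confirms:

```python
import csv
import io
from urllib.parse import unquote


def unescaped_rows(rows, field):
    """Generator: yield each row as soon as it has been processed."""
    for row in rows:
        if row.get(field):
            row[field] = unquote(row[field])
        yield row


def run(infile, outfile, field="url"):
    reader = csv.DictReader(infile)
    # Reading reader.fieldnames consumes only the header line.
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames or [],
                            extrasaction="ignore")
    writer.writeheader()
    for row in unescaped_rows(reader, field):
        writer.writerow(row)  # write inside the loop, not after it
        outfile.flush()


# Demo with in-memory streams; a real command would pass sys.stdin and sys.stdout.
source = io.StringIO("url,host\nhttp%3A%2F%2Fexample.com%2Fa%20b,web01\n")
sink = io.StringIO()
run(source, sink)
print(sink.getvalue())
```

The header is written exactly once, before the loop, so each iteration only has to emit a single data row.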

Path Finder

Yeah, I guess I was just confused by the use of the word 'streaming'.

It's not so much that I'm getting discrete memory issues as an overall big performance hit. I'm sure that some of this is due to the combined overhead of spawning the Python VM and having it parse up to 50k rows of results with a DictReader before it can output anything at all. Maybe I'm underestimating how much Python would fundamentally slow things like this down. Given what I'm seeing, I don't see how anyone can use Python search commands in production. I wonder if I could somehow get it to use PyPy...
