Developing for Splunk Enterprise

Do custom search commands have worse performance than external lookup scripts?

Communicator

I wrote two versions of the same Python streaming command: one as a simple external lookup script, and one as a full custom search command (using V2 of the custom search command protocol). I tested the performance of both commands, and found that the external lookup script was much faster.... which is highly counter-intuitive.

Why might this be? Is there a reason custom search commands could actually be slower than equivalent external lookup scripts?

Here is a break-down of the two command versions.

1. External Lookup Script
Loads geoip database into memory with MEMORY_CACHE flag
Uses the csv module to read events from std_in
Performs a geoip lookup on each event's ip field, stores result in new field
Writes each line (event) back to stdout using csv module

2. Custom Search Command with V2 Protocol
Loads geoip database and defines custom streaming command like so:

#create GeoIP instance with Memory Cache
geoip_db = pygeoip.GeoIP(ISP_DB_PATH, pygeoip.const.MEMORY_CACHE)

@Configuration()
class ipasnCommand(StreamingCommand):

    def stream(self, events):
        #transform each event in the chunk
        for event in events:
                     ...... [lookup logic goes here]
                     yield event

dispatch(ipasnCommand, sys.argv, sys.stdin, sys.stdout, __name__)

Note that both command versions are written in Python, use the same geoip lookup library with the same caching flag, and make the same lookup function calls.

Also note that while the custom streaming command is only dispatched/invoked once and events are passed in chunks, Splunk seems to re-invoke the external lookup script every 255 events . . . which means the geoip database gets reloaded and caching is wiped out, leading one to hypothesize that the external lookup version should perform much worse.

However, multiple trials confirm that when given 1 million events to process, the custom search command takes an average of 00:09:10, while the external lookup can do it in 00:07:06.

I was disappointed to observe such a large performance deficit from the custom search command, despite all the supposed advantages. Does anyone have some insight into what could be causing this? Is this performance gap to be expected?

1 Solution

SplunkTrust
SplunkTrust

Have you considered just using a .csv CIDR-based lookup? With a tiny bit of limits.conf memory tweaking, that .csv will fit in memory and will distribute to indexers.
A quick test on my laptop:

| makeresults count=1000000 | eval record_ip = random() | eval record_ip = record_ip%256 . ".". floor(record_ip/256)%256 . ".". floor(record_ip/256/256)%256 . ".". floor(record_ip/256/256/256)%256
| lookup ip2asn cidr as record_ip | stats count max(asn) min(asn)

22.16    command.lookup

22 seconds for 1M lookups, single splunk instance, dual-core windows laptop.
I used this database for testing: http://lite.ip2location.com/database/ip-asn - the cidr, asn, as fields together are 28MB. If you can do without the as that'd shrink considerably again.

transforms.conf
[ip2asn]
filename = IP2LOCATION-LITE-ASN.csv
match_type = cidr(cidr)

Edit: On a fast ubuntu box, 1M records need 13s for the lookup, 5M records need 50s.

View solution in original post

SplunkTrust
SplunkTrust

I used the free IPv4 CSV button on the page I linked to, plain zipped CSV file. All I did was trim ip_from and ip_to fields because Splunk can work directly off the cidr.

0 Karma

SplunkTrust
SplunkTrust

Have you considered just using a .csv CIDR-based lookup? With a tiny bit of limits.conf memory tweaking, that .csv will fit in memory and will distribute to indexers.
A quick test on my laptop:

| makeresults count=1000000 | eval record_ip = random() | eval record_ip = record_ip%256 . ".". floor(record_ip/256)%256 . ".". floor(record_ip/256/256)%256 . ".". floor(record_ip/256/256/256)%256
| lookup ip2asn cidr as record_ip | stats count max(asn) min(asn)

22.16    command.lookup

22 seconds for 1M lookups, single splunk instance, dual-core windows laptop.
I used this database for testing: http://lite.ip2location.com/database/ip-asn - the cidr, asn, as fields together are 28MB. If you can do without the as that'd shrink considerably again.

transforms.conf
[ip2asn]
filename = IP2LOCATION-LITE-ASN.csv
match_type = cidr(cidr)

Edit: On a fast ubuntu box, 1M records need 13s for the lookup, 5M records need 50s.

View solution in original post

Communicator

Nice, I'm running the same thing on one of my large datasets right now, and the results are looking pretty good. Even the premium MaxMind databases were coming up with misses for about 1/3 of the IP's I check, and I'm seeing similar results from the IP2Location database, but with better performance. And it distributes just like it's supposed to!

Shall we transform one of the comments in this chain into a formal answer to the original question? I think we've pretty much concluded that yes, custom search commands do have worse performance, due to the implementation differences between the custom command protocols and the dynamic lookup protocols. Primarily:

  • External lookups can assume idempotence
  • Custom Search Commands are limited to one streaming process
  • External lookups are free to spin up many processes in parallel

SplunkTrust
SplunkTrust

Duly converted 😄

0 Karma

Communicator

Very cool! I had considered testing out ip2location a while ago, but already had a copy of the premium MaxMind databases and wanted to exhaust that option first. Did you use the Lite version of ip2location or the commercial version in your tests above? And which database, by the way? It looks like there is a free Lite IP-ASN database, but the closest thing I can find in the premium section is "IP-Country-ISP Database"... and it's unclear exactly what fields that would contain.

0 Karma

Communicator

By the way, are we clear on whether the dynamic/external lookup protocol is creating new processes or new threads? From what you mentioned about Python overhead on Windows, I'm assuming it's creating separate full Python processes for each chunk of data? Is there any way we could safely write a custom search command using the V2 protocol that uses threads instead of processes? Or maybe a specific number of processes that all stay alive and get eventset chunks via IPC in a round-robin fashion?

Depending on the answer to that question, now that we know all this, what do you think the bottom line is for a scenario where you need to perform hundreds of millions of ASN lookups in order to build a set of machine learning model features? Specifically, one of the features is distinct ASN count by domain, and there could be over 10 million domains, with over 200 million total events. Assume the environment is capable of distributed search over 5-10 indexers.

Would you go with the process-spam lookup method or the slow-as-death custom search command protocol? Any concerns about eating up too much of the host resources if you know the indexers are pretty busy already? How about if the environment is all Windows? Even possible then with the extra Python process creation overhead?

So I guess you could say the real question now is: is any algorithm like this going to be truly scalable if there is no way to do all of the following:

  • Be compatible with distributed search
  • Spin up a specific number of processes on each node and keep them all alive for geoip caching purposes
  • Pass chunks of data through each process on each host via IPC in a similar manner to the single-threaded V2 protocol

Would you even attempt to deploy an analytic that requires this many ASN lookups for a large enterprise if those conditions listed above can't be satisfied?

0 Karma

SplunkTrust
SplunkTrust

So let's establish some invariants.

  1. The custom command protocol - whether V1 or V2 - must pass all search results with all fields and values out to the custom command and back again. Because Splunk cannot predict what your custom command is going to do, this is necessary. If you are passing several thousand results out and back, this could be a substantial amount of data.
  2. External lookups can (and do) pass less information out to the external command and back. Splunk knows this is a lookup use case and - by definition - a lookup should be idempotent. Think of a lookup as a function f(x,y,z) for field values x, y, and z - the lookup should always return the same result for the same three values of x, y, and z. Not only are the inputs to an external lookup reduced to a distinct set of values for the input fields (and only the input fields) -- the outputs can be cached. Yes, Splunk will start more processes - probably due in part to preview during the display of events.
  3. Threads vs processes is a non-starter. Your code does not run within Splunk's address space - you have a process boundary between you and the search process for a reason. Any improvement to the interaction between external enrichment features - be they lookups, or custom commands - is going to require engineering changes in Splunk which will take time and analysis.

I don't know what problem you're trying to solve, but I think the answer to your question is not "which of these is fastest" but "which is the most accurate?". If your relationship between your inputs and your outputs is truly idempotent, then dynamic lookups are your most correct answer. If it's not idempotent - or you need every result row - then a custom search command may be the better choice.

SplunkTrust
SplunkTrust

HAH!!

...my random IP generator doesn't appear to generate random IPs O_O

...but this does:

| makeresults count=50000 | eval record_ip = random() | eval record_ip = record_ip%256 . ".". floor(record_ip/256)%256 . ".". floor(record_ip/256/256)%256 . ".". floor(record_ip/256/256/256)%256
| lookup ip_asn_dat_lookup ip as record_ip OUTPUT asn as record_asn asn_error as record_asn_error debug
| stats count dc(debug) max(record_asn) min(record_asn)

Now I see the python process spawning behaviour as expected, and a terribly slow lookup... 80 seconds for 50k rows, matching the massive python spawning penalty on windows.

SplunkTrust
SplunkTrust

Well based on my old windows box, the lookup "spam method" is not only faster but also uses fewer cpu cycles.

0 Karma

SplunkTrust
SplunkTrust

Confession, TL;DR ... I didn't read all the above. So if you've covered this, then my apologies.

One big difference in the "dynamic lookup" protocol and the "custom command" protocol is that Splunk knows one is doing a lookup and the other might not be. A dynamic lookup will be handed on stdin basically a values(input_fields) one entry for all distinct combinations of input fields. Splunk can then take the dynamic lookup's output and apply it at needed to the result rows.

A custom command has to be handed ALL of the result rows regardless - because Splunk can't guarantee what the custom command is going to do.

Also don't forget the impact of file cache on restarting the lookup child process more than once. That could make a positive performance difference.

Communicator

Nice analysis.

I agree with your cookie distinct count of 2 for the above query, but my actual search query is still seeing a huge distinct count:

index="scale_med"
| rex "^dns,(?<dns_server>\S+),(?<log_file>\S+),(?<log_type>\S+),.*:\s+(?<timestamp>\S+\s+\S+)\s+client\s+(?<client_ip>\S+)#(?<client_port>\d+).*query:\s+(?<domain>\S+)\s+IN\s+(?<query_type>\S+)\s+response:\s+(?<response_code>\S+)\s+[+-]\S*,?\s?(?<raw_response>.*)" 
| where response_code="NOERROR" and query_type="A" 
| rex field=raw_response max_match=1000 "(?<record_name>\S+)\.\s+(?<ttl>\S+)\sIN\s(\w\s)?(?<record_type>(A))\s(?<record_ip>[^\s;]+)(;\s)?" 
| where isnotnull(record_ip) 
| eval _time = strptime(timestamp,"%d-%b-%Y %H:%M:%S.%f") 
| fields client_ip, domain, response_code, record_name, ttl, record_ip, eventtype
| eval zip=mvzip(record_name,mvzip(ttl,record_ip))
| mvexpand zip 
| rex field=zip "(?<record_name>[^,]*),(?<ttl>[^,]*),(?<record_ip>.*)" 
| fields - zip
| head 500000
| lookup ipasn_dat_lookup ip as record_ip OUTPUT asn as record_asn asn_error as record_asn_error cookie local=true
| stats count dc(cookie) max(record_asn) min(record_asn)

count   dc(cookie)  max(record_asn) min(record_asn)
500000  1987    393342  -2

I wonder what about these two search pipelines is causing such different behavior?

I believe there may be some JSON involved in the V2 protocol's IPC, and the documentation indicates that it spins up one single process and streams all the events through. Clearly, this differs from the external lookup method, where we're seeing what looks like multiple processes. Would you agree that there should be one distinct cookie per process? If that is true, the external lookup for my query above must be creating tons and tons of processes for some reason(?)

All in all, I knew that the MaxMind lookup would be the most taxing part of the custom scripts. I had assumed that the true streaming interface of the V2 protocol would help me take advantage of the MaxMind API's GeoIP caching flag, but would you conclude that the external lookup's manner of spamming the system with new processes every 255 events is what happens to get the job done faster, even without caching?

0 Karma

SplunkTrust
SplunkTrust

For 500k random IPs, I only get two debug values:

| makeresults count=500000 | eval record_ip = (random()%256).".".(random()%256).".".(random()%256).".".(random()%256)
| lookup ip_asn_dat_lookup ip as record_ip OUTPUT asn as record_asn asn_error as record_asn_error debug
| stats count dc(debug) max(record_asn) min(record_asn)

count   dc(debug)   max(record_asn) min(record_asn) 
500000  2   199473  -1

I uncommented the cookie parts, imported random, commented out unneeded imports - the debug column contains values like 0.23543456356 so that feels correct.
Note, commenting out the splunklib.searchcommands import has significantly boosted the lookup's speed 😄 it now does 50k random IPs in under a second, was 1.7 or so.

The MaxMind lookups are identical, without profiling the only blame can be assigned on the input-output protocol being different. The lookup protocol is as simple as a textual stream can be, not sure about the SDK search command protocol... could there be json involved?

I'm not familiar enough with the SDK search commands to immediately see why, but do read back that I couldn't put more than 50k events through the command. That can't be right.

Doing some very basic profiling:

start = time.time()
org = geoip_db.org_by_addr(ip)
event["cookie"] = time.time()-start

| makeresults count=50000 | eval record_ip = "8.8.8.8" | ipasndat | stats count sum(cookie) values(asn)

That cookie sum is almost 9 seconds, with about 11 shown by Splunk for the command. With some rounding errors and other overhead that's basically all the time taken by the command... so from where I'm sitting, something in the geoip lookup is eating up your cycles. It's not the SDK search command input/output.

Doing the same basic profiling for the lookup I get a sum of 3.9 for a total of 1.1... I have 4 cores 😄
It seems the lookup is running in parallel while the search command is not? Wouldn't explain the entire difference though...
With 500k instead of 50k I get odd results though, a sum of 50 for a total of 4... I don't have 12 cores 😞 though it might be throwing more threads at my cores, resulting in a longer wallclock per thread without a big impact on the overall runtime.

Communicator

Thanks Martin! I appreciate you taking the time to look into this with me.

It is good to see that you got similar performance results, but still disappointing that the custom search command is so slow. A couple questions about your comments:

You say that there doesn't seem to be excessive re-launching of python every 255 events, but what led me to that conclusion is the code I inserted to add a "cookie" field, which is a random number computed at the top of each python script. That code might actually be commented out in the ipasn_dat_lookup.py script right now, but you should be able to comment it back in (in that particular script, it might be called the "debug" field, rather than "cookie". sorry for the inconsistency there). I ran the lookup scripts on 1,000,000 events, and stats showed that about (1,000,000/255) unique values for the "cookie" (or "debug") fields existed for the lookup versions of the commands. For the custom search commands, there was only one unique value. So, I concluded that the external lookup scripts were getting re-invoked every 255 events, while the custom search commands were only dispatched once and streamed events in chunks. Would you agree with this line of reasoning?

If the old external lookup method really is faster, I would like to find a definitive answer why.

The best information I could find on the inner workings of traditional Splunk streaming commands was this post from 2012, but it doesn’t explicitly say how much parallelization there is, nor does it say if that applies to external lookup commands. Overall, nothing has explained why my results differ so radically from the performance benchmarks in this powerpoint on the V2 protocol.

Now that you have reviewed my code, note that the MaxMind lookups are being performed the exact same way across both the custom command and external lookup versions. So, with your comment about something "inside the custom command [being] terribly inefficient?" -- do you mean that the custom command protocol itself is tremendously inefficient? If so, is that something we should try to report to the Splunk Dev's? Either way, if external lookups are always going to be this much faster than custom commands, it seems like that should be stated somewhere.

This question as a whole is actually closely related with another I have out there. Since you already have the code, I was hoping you might be able to take a look at this:
https://answers.splunk.com/answers/496840/custom-streaming-command-wont-distribute-to-indexe.html

Try as I may, that custom command refuses to distribute to indexers. I feel like an answer to that question may be more achievable, and I would very much like to find out what the problem is. I have implemented versions with the V2 protocol and the old Intersplunk way of doing things, and the streaming command isn't showing up in remoteSearch either way.

0 Karma

SplunkTrust
SplunkTrust

This looks like it's working \o/

I'm running this on a standalone old windows box, using this first approach - note, the custom command appears to be limited to 50000 results over here:

| makeresults count=50000 | eval record_ip = "8.8.8.8"  
| lookup ip_asn_dat_lookup ip as record_ip OUTPUT asn as record_asn asn_error as record_asn_error debug
| stats count values(asn) values(record_asn)

I get 1.5 seconds in total, of which 1.0 for the lookup (command.lookup).

| makeresults count=50000 | eval record_ip = "8.8.8.8"  
| ipasndat
| stats count values(asn) values(record_asn)

I get 11.7 seconds in total, of which 10.3 for the command (command.ipasndat + dispatch.evaluate.ipasndat).

The disadvantage of the custom command is much greater here - but then I've been cheating, reusing the same ip every time. The external lookup facility likely is caching things here... to random ips!

...nope! By doing (random()%256).".".(random()%256).".".(random()%256).".".(random()%256) I can slow down the lookup, but only from 1.0s to 1.7s. The custom command doesn't slow down, it might even be a bit quicker with random IPs... but still almost an order of magnitude slower for me.

Looking at the processes running during the custom command search, I see two python PIDs pop up with one disappearing again quickly and the other sticking around for those ten seconds, occupying one core completely. Hence I'd say it isn't spawning new pythons, but instead something inside the custom command is terribly inefficient.
Looking at the 1.x second lookup is less fun, so I liberally added a zero to the makeresults (no constraining 50k here)... at 500k random IPs, there is a small blip of a python process, and the lookup only shows 4.3s - 10x the results, 3x the run time. In the millions the runtime blows up again, but I also see memory constraint issues with generating the data on this box...

6.5.2 here.

...so, what did we learn? Not a lot. The lookup's faster on my end too. There doesn't seem to be excessive re-launching of python every 255 events for the lookup, that would show up as a huge penalty on windows. Looking at the two .py files, I'd guess the simple CSV-based protocol of the lookup might be significantly faster than the fancy SDK protocol for searchcommands. To be sure, you'd need to do a spot of python profiling.

Communicator

Sorry for the trouble, Martin. It looks like I had been playing with the dat command's decorators during testing and forgot to change them back.

Here's a new version, with the underscores ripped out:
ipasn

(side note: do you happen to know why the underscores were causing so much grief?)

I tested your 8.8.8.8 lookup command right before packaging using "eval ... | ipasndat" and it worked fine on this version.

If you encounter any other errors, one thing to try from the command line is:

/opt/splunk/bin/splunk cmd python ipasn_dat.py < ipasn.csv

With the V2 protocol, I don't think that will yield any actual results like the lookup version of the commands would, but it might at least bring any import errors or things like that to light.

One last note, when I first installed these commands a couple weeks ago, I initially got that "error code 1" message several times, and then all of a sudden the commands mysteriously started working. I have seen similar behavior when installing the official GeoASN app as well.

Let me know how it goes this time! I'm assuming there must be a better way to propagate errors up through to the job inspector than what I have now? I'll look into that, but hopefully the new package will work better already.

0 Karma

SplunkTrust
SplunkTrust

This works:

| makeresults | eval ip = "8.8.8.8" | lookup ip_asn_dat_lookup ip

Yields 15169 correctly.

However, I can't seem to get the command to work.

| makeresults | eval record_ip = "8.8.8.8" | ipasn_dat
Error in 'ipasn' command: External search command exited unexpectedly with non-zero error code 1.

It appears to run ipasn and ignores the _dat part... that's probably why you called the command ipasn instead of ip_asn? 😄
(side note - it referred to a file name ipasn_dat.py which doesn't exist, again the underscores...)

After fixing myself a commands.conf definition without underscores, I get this:

ERROR ChunkedExternProcessor - Error in 'ipasndat' command: External search command exited unexpectedly with non-zero error code 1.

Don't seem to be able to find logs from the command 😞

0 Karma

Communicator

Hey Martin,

Here's a packaged version of both apps:
ip_asn_lookup.spl
ip_asn.spl

Note that there are two commands in each app. The lookup version has ip_asn_lookup and ip_asn_dat_lookup, while the custom search command has ip_asn and ip_asn_dat. The "dat" command versions query the legacy MaxMind binary database, while the others query the newer ".mmdb" format. My results have shown that the legacy API is actually faster, so whichever commands you use for testing, make sure to use the ones with equivalent MaxMind API's.

Also on that topic, the "lookups" folders originally contained the paid versions of the MaxMind databases: GeoIP2-ISP.mmdb, and GeoIPASNum.dat. Since I am not free to distribute those files, I have removed them from both apps, and replaced them with just the Lite version of GeoIPASNum.dat (I couldn't find a lite version of the ISP database at the moment). So, without adding your own copy of GeoIP2-ISP.mmdb to the lookups folders, only the "_dat" commands will work.

The custom search commands assert that a field named "record_ip" exists in each event, upon which the ASN lookup is performed. Meanwhile, the external lookups operate on just an "ip" field, as defined in transforms.conf.

Let me know if you have any questions, and I look forward to hearing back.

0 Karma

SplunkTrust
SplunkTrust

Can you post both working examples to take a look, ideally as a self-contained .spl package?

0 Karma