I wrote two versions of the same Python streaming command: one as a simple external lookup script, and one as a full custom search command (using V2 of the custom search command protocol). I tested the performance of both and found that the external lookup script was much faster, which is highly counterintuitive.
Why might this be? Is there a reason custom search commands could actually be slower than equivalent external lookup scripts?
Here is a breakdown of the two command versions.
1. External Lookup Script
Loads the geoip database into memory with the MEMORY_CACHE flag
Uses the csv module to read events from stdin
Performs a geoip lookup on each event's ip field, stores result in new field
Writes each line (event) back to stdout using csv module
2. Custom Search Command with V2 Protocol
Loads geoip database and defines custom streaming command like so:
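import sys
import pygeoip
from splunklib.searchcommands import dispatch, StreamingCommand, Configuration

# Create the GeoIP instance with the memory cache flag
geoip_db = pygeoip.GeoIP(ISP_DB_PATH, pygeoip.const.MEMORY_CACHE)
@Configuration()
class ipasnCommand(StreamingCommand):
    def stream(self, events):
        # Transform each event in the chunk
        for event in events:
            ...  # [lookup logic goes here]

dispatch(ipasnCommand, sys.argv, sys.stdin, sys.stdout, __name__)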
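For comparison, the external lookup script from item 1 can be sketched roughly as follows. This is a minimal sketch, not the actual script: the function name run_lookup, the field names "ip" and "asn", and the injected lookup callable are all assumptions made for illustration. In the real script, infile and outfile would be sys.stdin and sys.stdout, and lookup would wrap whatever pygeoip call the script makes.

```python
import csv


def run_lookup(infile, outfile, lookup, ip_field="ip", out_field="asn"):
    """Copy CSV rows from infile to outfile, filling out_field from lookup(ip).

    All names here are hypothetical; `lookup` stands in for the geoip call.
    """
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames or [])
    if out_field not in fieldnames:
        fieldnames.append(out_field)
    writer = csv.DictWriter(
        outfile,
        fieldnames=fieldnames,
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    for row in reader:
        ip = row.get(ip_field)
        # Leave the output field empty when there is no IP to look up.
        row[out_field] = lookup(ip) if ip else ""
        writer.writerow(row)
```

Because the whole script runs once per invocation, anything done before the read loop (such as loading the database) is paid again every time Splunk re-invokes it.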
Note that both command versions are written in Python, use the same geoip lookup library with the same caching flag, and make the same lookup function calls.
Also note that while the custom streaming command is only dispatched/invoked once and events are passed in chunks, Splunk seems to re-invoke the external lookup script every 255 events. That means the geoip database gets reloaded and any caching is wiped out each time, which leads one to hypothesize that the external lookup version should perform much worse.
However, multiple trials confirm that when given 1 million events to process, the custom search command takes an average of 00:09:10, while the external lookup can do it in 00:07:06.
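To put those timings in perspective, here is a small back-of-the-envelope calculation. It only uses the figures quoted in this post (1 million events, the two durations, and the observed 255 events per lookup invocation); the variable and function names are made up for the sketch.

```python
import math


def to_seconds(hhmmss):
    """Convert an 'HH:MM:SS' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


EVENTS = 1_000_000
CHUNK = 255  # events per external lookup invocation, as observed above

custom_s = to_seconds("00:09:10")   # custom search command
lookup_s = to_seconds("00:07:06")   # external lookup script

custom_rate = EVENTS / custom_s     # events per second
lookup_rate = EVENTS / lookup_s
slowdown = custom_s / lookup_s      # how much slower the custom command is
invocations = math.ceil(EVENTS / CHUNK)  # script re-invocations for 1M events

print(f"custom command:  {custom_s} s, {custom_rate:.0f} events/s")
print(f"external lookup: {lookup_s} s, {lookup_rate:.0f} events/s")
print(f"custom command is {slowdown:.2f}x slower")
print(f"external lookup invocations: {invocations}")
```

So the custom search command takes 550 seconds against 426 seconds, roughly 1.29 times as long, even though the external lookup script would be re-invoked about 3,922 times over the same data.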
I was disappointed to observe such a large performance deficit from the custom search command, despite all the supposed advantages. Does anyone have some insight into what could be causing this? Is this performance gap to be expected?
22 seconds for 1M lookups on a single Splunk instance, running on a dual-core Windows laptop.
I used this database for testing: http://lite.ip2location.com/database/ip-asn. The cidr, asn, and as fields together are 28MB. If you can do without the as field, that would shrink it considerably further.