Hi,
When trying to filter a high-EPS feed with a lookup, I am running into significant performance problems. Are there known ways to make this run faster?
Basically, I have a feed (averaging between 500 and 2000 EPS) that contains hostnames in "querystring1", and I am trying to filter it against a simple static list of known bad hostnames. The list has 19,000 entries and uses wildcard matching, e.g. "*hitnet.net".
So far I have tried the following:
A) lookup on the index, keeping only events where the lookup returned a value
SPL: index=FeedToFilter | lookup RBL matchstring as querystring1 OUTPUT matchstring as test | where isnotnull(test)
transforms.conf: case_sensitive_match = false / match_type = WILDCARD(matchstring)
This variant is quite slow, taking about 5 minutes to search 2 minutes of events, even on historical data. Adding a filter on the lookup output, such as "where isnotnull(test)" or "search NOT test=*", slows it down even more.
B) inputlookup on the index
SPL: index=FeedToFilter [ | inputlookup RBL | rename matchstring as matchto | fields + matchto ]
This variant either does not start at all or, when the inputlookup is limited with "head 500", takes about 10 minutes to start. With an unlimited inputlookup, Chrome cannot access Splunk at all while the search is running; IE remains responsive, but also shows no progress for the unlimited search.
From what I found on Splunk Answers, inputlookup will not be a solution, as it is limited to fewer than 10k entries. A lookup, however, should work fine and not choke like that. Any insight on how to speed up this query would be greatly appreciated.
Thanks,
Oliver
The following seems to help a lot:
index=FeedToFilter "IN A*" | rex ".(?
a) added "IN A*": a prefilter that reduces the number of events to work with by 50%
b) rex: eliminates the lookup wildcard match by introducing a rex that keeps only the SLD and TLD of each full hostname. E.g. "www.example.org" gets reduced to "example.org", which is exactly what I have in the lookup CSV file.
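To illustrate the idea behind b) outside of Splunk, here is a minimal Python sketch. It assumes that "SLD and TLD" simply means the last two dot-separated labels; the pattern and the function names are hypothetical and are not the rex used in the search above. Each hostname is reduced once and then checked with an exact set lookup, rather than being compared against every wildcard pattern in the list.

```python
import re

# Hypothetical pattern: capture the last two dot-separated labels of a hostname.
# This is an illustration only, not the rex from the search above.
LAST_TWO_LABELS = re.compile(r"([^.]+\.[^.]+)\.?$")


def registered_domain(hostname):
    """Return the last two labels of hostname, lowercased, or None if there are fewer than two."""
    match = LAST_TWO_LABELS.search(hostname.strip().lower())
    return match.group(1) if match else None


def filter_events(hostnames, blocklist):
    """Keep only hostnames whose reduced form is in the blocklist (exact, case-insensitive)."""
    bad = {entry.lower() for entry in blocklist}  # one hash lookup per event
    return [h for h in hostnames if registered_domain(h) in bad]


if __name__ == "__main__":
    events = ["www.hitnet.net", "mail.example.org", "HITNET.NET", "localhost"]
    print(filter_events(events, ["hitnet.net"]))
```

Note that taking the last two labels is a simplification: for names under multi-label suffixes such as "co.uk" it would reduce every host to the suffix itself, so the list and the extraction have to agree on what gets kept.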
Now searching 60 minutes of events only takes 80 seconds, not 20 minutes.
If anyone has further ideas, I would be happy to hear them.
Filtering early is always a good idea, provided you know your data really well.
The lookup filtering is slow because Splunk needs to load every event, perform the lookup, and then throw out the events that do not match, instead of filtering via the index structure and reading only what the search is likely to need.
However, filtering through 120k events (2 minutes at 1k eps) should not take 5 minutes despite this lookup drawback, assuming reasonably specced indexers.
Playing around with inputlookup shows the following:
index=FeedToFilter [|inputlookup RBL | head 5000 | fields + matchstring | rename matchstring as querystring1]
Using [|inputlookup] instead of "| lookup" removes the delay described above.
[|inputlookup] with head 50 is as fast as a search without the subsearch.
[|inputlookup] with head 500 does not add a noticeable slowdown.
[|inputlookup] with head 5000 suddenly makes the search time explode; it takes 472 seconds:
392.306 dispatch.evaluate
392.295 dispatch.evaluate.search
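One possible contributor to the jump at 5000 rows is the size of the search that the subsearch produces: its results are handed back to the outer search as one large OR expression over querystring1. The Python sketch below only illustrates how that expression grows with the number of rows. The function name and the sample patterns are made up, the exact formatting Splunk emits may differ, and the sketch does not reproduce the timings above.

```python
def expand_subsearch(field, values):
    """Build an OR expression resembling what a subsearch hands to the outer search."""
    terms = ['( {}="{}" )'.format(field, value) for value in values]
    return "( " + " OR ".join(terms) + " )"


if __name__ == "__main__":
    for rows in (50, 500, 5000):
        # Made-up wildcard patterns standing in for the lookup rows.
        patterns = ["*bad{}.example".format(i) for i in range(rows)]
        expression = expand_subsearch("querystring1", patterns)
        print(rows, "rows ->", len(expression), "characters")
```

Every term in that expression is a leading-wildcard match, so the outer search has to parse and evaluate thousands of them before it can return anything.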
Just some more info, with 500k events:
"index=FeedToFilter" takes 45 seconds
80.435 command.search
107.877 dispatch.stream.remote
"index=FeedToFilter | lookup RBL matchstring as querystring1 OUTPUT matchstring as test" takes 475 seconds
1,259.78 command.lookup
467.29 dispatch.fetch
1,378.063 dispatch.stream.remote
"index=FeedToFilter | lookup RBL matchstring as querystring1 OUTPUT matchstring as test | search NOT test=*" takes 482 seconds
1,241.517 command.lookup
473.866 dispatch.fetch
1,354.832 dispatch.stream.remote