I'm currently overhauling the search architecture and am looking to classify my data into types, some of which will have subtypes. Upon investigation, it seems that I could create automatic lookups to classify most of my types/subtypes, except for classifying based on field!=value (I am unaware of a way to look up something that doesn't exist). I could also use eventtypes to build all of the auto-classification, which seem to be more flexible in terms of defining the data types (they can use full search terms, not just matching key=value pairs).
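For illustration, the eventtype approach would let me express the negative case directly as search language; a hypothetical eventtypes.conf stanza (names made up):

```
# eventtypes.conf -- hypothetical names, for illustration only
[thislog_unhealthy]
# Full search terms are allowed, including the field!=value case
# that I can't express with an automatic lookup:
search = sourcetype=thislog NOT status=200
```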
Is it more efficient performance-wise to use lookup tables over eventtypes?
It seems that lookup tables apply only to the sourcetype specified, while eventtypes would have to check every event, regardless of sourcetype, to see if there is a match. If I define an eventtype with some search terms and a sourcetype (e.g.
sourcetype=thislog foo=value), will Splunk only try to assign that eventtype to events of the specified sourcetype? I read somewhere that there is at least SOME optimization for eventtypes, although it was not clearly documented or explained.
Also, what about using a calculated field to add classification data? How does that compare to eventtypes or lookups? My searches typically return around 1 million events, with occasional searches returning maybe 300 million events.
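For reference, the calculated-field approach I have in mind would be something like this in props.conf (sourcetype and field names hypothetical):

```
# props.conf -- hypothetical names
[thislog]
# Tag each event with a classification at search time:
EVAL-data_class = case(foo=="value", "typeA", foo=="other", "typeB", true(), "unclassified")
```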
Lookups aren't preferable from a search performance point of view. To filter based on a lookup Splunk has to load events, add the lookup, and then filter. Slooooow!
Calculated fields are similarly bad, Splunk has to load events, calculate field, filter. Slooooow!
Eventtypes (and tags) differ in that they're just names for regular search language bits. As a result, Splunk can mostly filter stuff before loading the events. Vrooooom!
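As a sketch of why (hypothetical names): an eventtype is just a named chunk of search language, so Splunk can expand it before touching events.

```
# eventtypes.conf
[thislog_foo]
search = sourcetype=thislog foo=value

# A search such as
#   eventtype=thislog_foo | stats count by host
# can effectively expand to
#   sourcetype=thislog foo=value | stats count by host
# so non-matching events are filtered out up front rather than
# being loaded and then discarded.
```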
Thanks for the insight. I did some testing which confirms your points: I saw more performance degradation with calculated fields and lookups than with eventtypes. In the end, the best solution for me was a combination of eventtypes/tags plus lookup tables without automatic lookups. I set up a macro to easily add the lookup command manually, so our system doesn't take the performance hit when I'm returning tons of events with no need for the lookup table data.
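For anyone curious, the macro is roughly this (lookup and field names hypothetical):

```
# macros.conf -- hypothetical names
[classify]
definition = lookup my_classification sourcetype OUTPUT data_type data_subtype
```

Then I only pay for the lookup when a search actually needs it, e.g. index=blah error | `classify` | stats count by data_type.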
Adding the lookup automatically doesn't have much performance impact for reporting searches, i.e. ones that end with stats, chart, table, and so on. Splunk will only add the fields from the lookup if the reporting commands require them.
The performance impact comes in only if you filter by fields added by the lookup.
I would add here that lookup tables can be helpful in speeding up searches when used as pre-filters, e.g.:
[ | inputlookup myhostfilter | search myfiltercriteria=xyz | fields host ] index=blah
This may reduce disk i/o by a lot.
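To make the mechanics explicit (hypothetical values): the subsearch runs first, and its results are substituted into the outer search as an OR of field=value terms.

```
# The subsearch
#   [ | inputlookup myhostfilter | search myfiltercriteria=xyz | fields host ]
# might return hosts web01 and web02, so the outer search becomes:
#   ( host=web01 OR host=web02 ) index=blah
# and only events from those hosts are read off disk.
```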
Splunk tries to do back-generation of the requisite search terms for both calculated fields and lookups. It's much harder to do this for the generic calculated field case, but for lookups it's pretty easy. At search startup time, we traverse the table and do a reverse mapping of output terms to input terms, and use those to acquire the events.
As a result, lookups are frequently not actually slower than eventtypes and tags.
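To illustrate with a hypothetical lookup table: if the table maps host to data_type, filtering on an output field can be reversed into input-field terms at search startup.

```
# my_classification.csv (hypothetical)
#   host,data_type
#   web01,web
#   web02,web
#   db01,database
#
# A search filtering on the lookup's output field, e.g.
#   index=blah data_type=web
# can be rewritten by the reverse mapping to roughly
#   index=blah ( host=web01 OR host=web02 )
# which can be filtered against the index, much like an eventtype.
```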
As a caveat, it's tricky to get the logic right in scripted lookups to ensure this happens efficiently. We need to create better examples.