In props.conf, I have a time-based auto-lookup: "LOOKUP-jobstart = jobstart host OUTPUT jobid, user", against a periodically-updated csv file with columns "time,host,jobid,user". This is a killer feature for us, but here is something undesired:
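For context, a time-based lookup of this kind is usually wired up roughly as below. This is only a sketch: the `time_format` and `max_offset_secs` values are assumptions for illustration, not the actual configuration, and the `[syslog]` stanza name is inferred from the lookup applying to sourcetype=syslog.

```
# transforms.conf (sketch; time_format and max_offset_secs are assumed values)
[jobstart]
filename = jobstart.csv.gz
# column in the csv that holds the job start timestamp
time_field = time
# assumed: the time column holds epoch seconds
time_format = %s
# assumed: match events up to one day after the lookup row's time
max_offset_secs = 86400

# props.conf
[syslog]
LOOKUP-jobstart = jobstart host OUTPUT jobid, user
```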
Say I search for "eventtype=OOM", and I get events with various jobids via the above. Now I want to omit events for jobid=8290453, so I alt-click on the jobid value. The search becomes "eventtype=OOM NOT jobid=8290453", which looks right, but "stats count by jobid" gives different results than before. Yes, jobid=8290453 is gone, but some of the counts for other jobids are now lower. Where did the events go?
Looking at job inspector, I see that the search has internally become:
DEBUG: base lispy: [ AND [ OR oom [ AND killed memory of out process ] [ AND allocation failure page ] [ AND allocate failed to ] [ AND enough memory not ] [ AND memory of out process ran ] [ AND acquire enough huge memory to unable ] ] [ OR 8290453 [ NOT sourcetype::syslog ] [ AND [ NOT host::rs2767 ] [ NOT host::rs2768 ] [ NOT host::rs2769 ] [ NOT host::rs2770 ] [ NOT host::rs2771 ] [ NOT host::rs2772 ] [ NOT host::rs2773 ] [ NOT host::rs2774 ] [ NOT host::rs2775 ] [ NOT host::rs2776 ] [ NOT host::rs2777 ] ] ] ]
This reveals the details of eventtype=OOM and that the lookup is for sourcetype=syslog, but I think it also shows that the NOT condition has expanded to all the hosts associated with jobid=8290453 (yes, those hosts are associated with that jobid in jobstart.csv.gz, but only for a limited time range). So, if those hosts have OOMs during other jobids, I won't see those either? I'm not 100% sure that is what is happening because I'm struggling with "| set diff", but I think so.
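One way to test that hypothesis with "| set diff" might look like the sketch below. It compares the alt-click search against a variant that applies the lookup explicitly and filters afterwards; the choice of `table` fields is an assumption about what is enough to identify an event, and subsearch result limits may truncate large result sets.

```
| set diff
    [search eventtype=OOM NOT jobid=8290453
        | table _time, host, _raw]
    [search eventtype=OOM
        | lookup jobstart host OUTPUT jobid
        | search NOT jobid=8290453
        | table _time, host, _raw]
```

If the hypothesis is right, the rows returned should be OOM events on the rs27xx hosts that fall outside the time range of jobid=8290453.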
For comparison, a search for "eventtype=OOM | lookup jobstart host | search NOT jobid=8290453" does the desired thing, e.g. "stats count by jobid" is the same, except that jobid=8290453 is missing. This is a manual workaround, but now if I alt-click another jobid, I'm back in unhappy land. Basically, alt-click doesn't work the way I'd like (or expect) for time-based auto-looked-up fields.
If I am understanding the issue (please correct if I'm wrong), this seems like a bug - the lookup is inherently time-based, so technically the auto-search-writer should do something like
"... (NOT (_time>jobstarttime AND _time<jobendtime AND (host=rs2770 OR host=rs2771 OR ...)))". If it did, I'd be able to get rid of one of my hairy macros, which does this 🙂
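Written out by hand, a time-scoped exclusion of that shape could be sketched as follows. The two epoch bounds are hypothetical placeholders standing in for jobstarttime and jobendtime, and only two of the job's hosts are shown; none of these values come from the real jobstart data.

```
eventtype=OOM
| where NOT (_time >= 1400000000 AND _time < 1400003600
             AND (host="rs2770" OR host="rs2771"))
| stats count by jobid
```

Because the filter runs in "where" after the events are retrieved, it does not benefit from index-level pruning, so it trades speed for correctness.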
So, 1) am I understanding the issue? And 2) is there a way to get the above desired behavior from alt-click on a time-based auto-looked-up field? E.g. via a different method, or a revision of the internal query interpreter?
You can usually bypass the reverse-lookup optimization by piping your base events through "| search", like this:
index=myIndex | search eventtype=OOM
Now when you alt-click, your search should be like this:
index=myIndex | search eventtype=OOM NOT jobid=8290453
If you then examine the normalized search text in the job inspector, it should no longer contain the expanded host list, and the search may behave differently (more consistently), if perhaps somewhat slower (speeding up the search is the purpose of this optimization).
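To check whether the workaround helps, one option is to run the piped form with the same stats used in the question and compare the per-jobid counts against the unfiltered search. This is a sketch only; `myIndex` is the same placeholder index name as above.

```
index=myIndex
| search eventtype=OOM NOT jobid=8290453
| stats count by jobid
```

If the optimization is bypassed, every jobid other than 8290453 should show the same count as in "index=myIndex | search eventtype=OOM | stats count by jobid".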
Very good question. I don't have a good answer for you, but your analysis sounds right. Lookups are only loosely date-effective, so I'm not sure it's possible for Splunk to internally implement the "_time>jobstarttime AND _time<jobendtime" logic, because Splunk only allows a single timestamp per lookup record.