OK; this one's odd... what might cause a lookup in a search to only return results some of the time...?
Brief description:
I have a search for tracking Windows user authentications in a high-flux DHCP environment, which I implement like so:
1. A scheduled report on the DHCP log runs every 2 minutes and outputs to a lookup table containing DHCP IP, client hostname, MAC, DHCP lease issue time and DHCP lease release time. The report search is rather convoluted: it returns DHCP ACK and RELEASE events, then marries them up with what's already in the lookup table to maintain a historic record of which IPs were issued to which client devices and when.
2. Search the Windows Security log for Kerberos and NTLM authentication events, then look up the IP (Kerberos) or hostname (NTLM) in the DHCP lookup table, and evaluate the authentication event timestamp against the lease-issued and lease-released timestamps in the lookup table to return a nice, traceable set of properties for the source of the authentication event.
The problem is that the DHCP lease table lookup only works when it feels like it: some of the time the search returns fully-populated results like it's supposed to, but most of the time it doesn't, despite the data it should hit on being present in the lookup table.
What could be causing this, and how would I even go about troubleshooting it, let alone fixing it?
This sounds like a race condition where you are looking into the lookup table before it is available from the last update. This is exactly the kind of thing (latency from disk writes) that the new KV Store was designed to address. I would immediately switch from using outputlookup to disk and move to KV Store and probably your troubles will go away. The other thing that you might do is inspect your jobs, particularly the Normalized Search string, and compare ones that fail to ones that don't. Splunk does some crazy optimizations when normalizing and I have seen it do the wrong thing, especially with lookups. One way to bypass most of Splunk's normalization by reverse-lookup optimizations is to add a superfluous | search as far left in your search as you can. See if this makes any difference.
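For anyone wanting a starting point for the KV Store switch, here is a rough sketch of the two config stanzas involved. The collection name dhcp_leases, the lookup name dhcp_leases_kv, and all the field names are made up for illustration; substitute whatever your DHCP report actually writes. Treat it as a sketch under those assumptions, not a tested config.

```ini
# collections.conf (in your app's local directory)
# Hypothetical collection name; the field type lines are optional.
[dhcp_leases]
field.ip = string
field.hostname = string
field.mac = string
field.lease_issued = time
field.lease_released = time

# transforms.conf (same app)
# Hypothetical lookup name; external_type = kvstore points the lookup
# at the collection above instead of a CSV file on disk.
[dhcp_leases_kv]
external_type = kvstore
collection = dhcp_leases
fields_list = _key, ip, hostname, mac, lease_issued, lease_released
```

With something like that in place, the scheduled DHCP report would end in | outputlookup dhcp_leases_kv instead of writing to the CSV, and the authentication search would reference dhcp_leases_kv in its lookup commands.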
That actually sounds none too implausible... especially since the search includes two lookups to the DHCP table: one to help pin down the DHCP lease during which the auth event was generated, based on the lease issue and release timestamps, the other to actually retrieve the client properties...
Checking the jobs log shows the scheduled search of the DHCP log is fairly speedy (its runtime is always less than a minute), but the main search, especially if it covers a decent stretch of time, can take over 5 minutes...
Haven't had to use a kvstore yet; off I go to do some reading.