So I have a search that counts the number of successful DNS server health checks over the last 5 minutes for all of our networks. The search runs the individual counts and then pipes to a search command to look for any counts less than a certain value. When I run the search, there is a brief period (i.e., ~1 second) where a false positive shows up until the search completes. Is this the reason I am getting false positive alerts? We also suspect that this fires when Splunk falls behind on indexing. (We index a LOT of data.)
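For reference, a search of that shape might look like this (the index, sourcetype, field names, and the threshold of 10 are all placeholders, not the asker's actual values):

```
index=dns sourcetype=dns_healthcheck result=success
| stats count AS successes BY network
| search successes < 10
```

While a search like this is still running, the preview can briefly show networks whose partial count is still below the threshold, exactly as described above.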
Methinks there are actually two questions hidden in there.
Does Splunk finish a search before firing an alert?
Certainly. Else anything triggering for "some count is safe over some threshold" would always fire prematurely.
What happens when indexing is delayed?
The search finishes, but doesn't consider events that haven't been indexed yet - here you can get "false" positives.
The key is to not search from e.g. -5m@m to now but rather to allow for indexing delay depending on your environment. If for example you know that you get a minute of delay, you could move your search back two minutes to be safe and run it from -7m@m to -2m@m. That way you still have your five-minute window, but don't get affected by indexing delays.
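Concretely, the shifted window can be expressed right in the search; here's a sketch using the same illustrative index, field names, and threshold as before:

```
index=dns sourcetype=dns_healthcheck result=success earliest=-7m@m latest=-2m@m
| stats count AS successes BY network
| search successes < 10
```

The same effect can be achieved by setting the scheduled search's time range picker to -7m@m / -2m@m rather than hard-coding earliest/latest in the search string.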
To debug a time range that did trigger a false positive, re-run the search manually at a later time. If that re-run would not trigger, look at the _indextime hidden field on the events used to calculate your trigger value, and check whether some of them were indexed only after the alert originally ran.
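One way to do that check, again assuming the illustrative search from above: compute the indexing lag per event and look for events whose lag exceeds your safety margin.

```
index=dns sourcetype=dns_healthcheck result=success earliest=-7m@m latest=-2m@m
| eval index_lag_secs = _indextime - _time
| table _time _indextime index_lag_secs network
| sort - index_lag_secs
```

Events with a lag larger than the margin you built into the schedule are the ones the original alert run would not have seen.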
Skimming through the relevant docs page at http://docs.splunk.com/Documentation/Splunk/6.1.3/Alert/Definescheduledalerts I don't see it explicitly say "the alert is triggered after the search has finished", but the example given (alert if the count of yesterday's purchases is less than 500) would trigger a false positive every time if that weren't the case, because the count would not yet have reached 500 in the previewed search results.
Right. I too skimmed it, but couldn't find anything. The problem is that we pipe the results to the search command and do the count filtering there. For just a fraction of a second we get results that would trigger an alert before the count is met, and then the result(s) disappear. I'm concerned that an alert is triggered during that small window.
The key is to not search from e.g. -5m@m to now but rather to allow for indexing delay depending on your environment. If for example you know that you get a minute of delay you could move your search back two minutes to be safe and run it from -7m@m to -2m@m.
We already do that 😉
As for the logic around Splunk finishing a search, I'm kinda looking for a 'chapter and verse' kind of confirmation. Is there somewhere in the docs that can absolutely confirm this? (I agree that it would be a poor tool if it didn't, but direct confirmation is always desirable.)