We are running sec in parallel with Splunk. A few days ago, sec alerted on a stack dump, but the rt search that is set to email did not alert on it. I matched the event from sec with an event in Splunk, so it was indexed.
What are the possible causes of Splunk not alerting on, or not 'finding', the event?
If the docs are accurate, then rt alerts/searches should never miss an event that matches. The rt searches are supposed to see the data as it streams in, before it hits the index.
I am not sure what the root cause is in your case, but Splunk can miss a real-time alert if the timestamp in the log is not accurate. The system time on your Splunk server and the timestamps of the indexed events need to be consistent. In particular, if you use a small real-time window, such as a few seconds, Splunk can easily miss the alert when there is any time difference.
There are at least 2 potential causes. The first is clock skew (the timestamping issue Takajian mentions): if you run a 1 minute rt window search over events from a machine whose timestamps are all 2 minutes behind the Splunk indexer, then none of the events will fall within the evaluation window. The fixes for this are easy: make the window bigger, adjust the clocks, etc.
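To make the clock skew case concrete, here is a minimal Python sketch of the window arithmetic. It is only a simplified model, not how Splunk evaluates rt windows internally; the function name `falls_in_window` and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta

def falls_in_window(event_time, indexer_now, window):
    """Return True if event_time lies inside [indexer_now - window, indexer_now].

    Simplified model of a sliding real-time window (an assumption for
    illustration, not Splunk's actual implementation).
    """
    return indexer_now - window <= event_time <= indexer_now

# Hypothetical values: the indexer clock, and a host whose clock is 2 minutes slow.
indexer_now = datetime(2011, 1, 18, 5, 3, 43)
skew = timedelta(minutes=2)
event_time = indexer_now - skew  # timestamp as written by the skewed host

# A 1 minute window misses the event; a 5 minute window catches it.
print(falls_in_window(event_time, indexer_now, timedelta(minutes=1)))
print(falls_in_window(event_time, indexer_now, timedelta(minutes=5)))
```

With a 2 minute skew the first check fails and the second passes, which is why widening the window or correcting the clocks both work.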
Another potential issue is a very greedy search. If the first pass that we do on events, before matching them against the full search query, lets too much through, it can fill the memory buffer to the point where we have to drop events. You can see this in metrics.log, where a nonzero drop_count indicates that an event has been pushed out of the buffer and that an event of interest MAY have been missed. Here is an example:
01-18-2011 05:03:43.856 -0800 INFO Metrics - group=realtime_search_data, system total, drop_count=0, mean_preview_period=4.071817
and here is a search you can do to see if this is the case:
index=_internal group="realtime_search_data" | where isnotnull(sid) | dedup sid | table _time, sid, drop_count, mean_preview_period
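If you would rather check metrics.log directly, outside of Splunk, here is a rough Python sketch that parses a line in the format of the example above and reports whether any events were dropped. The helper names `parse_metrics_line` and `has_drops` are hypothetical, and the key=value parsing is an assumption based only on that one sample line.

```python
import re

# The sample line from metrics.log quoted earlier in this thread.
SAMPLE = ("01-18-2011 05:03:43.856 -0800 INFO Metrics - "
          "group=realtime_search_data, system total, "
          "drop_count=0, mean_preview_period=4.071817")

# key=value pairs; values run until the next comma or whitespace.
PAIR = re.compile(r"(\w+)=([^,\s]+)")

def parse_metrics_line(line):
    """Return the key=value pairs of a metrics.log line as a dict of strings."""
    return dict(PAIR.findall(line))

def has_drops(line):
    """True if this is a realtime_search_data line with a nonzero drop_count."""
    fields = parse_metrics_line(line)
    if fields.get("group") != "realtime_search_data":
        return False
    return int(fields.get("drop_count", "0")) > 0

fields = parse_metrics_line(SAMPLE)
print(fields["drop_count"], fields["mean_preview_period"])
print(has_drops(SAMPLE))
```

Running `has_drops` over every line of metrics.log and printing the ones that return True would give the same signal as the search, without needing the fields to be extracted.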
Funny that you mention clock skew. I've been tracking down a clock skew issue on my servers. They are running ntpd, but I'm still having an issue. They all run on AWS, on Xen-based VMs (including my Splunk instance). I've been reading up on it: http://www.brookstevens.org/2010/06/xen-time-drift-and-ntp.html
What is your take on that?
jflomenberg: I executed your search, but I'm not sure how to interpret the results. The drop_count column is empty, but the mean_preview_period has values from 0.00xxxxx to 11.011xxxxx (that large value is from today).
I'm probably not the right person to comment on fixing clock skew.
A drop_count of 0 means no lost events.
A mean_preview_period that high could just be the indexer getting bogged down with scheduled jobs or lots of ad hoc searches. We evaluate less frequently if we think we're going to drop the ball on other stuff. It would be more curious if nothing of any significance was happening on the machine at the time that the large value was observed.