I'd like to have an alert that throttles per result, but triggers only once per schedule run (instead of once per host). Or solve this problem some other way.
For example if a service restarts we'll pick up an event from the log and send an email, but if 10 restart at the same time we don't want 10 emails, just one. We want some sort of throttle on the alert because:
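The behaviour being asked for can be modelled outside Splunk to make it concrete. This is a hypothetical Python sketch, not Splunk's alerting settings. DigestThrottle, period and the host names are illustrative only. Each key (for example a host or service) is suppressed for period seconds after it last appeared in a notification, and all unsuppressed keys from one run are bundled into a single notification.

```python
from typing import Optional


class DigestThrottle:
    """Hypothetical model of "throttle per result, notify once per run"."""

    def __init__(self, period: int) -> None:
        self.period = period                      # suppression period in seconds
        self.last_alerted: dict[str, int] = {}    # key -> time it was last notified

    def run(self, now: int, results: list[str]) -> Optional[list[str]]:
        # Keep only keys that were never notified or whose suppression has expired.
        fresh = sorted(
            {r for r in results
             if r not in self.last_alerted or now - self.last_alerted[r] >= self.period}
        )
        if not fresh:
            return None  # nothing new in this run: send no email
        for r in fresh:
            self.last_alerted[r] = now
        return fresh     # one email listing every fresh key


throttle = DigestThrottle(period=1800)
hosts = [f"host{i:02d}" for i in range(10)]
print(throttle.run(0, hosts))                  # ten restarts at once
print(throttle.run(300, hosts))                # same hosts again, still suppressed
print(throttle.run(600, hosts + ["host99"]))   # only the new host is reported
```

Ten simultaneous restarts produce one notification listing all ten hosts, a repeat within the period produces none, and a new host is reported on its own.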
Is there a good way to approach this problem? I have similar concerns about using summarization searches where I want the summary results quickly but don't trust the events to be searchable quickly enough.
Thank you.
A typical way to handle data that arrives at the indexer late is to run the alert over a small time window in the past. For example, if your alert runs once every 5 minutes, you would use
earliest=-6m@m latest=-1m@m
and you can shift or widen that window so that it catches your worst-case indexing delay.
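To make the window arithmetic concrete, here is a small Python sketch (not Splunk code) that approximates earliest=-6m@m latest=-1m@m on epoch seconds. It assumes snapping to the minute in UTC and runs dispatched a few seconds after each 5-minute boundary; snap_back, search_window and BASE are illustrative names. It checks that consecutive runs search back-to-back 5-minute windows with no gaps or overlaps.

```python
def snap_back(now: int, minutes: int) -> int:
    """Approximate Splunk's "-<minutes>m@m": step back, then snap down to the minute (UTC)."""
    t = now - minutes * 60
    return t - t % 60


def search_window(run_time: int, earliest_min: int, latest_min: int) -> tuple[int, int]:
    """[earliest, latest) bounds for earliest=-<earliest_min>m@m latest=-<latest_min>m@m."""
    return snap_back(run_time, earliest_min), snap_back(run_time, latest_min)


BASE = 1_700_000_100                            # arbitrary epoch second on a 5-minute boundary
runs = [BASE + k * 300 + 7 for k in range(4)]   # */5 cron, each run dispatched 7 s late
windows = [search_window(t, 6, 1) for t in runs]

for start, end in windows:
    # Offsets from BASE: each window is 5 minutes wide and starts where the last one ended.
    print(start - BASE, end - BASE)
```

Because every bound is snapped to the minute, a few seconds of dispatch jitter does not change the window, so each _time value is searched by exactly one run.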
If you need more frequent checks, you can make the window a single minute, for example
earliest=-4m@m latest=-3m@m
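The same kind of sketch shows the limit of a lagged single-minute window. It assumes (hypothetically) that a run can only see events already indexed when it starts, and runs every minute a few seconds after the boundary. Under those assumptions an event is found only if its indexing delay is shorter than the roughly 3-minute lag built into -4m@m/-3m@m.

```python
def snap_back(now: int, minutes: int) -> int:
    """Approximate Splunk's "-<minutes>m@m": step back, then snap down to the minute (UTC)."""
    t = now - minutes * 60
    return t - t % 60


def runs_that_find(event_time: int, index_time: int, run_times: list[int],
                   earliest_min: int = 4, latest_min: int = 3) -> int:
    """Count runs whose _time window contains the event and that start after it is indexed."""
    hits = 0
    for t in run_times:
        start, end = snap_back(t, earliest_min), snap_back(t, latest_min)
        if start <= event_time < end and index_time <= t:
            hits += 1
    return hits


BASE = 1_700_000_100                             # arbitrary epoch second on a minute boundary
runs = [BASE + k * 60 + 5 for k in range(30)]    # every-minute cron, dispatched 5 s late
event = BASE + 620                               # the event's _time

print(runs_that_find(event, event + 120, runs))  # indexed 2 minutes late
print(runs_that_find(event, event + 300, runs))  # indexed 5 minutes late
```

The first call counts one run and the second counts none: only one run ever looks at a given minute of _time, so anything indexed after that run has started is never seen.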
However, if your data is delayed by more than the window, you will miss the alert. If you really need to search a larger window to catch the event's _time, you can use _indextime to decide which events should trigger the alert. For example, with a 5 minute sliding window of -6m@m to -1m@m, you could write the query so that it only alerts on events whose _indextime is between -2m@m and -1m@m.
Running the search every minute on cron then only ever alerts on events indexed in that 1m window, even though the _time search range covers the full 5 minutes.
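A Python simulation of this approach (an approximation, not Splunk code; the helper names are made up) runs every minute with a _time range of -6m@m to -1m@m and can optionally keep only events whose _indextime falls in [-2m@m, -1m@m). It assumes a run can see an event only once it has been indexed.

```python
def snap_back(now: int, minutes: int) -> int:
    """Approximate Splunk's "-<minutes>m@m": step back, then snap down to the minute (UTC)."""
    t = now - minutes * 60
    return t - t % 60


def alerting_runs(event_time: int, index_time: int, run_times: list[int],
                  use_indextime_filter: bool) -> list[int]:
    """Return the runs that would alert on one event.

    _time range: earliest=-6m@m latest=-1m@m.
    Optional filter: relative_time(now(), "-2m@m") <= _indextime < relative_time(now(), "-1m@m").
    """
    hits = []
    for t in run_times:
        if index_time > t:
            continue  # not indexed yet, so this run cannot see it
        if not snap_back(t, 6) <= event_time < snap_back(t, 1):
            continue  # outside the _time search range
        if use_indextime_filter and not snap_back(t, 2) <= index_time < snap_back(t, 1):
            continue  # indexed outside this run's 1-minute slice
        hits.append(t)
    return hits


BASE = 1_700_000_100                             # arbitrary epoch second on a minute boundary
runs = [BASE + k * 60 + 5 for k in range(20)]    # every-minute cron, dispatched 5 s late
event = BASE + 30

print(len(alerting_runs(event, event + 180, runs, use_indextime_filter=False)))
print(len(alerting_runs(event, event + 180, runs, use_indextime_filter=True)))
print(len(alerting_runs(event, event + 600, runs, use_indextime_filter=True)))
```

With a 3-minute delay, the _time window alone matches the same event in three consecutive runs, while the _indextime filter alerts on it exactly once. A 10-minute delay still falls outside the 5-minute _time range, so that range has to cover your worst-case delay.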
Hope this helps
Thank you. The sliding windows are what I currently use for converting events to metrics (i.e. running a report from -4m@m to -3m@m and using mcollect to write the results). It's nice to know that's a recommended way to handle those.
For using _indextime, is this right?
sourcetype=somesourcetype event=SomeEvent earliest=-10m@m latest=now | where _indextime >= relative_time(now(), "-1m@m") AND _indextime < relative_time(now(), "@m")
Is there any reason I can't use the previous minute (@m) for the upper bound of indextime, or should I always go back a bit?
Yes, that's the right query.
I'm in favour of always bounding both the start and end time in scheduled searches. In your example the search's latest is now, but that is unlikely to ever add events, as it would imply events with _indextime < _time: possible, but not really what you want.
Because of clock sync differences, I would always avoid using @m as the upper bound with a cron schedule that runs every minute, as you could conceivably miss an event that is just being indexed.
Unless you need results as close to real time as possible, I would use -1m@m as the upper bound rather than @m.
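The risk with an @m upper bound can be sketched by modelling, as an assumption, a short gap between the _indextime stamp and the moment the event actually becomes visible to search. searchable_at and all timestamps are hypothetical. The event is stamped one second before a minute boundary but only becomes searchable after that minute's run has started.

```python
def snap_back(now: int, minutes: int) -> int:
    """Approximate Splunk's "-<minutes>m@m" (or "@m" when minutes is 0), snapping in UTC."""
    t = now - minutes * 60
    return t - t % 60


def alerting_runs(index_time: int, searchable_at: int, run_times: list[int],
                  lo_min: int, hi_min: int) -> list[int]:
    """Runs that alert when filtering -<lo_min>m@m <= _indextime < -<hi_min>m@m.

    searchable_at is a hypothetical model of when the event is actually visible to
    search, which may be slightly later than its _indextime stamp.
    """
    return [
        t for t in run_times
        if searchable_at <= t and snap_back(t, lo_min) <= index_time < snap_back(t, hi_min)
    ]


BASE = 1_700_000_100                            # arbitrary epoch second on a minute boundary
runs = [BASE + k * 60 + 5 for k in range(20)]   # every-minute cron, dispatched 5 s late
stamped = BASE + 599                            # _indextime one second before a boundary
visible = BASE + 610                            # searchable only after the BASE + 605 run

print(alerting_runs(stamped, visible, runs, lo_min=1, hi_min=0))  # upper bound @m
print(alerting_runs(stamped, visible, runs, lo_min=2, hi_min=1))  # upper bound -1m@m
```

With [-1m@m, @m) the event falls between two runs and is never alerted on; shifting the filter back to [-2m@m, -1m@m) gives it a full minute to become searchable, at the cost of one extra minute of latency.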
Thanks! This has been very helpful.
I thought setting the upper limit to "now" would help ensure that if a server somehow ends up a few too many seconds ahead of Splunk (and therefore _time is greater than _indextime) it wouldn't be excluded, since the next search window would exclude the event as _indextime is now outside the window. Is that right or is my logic bad?
Actually, I think your logic is right, given that you're using _indextime to define the window.
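A sketch of that scenario, with the same caveats (a Python approximation with hypothetical values): the host's clock is 90 seconds fast, so the event's _time is later than its _indextime. It uses an _indextime slice of [-2m@m, -1m@m) and a _time range starting at -10m@m. Only one run selects the event by _indextime, and whether it survives depends on the _time upper bound.

```python
def snap_back(now: int, minutes: int) -> int:
    """Approximate Splunk's "-<minutes>m@m": step back, then snap down to the minute (UTC)."""
    t = now - minutes * 60
    return t - t % 60


def alerting_runs(event_time: int, index_time: int, run_times: list[int],
                  latest_is_now: bool) -> list[int]:
    """Runs that alert with -2m@m <= _indextime < -1m@m and _time in [-10m@m, latest)."""
    hits = []
    for t in run_times:
        latest = t if latest_is_now else snap_back(t, 1)  # latest=now vs latest=-1m@m
        if (index_time <= t
                and snap_back(t, 2) <= index_time < snap_back(t, 1)
                and snap_back(t, 10) <= event_time < latest):
            hits.append(t)
    return hits


BASE = 1_700_000_100                            # arbitrary epoch second on a minute boundary
runs = [BASE + k * 60 + 5 for k in range(20)]   # every-minute cron, dispatched 5 s late
indexed = BASE + 610
event_time = indexed + 90                       # host clock 90 s ahead of the indexer

print(alerting_runs(event_time, indexed, runs, latest_is_now=False))  # latest=-1m@m
print(alerting_runs(event_time, indexed, runs, latest_is_now=True))   # latest=now
```

With latest=-1m@m the only run whose _indextime slice contains the event rejects it on _time, and no later run picks it up; with latest=now that same run keeps it.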