Solved: Trigger alert once for all results but exclude res...

JustinSC · ‎04-07-2021

I'd like an alert that throttles per result but triggers only once per scheduled run (instead of once per host), or some other way to solve this problem.

For example if a service restarts we'll pick up an event from the log and send an email, but if 10 restart at the same time we don't want 10 emails, just one. We want some sort of throttle on the alert because:

We're monitoring Production systems and we need the alert to fire within a minute or two of the event occurring. So we're using a schedule like * * * * *, or maybe */5 * * * * depending on the criticality of the event.
We can't guarantee that the event is delivered to Splunk on time. On occasion someone will be messing with the SAN and indexing will get delayed by a couple minutes, or perhaps the host rebooted and the event won't be sent for another few minutes. We're stuck with this environment for the time being. This (to my understanding) prevents us from using a simple schedule like * * * * * and earliest=-1m@m because we could miss events that don't arrive within the minute.
Therefore the best way to ensure we don't miss events has been to increase the range of the search, for example to -5m@m. However, this means we'll catch the same event multiple times if our schedule is still * * * * *. Hence we want to exclude the events we've already alerted on.

Is there a good way to approach this problem? I have similar concerns about using summarization searches where I want the summary results quickly but don't trust the events to be searchable quickly enough.

Thank you.

bowesmana · ‎04-07-2021

A typical way to handle data that is delayed in arriving at the indexer is to run the alert over a small time window in the past. If your alert runs once every 5 minutes, you would use

earliest=-6m@m latest=-1m@m

and you can change that window to ensure it will catch your worst case.
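
To pick that worst case from data rather than guesswork, one option is to measure the indexing lag directly. This is only a sketch: index=main and sourcetype=somesourcetype are placeholder names, and the 24 hour lookback is an arbitrary choice.

```spl
index=main sourcetype=somesourcetype earliest=-24h@h latest=@h
| eval lag_seconds = _indextime - _time
| stats max(lag_seconds) AS max_lag_seconds, perc95(lag_seconds) AS p95_lag_seconds BY host
| sort - max_lag_seconds
```

The search window should then reach back further than the largest lag you want to tolerate.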

If you need more frequent checks, you can make the window a single minute, for example

earliest=-4m@m latest=-3m@m

However, if your data is delayed by more than the window, you will miss the alert. If you really need to search a larger window to catch the event's _time, you can use _indextime to determine which events should trigger the alert. For example, with a 5 minute sliding window, say -6m@m to -1m@m, you could write the query so that it only alerts on events whose _indextime is between -2m@m and -1m@m.

Running that search every minute on cron will then only ever return events indexed in that 1 minute window, even though the search range is the full 5 minute window.
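
As a rough sketch of that idea, the query below searches the 5 minute window by _time, keeps only events indexed during a single 1 minute slot, and then collapses everything into one row so the alert fires once per run rather than once per host. The index, sourcetype, and event names are placeholders, and the field names restart_count and hosts are made up for the example.

```spl
index=main sourcetype=somesourcetype event=SomeEvent earliest=-6m@m latest=-1m@m
| where _indextime >= relative_time(now(), "-2m@m") AND _indextime < relative_time(now(), "-1m@m")
| stats count AS restart_count, values(host) AS hosts
| where restart_count > 0
```

The final where is there because stats count without a BY clause returns a row with a count of 0 even when nothing matched, which would otherwise trigger a "number of results > 0" alert on every run.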

Hope this helps


JustinSC · ‎04-07-2021

Thank you. The sliding windows are what I currently use for converting events to metrics (i.e. running a report from -4m@m to -3m@m and mcollect the results). It's nice to know that's a recommended way to handle those.

For using _indextime, is this right?

sourcetype=somesourcetype event=SomeEvent earliest=-10m@m latest=now | where _indextime >= relative_time(now(), "-1m@m") AND _indextime < relative_time(now(), "@m")

Is there any reason I can't use the previous minute (@m) for the upper bound of indextime, or should I always go back a bit?

bowesmana · ‎04-07-2021

Yes, that's the right query.

I'm in favour of always bounding both start and end time when using scheduled searches. In your example your search latest is now, but that's unlikely to ever get events, as it would imply events with _indextime < _time, which is possible but not really what you want.

Because of clock sync issues, I would always avoid using @m with a cron schedule that runs every minute, as it's conceivable that you might miss an event that is just being indexed.

Unless you are looking for as close to real time as possible, I would use -1m@m rather than @m as the latest bound.
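
Splunk also has index-time search modifiers, _index_earliest and _index_latest, which should let the same filter be written without the where clause. This is an untested sketch that keeps the sourcetype and event names from the query above and moves the index-time window back one minute as suggested:

```spl
sourcetype=somesourcetype event=SomeEvent earliest=-10m@m latest=now _index_earliest=-2m@m _index_latest=-1m@m
```

The earliest and latest bounds on _time still apply, so they need to stay wide enough to cover the delay you expect.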

JustinSC · ‎04-08-2021

Thanks! This has been very helpful.

I thought setting the upper limit to "now" would help ensure that if a server somehow ends up a few too many seconds ahead of Splunk (and therefore _time is greater than _indextime) it wouldn't be excluded, since the next search window would exclude the event as _indextime is now outside the window. Is that right or is my logic bad?

bowesmana · ‎04-11-2021

Actually, I think your logic is right, given that you're using _indextime to define the window.

Trigger alert once for all results but exclude results recently alerted on

alert condition

throttling
