Hi all,
We intermittently see some ES correlation searches getting “skipped” at their scheduled run time (we confirm this from _internal logs). When the correlation search runs normally, it triggers Notable Event creation and Risk Analysis (Risk Modifier) adaptive response actions. However, when a run is skipped, the corresponding time window is effectively missed: no notable events are created and no risk modifier events are generated, which creates a security coverage gap.
What we need is a supported/recommended way to backfill / replay a correlation search for a specific earliest/latest time range that was skipped. The replay must:
If anyone has faced this before, I’d appreciate any best-practice recommendations or a supported approach to handle replay/backfill for skipped correlation searches in ES.
Thank you both for your detailed answers. I completely agree with your assessments and recommendations. Nearly all of our correlation searches are configured to run against accelerated data models, and their average runtime is around 10 seconds. The searches are scheduled to run every 5 minutes, and we have distributed their cron schedules evenly (e.g., */5 * * * *, 1/5 * * * *, 2/5 * * * *, etc.) to reduce concurrency peaks.
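For anyone who wants to script this kind of staggering rather than assign offsets by hand, here is a minimal Python sketch. The function name `staggered_crons` is hypothetical, and it writes the offsets in the standard range/step cron form (`1-59/5`) rather than the `1/5` shorthand above; I'm assuming that form is accepted by your scheduler, so verify it on a test search first.

```python
def staggered_crons(n_searches, interval_min=5):
    """Return one cron expression per search, with minute offsets
    assigned round-robin across the interval (hypothetical helper)."""
    crons = []
    for i in range(n_searches):
        offset = i % interval_min
        if offset == 0:
            minute = f"*/{interval_min}"
        else:
            # Range/step form: start at `offset`, then every `interval_min` minutes.
            minute = f"{offset}-59/{interval_min}"
        crons.append(f"{minute} * * * *")
    return crons


if __name__ == "__main__":
    for cron in staggered_crons(7):
        print(cron)
```

With more searches than minutes in the interval, the offsets simply wrap around, so each minute slot ends up with roughly the same number of searches.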
However, in certain situations where system performance drops, some searches take over 300 seconds to complete, which leads to them being skipped as you described. As you suggested, we are investigating the root causes (infrastructure, query optimization, etc.) so that skips do not occur in the first place. In parallel, we also want to be prepared with an effective workaround for the cases that still slip through.
In particular, I'm exploring whether there might be a workaround involving Adaptive Response actions to somehow manually trigger the relevant correlation rules or actions when a scheduled search is skipped due to resource limitations. If you have any ideas or have seen such approaches implemented (for example, triggering alerts or responses outside the normal search schedule when skips are detected), your input would be extremely valuable.
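Whatever mechanism ends up doing the replay, it first needs the list of time ranges that were missed. Below is a rough Python sketch of that bookkeeping only. It assumes each scheduled run covers the interval immediately before its scheduled time (e.g. the previous 300 seconds), which may not match a search's actual earliest/latest settings, and that the skipped scheduled times have already been extracted from _internal as epoch seconds. The function name `missed_windows` is hypothetical and this is not an ES feature.

```python
def missed_windows(skipped_times, interval_s=300):
    """Turn skipped scheduled run times (epoch seconds) into merged
    (earliest, latest) windows that would need to be replayed.

    Assumption: a run scheduled at time t covers [t - interval_s, t).
    """
    windows = []
    for t in sorted(set(skipped_times)):
        earliest, latest = t - interval_s, t
        if windows and windows[-1][1] >= earliest:
            # Touches or overlaps the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], latest))
        else:
            windows.append((earliest, latest))
    return windows


if __name__ == "__main__":
    # Two consecutive skips and one isolated skip.
    for earliest, latest in missed_windows([1200, 900, 3000]):
        print(f"earliest={earliest} latest={latest}")
```

Merging consecutive skips into one window keeps the number of replay runs small, at the cost of each replay covering a longer range.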
+1 on Rich's advice to first check _why_ you're getting skipped searches.
The most typical causes are:
1) Too little processing power: your hardware cannot handle all the searches you're throwing at it
2) Too many searches being spawned at the same time
3) Inefficiently written searches
4) Any combination of the three above.
Switching to continuous scheduling can help in some cases, but if your environment is constantly overstressed you'll end up with searches running with more and more lag.
The first thing to do is determine *why* the searches are being skipped. Then you can take corrective measures and avoid having to worry about replays.
The most common causes of skipped searches are 1) too many searches running at the same time; and 2) searches not completing before the next scheduled run time. The Monitoring Console will list skipped searches and the reason(s) for the skips.
To correct the first issue, examine the schedule of each search and ensure the schedules are evenly distributed around the clock (as much as is practical, at least). It's very common for the majority of searches to run at the top of each hour. To avoid that, look for cron schedules that start with "0" or "*".
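As an illustration of that check, here is a small Python sketch that flags cron expressions whose minute field is exactly "0" or "*". The function name `runs_at_crowded_minute` is hypothetical, and the input list is assumed to have been exported from your saved searches. Whether step expressions such as `*/5` should also be flagged is a judgment call; this sketch does not flag them.

```python
def runs_at_crowded_minute(cron):
    """Return True if the cron expression's minute field is exactly
    "0" (top of the hour) or "*" (every minute)."""
    minute_field = cron.split()[0]
    return minute_field in ("0", "*")


if __name__ == "__main__":
    schedules = ["0 * * * *", "*/5 * * * *", "* * * * *", "17 * * * *"]
    for cron in schedules:
        if runs_at_crowded_minute(cron):
            print(f"consider moving: {cron}")
```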
To correct the second issue, examine the search to see why it is slow. It's possible the SPL could be made more efficient, or perhaps it's searching too much data or too large a time window. If the search cannot be improved, then change the schedule so it runs less often.
Finally, consider running the searches on a continuous schedule. That prevents them from being skipped. See https://help.splunk.com/en/splunk-enterprise-security-7/tutorials-and-use-cases/7.3/correlation-sear...