Splunk Enterprise Security

Correlation Search Skipped / Backfill Operations

kirchoff
Explorer

Hi all,

We intermittently see some ES correlation searches getting “skipped” at their scheduled run time (we confirm this from _internal logs). When the correlation search runs normally, it triggers Notable Event creation and Risk Analysis (Risk Modifier) adaptive response actions. However, when a run is skipped, the corresponding time window is effectively missed: no notable events are created and no risk modifier events are generated, which creates a security coverage gap.

What we need is a supported/recommended way to backfill / replay a correlation search for a specific earliest/latest time range that was skipped. The replay must:

  • Execute the same SPL for the missed window,
  • Trigger the same Adaptive Response actions (Notable + Risk Modifier) based on the results,
  • Not create duplicates (no duplicate notables and no duplicate risk events) — i.e., the replay should be idempotent.
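To make the idempotency requirement concrete, here is a minimal sketch of a replay ledger. It assumes each replay is identified by the search name plus its earliest/latest epoch bounds. The names `window_key` and `ReplayLedger` are hypothetical, nothing here calls Splunk, and a real version would need to persist the keys (for example in a lookup or KV store) rather than keep them in memory.

```python
import hashlib


def window_key(search_name: str, earliest: int, latest: int) -> str:
    """Deterministic ID for one (search, time window) pair (hypothetical helper)."""
    raw = f"{search_name}|{earliest}|{latest}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ReplayLedger:
    """In-memory record of windows that have already been replayed."""

    def __init__(self) -> None:
        self._done = set()

    def claim(self, search_name: str, earliest: int, latest: int) -> bool:
        """Return True the first time a window is claimed, False afterwards."""
        key = window_key(search_name, earliest, latest)
        if key in self._done:
            return False
        self._done.add(key)
        return True


ledger = ReplayLedger()
first = ledger.claim("Example - Rule", 1700000000, 1700000300)
second = ledger.claim("Example - Rule", 1700000000, 1700000300)
print(first, second)  # True False
```

The idea is that a replay job only dispatches the search and its actions when `claim` returns True, so rerunning the same backfill does not produce a second set of notables or risk events for that window.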

If anyone has faced this before, I’d appreciate any best-practice recommendations or a supported approach to handle replay/backfill for skipped correlation searches in ES.

kirchoff
Explorer

Thank you both for your detailed answers. I completely agree with your assessments and recommendations. Nearly all of our correlation searches are configured to run against accelerated data models, and their average runtime is around 10 seconds. The searches are scheduled to run every 5 minutes, and we have distributed their cron schedules evenly (e.g., */5 * * * *, 1/5 * * * *, 2/5 * * * *, etc.) to reduce concurrency peaks.
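For reference, the staggering described above can be sketched roughly like this. The function name `staggered_crons` is made up for illustration, and the offsets are written in the range/step form (`1-59/5`), which I am assuming is equivalent in intent to the `1/5` shorthand used above.

```python
def staggered_crons(n_searches: int, period_min: int = 5) -> list[str]:
    """Spread n searches across the minute offsets of one period."""
    crons = []
    for i in range(n_searches):
        offset = i % period_min  # wrap around once every offset is used
        if offset == 0:
            minute = f"*/{period_min}"
        else:
            minute = f"{offset}-59/{period_min}"
        crons.append(f"{minute} * * * *")
    return crons


print(staggered_crons(3))
# ['*/5 * * * *', '1-59/5 * * * *', '2-59/5 * * * *']
```

With more searches than minute offsets, the offsets simply repeat, so several searches still share each start minute.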

However, in certain situations where system performance drops, some searches take over 300 seconds to complete, which leads to them being skipped as you described. As you suggested, we are investigating the root causes (infrastructure, query optimization, etc.) so that skips do not occur in the first place. In parallel, we also want to be prepared with an effective workaround for the cases that still slip through.
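As a starting point for the workaround side, this is a small sketch of how the missed time ranges could be derived from the scheduled times of skipped runs. It assumes each run covers exactly the preceding period (300 seconds here) and that the skipped scheduled times are already available as epoch seconds. `missed_windows` is a hypothetical helper, not anything provided by ES.

```python
def missed_windows(skipped_times: list[int], period_s: int = 300) -> list[tuple[int, int]]:
    """Turn skipped scheduled times into merged (earliest, latest) windows."""
    windows: list[tuple[int, int]] = []
    for t in sorted(set(skipped_times)):
        earliest, latest = t - period_s, t
        if windows and windows[-1][1] >= earliest:
            # Back-to-back skips: extend the previous window instead of adding one
            windows[-1] = (windows[-1][0], latest)
        else:
            windows.append((earliest, latest))
    return windows


print(missed_windows([1700000400, 1700000700, 1700001600]))
# [(1700000100, 1700000700), (1700001300, 1700001600)]
```

Merging consecutive skips keeps the number of replay runs small, at the cost of each replay covering a longer range.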

In particular, I'm exploring whether there might be a workaround involving Adaptive Response actions to somehow manually trigger the relevant correlation rules or actions when a scheduled search is skipped due to resource limitations. If you have any ideas or have seen such approaches implemented (for example, triggering alerts or responses outside the normal search schedule when skips are detected), your input would be extremely valuable.

PickleRick
SplunkTrust

+1 on Rich's advice to first check _why_ you're getting skipped searches.

The most typical causes are:

1) Too little processing power and your hardware cannot handle all those searches you're throwing at it

2) Too many searches being spawned at the same time

3) Inefficiently written searches

4) Any combination of the three above.

Switching to continuous scheduling can help in some cases, but if your environment is constantly overstressed, you'll end up with searches running with more and more lag.
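To illustrate the lag point, here is a toy model (not how the Splunk scheduler is actually implemented): runs execute strictly one after another, and each run's lag is whatever backlog the previous runs left behind. The function name and the sample runtimes are made up.

```python
def lag_after_each_run(runtimes_s: list[int], period_s: int = 300) -> list[int]:
    """Backlog in seconds after each sequential run of a periodic search."""
    lag = 0
    history = []
    for runtime in runtimes_s:
        # A run longer than the period adds to the backlog; a shorter one drains it
        lag = max(0, lag + runtime - period_s)
        history.append(lag)
    return history


print(lag_after_each_run([10, 400, 400, 10, 10]))
# [0, 100, 200, 0, 0]
```

A couple of slow runs are absorbed once runtimes drop back down, but if the runtime stays above the period, the lag never shrinks.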

richgalloway
SplunkTrust

The first thing to do is determine *why* the searches are being skipped.  Then you can take corrective measures and avoid having to worry about replays.

The most common causes of skipped searches are 1) too many searches running at the same time; and 2) searches not completing before the next scheduled run-time.  The Monitoring Console will list skipped searches and the reason(s) for the skips.

To correct the first issue, examine the schedules of each search and ensure they are evenly distributed around the clock (as much as is practical, at least).  It's very common for the majority of searches to run at the top of each hour.  To avoid that, look for cron schedules that start with "0" or "*".
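If you have the cron schedules exported as plain strings, a quick check along these lines can flag the ones worth a look. This is only a sketch of the "starts with 0 or *" rule of thumb, and the search names and schedules are invented for the example.

```python
def starts_on_common_minute(cron: str) -> bool:
    """True if the minute field of a cron schedule starts with '0' or '*'."""
    minute_field = cron.split()[0]
    return minute_field.startswith(("0", "*"))


# Hypothetical search names and schedules
schedules = {
    "Search A": "0 * * * *",
    "Search B": "*/5 * * * *",
    "Search C": "3-59/5 * * * *",
    "Search D": "17 * * * *",
}
flagged = [name for name, cron in schedules.items() if starts_on_common_minute(cron)]
print(flagged)  # ['Search A', 'Search B']
```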

To correct the second issue, examine the search to see why it is slow.  It's possible the SPL could be made more efficient, or perhaps it's searching too much data or too large of a time window.  If the search cannot be improved then change the schedule so it runs less often.

Finally, consider running the searches on a continuous schedule.  That prevents them from being skipped.  See https://help.splunk.com/en/splunk-enterprise-security-7/tutorials-and-use-cases/7.3/correlation-sear...
