Scheduled searches missing events at random

I have a handful of scheduled searches that run at regular intervals. They are monitoring for pretty straightforward events, doing some minor field manipulation, and logging these events into a summary index. The search head that is running these searches is connected to two indexer clusters (one each at two geographic sites), and it is configured with no site affinity. (We used to have this search head configured with site affinity for its geographic site, but we changed it after the discussion on this question: https://answers.splunk.com/answers/591103/diagnosing-latency-between-indextime-and-searchabi.html)

What I am seeing now is that the scheduled searches are running without reporting any errors. They run on schedule, and I see remote_search logs from each indexer, as expected. But random events are missing. For example, during one search interval yesterday, the scheduled search found 3 events - but when I now search back over that same time interval, I see 4 events. One occurred (and was indexed) before the missing event, and two occurred (and were indexed) after the missing event. Moreover, one of the detected events was indexed on the same indexer as the non-detected event - so I know the indexer ran the search. The lag time (between event _time and event _indextime) is <2 seconds, and the event occurred pretty much in the middle of the time window over which the scheduled search was running (not on the cusp, where timing issues might account for the error).

The problem is having a statistically significant impact on operations. Over just two such scheduled searches that I've been tracking for 5 days, I see one search has only detected 104/120 total events, and the other has detected 514/593. Has anyone else encountered this? If so, do you have tips for tracking the source of the issue?

Revered Legend

Based on the wording, I'm assuming you're using basic scheduling (the "Every 5 minutes" / "Every 15 minutes" dropdown values) instead of cron schedules. I've seen problems when too many scheduled searches run at the same time. With basic scheduling, every-5-minute searches run at minutes 0, 5, 10, 15, 20, ...; every-10-minute searches run at 0, 10, 20, 30, ...; and every-15-minute searches run at 0, 15, 30, ... As you can see, they overlap repeatedly, especially at minutes 0 and 30 of the hour. Apart from adding a delay to the time range, you should also distribute the schedules evenly to avoid overlap as much as possible. So let's say you have 10 searches that run every 5 minutes; I would suggest changing them to use cron schedules as follows:

1/5 * * * * (every 5 min on 1,6,11,16,21,26... minutes, time range can be -6m@m to 11m@m)
2/5 * * * * (every 5 min on 2,7,12,17,22,27... minutes, time range can be -7m@m to 12m@m)
3/5 * * * * (every 5 min on 3,8,13,18,23,28... minutes, time range can be -8m@m to 13m@m)

... and so on. This way they all run on different minutes and don't overlap and eat up resources/quota. A similar exercise can be done for the every-10-minute searches (2/10 * * * *) and the every-15-minute searches (3/15 * * * *).
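
To make the overlap concrete, here is a quick Python sketch (illustrative only, not Splunk code) comparing the start minutes of the basic schedules with the staggered cron schedules suggested above:

```python
def trigger_minutes(offset, step):
    """Minutes of the hour at which an every-`step`-minute schedule fires,
    starting at minute `offset`."""
    return set(range(offset, 60, step))

# Basic scheduling: every-5, every-10, and every-15-minute searches all start at :00
basic_5 = trigger_minutes(0, 5)
basic_10 = trigger_minutes(0, 10)
basic_15 = trigger_minutes(0, 15)
collisions = basic_5 & basic_10 & basic_15  # minutes where all three fire together

# Staggered cron schedules (1-59/5, 2-59/10, 3-59/15): distinct start minutes
stag_5 = trigger_minutes(1, 5)
stag_10 = trigger_minutes(2, 10)
stag_15 = trigger_minutes(3, 15)

print(sorted(collisions))          # [0, 30]
print(stag_5 & stag_10 & stag_15)  # set()
```

With the basic schedules, all three cadences pile up at :00 and :30; with the staggered offsets, no two cadences ever share a start minute.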

Champion

Is the search potentially relying on any extractions that you have available but the app running the search does not? You may want to double-check where the fields your search needs are defined.

Not as far as I can tell. The events that get detected appear identical to the events that do not get detected. I'm running my test/baseline queries inside the same app that is doing the scheduled searches, and I don't have any field extractions defined under my account. I've literally never added a field extraction myself into this Splunk deployment, but is there any way to triple-check?

Champion

If the events (missing and present) appear to use the same fields I think you should be good in that regard.

Are there missing and present events on the same indexer(s)? Or are the missing ones on different splunk_servers than the present ones?

You may also try scheduling an identical search to run over the same time period where you're seeing the issue, and see whether it always exhibits the same behavior or really does appear to change at random.

Yeah, the missing and present events both use the same fields for detection. The splunk_server values for the source events don't demonstrate any patterns - it looks like they are being reasonably well load-balanced, and events are being detected and missed from the same distribution of splunk_servers. So, the source events aren't clustering on a single splunk_server, and the missed events are well-distributed across the clusters.

Thanks for the suggestion, @somesoni2. The cron syntax you proposed was rejected by Splunk, so I did a little digging and I think this is what you meant:

1-59/5 * * * *
2-59/5 * * * *
3-59/5 * * * *

Is that right?

Let's say this does fix the current issue, but our number of needed scheduled searches continues to grow as analysts develop new analytics, etc. Aside from running the kinds of "double-check that the scheduled searches didn't miss anything I can now detect" searches that I'm running now, are there other notices/events I could watch for that would indicate that the number of overlapping searches is hitting a threshold where we should expect that any of our scheduled searches is likely going to be missing events?

Revered Legend

That is correct. I used to use only the 1-59/5... syntax, but I believe some versions of Splunk accepted the other syntax as well. Anyhow, let's use what always works.
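
The accepted `start-end/step` spelling can be sanity-checked with a small helper (an illustrative Python sketch, not Splunk's actual cron parser; `expand_minute_field` is a made-up name):

```python
def expand_minute_field(field):
    """Expand a cron minute field such as '1-59/5' or '*/5' into the set of
    minutes it matches. Handles only the forms discussed in this thread."""
    rng, _, step = field.partition('/')
    step = int(step) if step else 1
    if rng == '*':
        start, end = 0, 59
    else:
        start, _, end = rng.partition('-')
        start, end = int(start), (int(end) if end else int(start))
    return set(range(start, end + 1, step))

print(sorted(expand_minute_field('1-59/5')))  # minutes 1, 6, 11, ... up to 56
```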

If this indeed was the issue and adjusting the scheduling fixed it, I would suggest monitoring search concurrency (see this post, 2nd/3rd answers) and possibly the load average/memory usage of your search heads. If both grow to the point of exceeding limits, you can think about adding more capacity to raise the limits (scaling).

Thank you! I'll look into those. I appreciate the help.

Motivator

Can you try using a rolling window instead? I.e., if the search is scheduled to run every 5 minutes, then instead of using -5m@m and now as the earliest and latest times, use -6m@m and -1m@m.
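
In wall-clock terms, that rolling window looks like this (a rough Python sketch of Splunk's `@m` snap-to-minute behavior, not its actual time parser):

```python
from datetime import datetime, timedelta

def snap_to_minute(ts):
    """Truncate a timestamp to the minute boundary, like Splunk's @m snap."""
    return ts.replace(second=0, microsecond=0)

def rolling_window(now, earliest_min, latest_min):
    """Return (earliest, latest) for a window like -6m@m .. -1m@m."""
    earliest = snap_to_minute(now - timedelta(minutes=earliest_min))
    latest = snap_to_minute(now - timedelta(minutes=latest_min))
    return earliest, latest

now = datetime(2018, 3, 7, 12, 0, 30)
earliest, latest = rolling_window(now, 6, 1)
print(earliest, latest)  # 2018-03-07 11:54:00 2018-03-07 11:59:00
```

The one-minute gap at the trailing edge gives freshly arrived events time to finish indexing before the window closes over them.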

SplunkTrust

This is always a good idea. You should not be searching a time range until your normal indexing delay has had time to clear. Take your typical indexing delay, multiply it by 1.5, and round up to the next minute; for example, 90 seconds goes to 135 seconds, which rounds up to a 3-minute delay.
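
That rule of thumb can be written down directly (an illustrative helper, not part of Splunk):

```python
import math

def safe_delay_minutes(typical_lag_seconds):
    """1.5x the typical indexing lag, rounded up to the next whole minute."""
    return math.ceil(typical_lag_seconds * 1.5 / 60)

print(safe_delay_minutes(90))  # 135 seconds rounds up to a 3-minute delay
```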

@hardikJsheth and @DalJeanis - Yes, we have the searches running with a delay. The every-5-minute searches look at -10m@m through -5m@m, and the every-15-minute searches look at -20m@m through -5m@m.

Revered Legend

What's the cron schedule for these scheduled searches? How many concurrent searches are run during the time when scheduled searches are running?

Some of the searches are running every 5 minutes, and some are running every 15 minutes. The search logs indicate ~20 other searches are running during the same time. The searches in question are completing in 1-2 seconds and show no errors.
