I have some scheduled searches. Some run every 5 minutes, some every 15 minutes, some hourly, etc.
Some of those searches populate a summary index; a few others export a CSV to feed the results into another tool regularly.
If there is an outage of the search head (even with two search heads or SH pooling), some jobs might be skipped or missed, as they won't be re-run. This results in an incomplete dataset.
As a first step, I would need a report that shows me, based on the actual schedule, which search was skipped.
index=_internal source=*scheduler.log | eval sched = strftime(scheduled_time, "%Y-%m-%d %H:%M:%S") | search savedsearch_name="Project Honey Pot - Threatscore AVG Last 4 Hours" NOT continued | table sched status savedsearch_name
User Activity Search
- Last RUN | scheduled every 5 minutes | Status=Completed
- Last RUN - 5 minutes | Status=Completed
- Last RUN - 10 minutes | Status=Completed
- Last RUN - 15 minutes | Status=Not Executed
- Last RUN - 20 minutes | Status=Completed
...and this dynamically for each scheduled search, based on its own schedule (every 5 minutes in this example).
The above example would reveal a potential restart of the SH 15 minutes ago. I could then investigate manually and re-run the export for that specific timeframe to add the data again... or the report could review the last 10 successful runs, subtract the timestamps, and automatically detect that the search runs every 5 minutes.
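A minimal sketch of that last idea, assuming the search name "User Activity Search" and that `status=success` marks a completed run in scheduler.log:

```
index=_internal source=*scheduler.log savedsearch_name="User Activity Search" status=success
| sort - _time
| head 10
| delta _time as gap
| eval gap_minutes = round(abs(gap)/60, 2)
| where gap_minutes > 5
```

The `delta` command subtracts each run's `_time` from the previous event's, so any row surviving the `where` clause is a gap larger than the expected 5-minute interval. The threshold is hardcoded here; deriving it from the observed deltas instead is what the kmeans approach further down does.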
Thanks a lot
Thanks for your hint, but this only shows successful runs vs. errors. If I have an activity scheduled every 5 minutes and I shut down my Splunk instance for an hour and start it again, it won't be displayed that several scheduled activities were missed...
Correct, and that's because the scheduler does not keep state across restarts and naturally does not log anything during the downtime. I suppose you can modify your search to include a condition that checks whether a shutdown event occurred in splunkd.log during the timerange in question, and add a field to indicate so. Then use that field to decide whether or not to re-run the reports.
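A rough sketch of that check; note the literal shutdown message text in splunkd.log varies by version, so "Shutting down splunkd" is an assumption you should verify against your own logs:

```
index=_internal source=*scheduler.log savedsearch_name="User Activity Search" status=success
| stats count as runs
| appendcols
    [ search index=_internal source=*splunkd.log "Shutting down splunkd"
      | stats count as shutdowns ]
| eval rerun_needed = if(shutdowns > 0, "check this timerange", "ok")
```

Run this over the timerange you are validating: if a shutdown occurred in the window, the `rerun_needed` flag tells you to inspect that period and re-run the export.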
I found a nearly ideal solution, including being independent of the schedule, using kmeans. First list all successful runs, then calculate the delta between consecutive runs, and then use kmeans to add some statistical clustering to find the outliers. If the system was down or one scheduled task was missed, this will show it:
`set_internal_index` host="mmaier-mbp15.local" source=*scheduler.log savedsearch_name="BlueCoat - Stats - Collect" status!=continued | stats max(run_time) as Max, count by _time savedsearch_name | sort -_time | delta _time as delta | where Max>0 | eval delta = round(delta/60*-1,2) | kmeans delta | sort -_time | replace 1 with "OK", 2 with "ERROR" in CLUSTERNUM
Even in the line-chart visualization I can visualize it very well: