I have a remotely triggered job that should run at least once in every 24-hour period. The start message (i.e. "Job Triggered") appears in /var/log/messages. What is the best way to search or report for hosts that DO NOT have the "Job Triggered" message within a 24-hour period?
So far, I have this search:
source="/var/log/messages" host="*" "Job Triggered." earliest=-1d | dedup host | stats count by host
This shows the hosts that logged the message, but it doesn't tell me how many hosts did not log "Job Triggered" in that period.
In order to evaluate against history (to find a gap), you'll have to collect some history. A way that this is achieved in the Deployment Monitor app (which ships with Splunk) is to utilize a summary index that's used to "remember what is seen". Another way would be to use | inputlookup combined with | outputlookup to create a CSV file that has some history.
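As a rough sketch of the lookup approach, a scheduled search along these lines could maintain that CSV. The lookup file name job_last_seen.csv and the field name last_seen are made up for illustration and are not from the original post; adjust the base search to match your data.

```
source="/var/log/messages" "Job Triggered." earliest=-1d
| stats max(_time) as last_seen by host
| inputlookup append=true job_last_seen.csv
| stats max(last_seen) as last_seen by host
| outputlookup job_last_seen.csv
```

The first stats keeps the most recent trigger time per host from the current window, inputlookup append=true adds the rows already stored in the (hypothetical) lookup, and the second stats merges the two so each host keeps its latest timestamp. On the very first run the lookup file will not exist yet, so you may need to seed it once by running the search without the inputlookup line and the second stats.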
Ultimately, you'd end up with a list of "hosts we've seen kick off the job over all time, and the last time they ran it", and then perform some time math against | eval this_time=now() to see if it's > 24h.
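The time math could look something like the following. This is only a sketch: it assumes a lookup file (hypothetically named job_last_seen.csv) that holds one row per host with an epoch-time field called last_seen; both names are placeholders, not something Splunk provides.

```
| inputlookup job_last_seen.csv
| eval this_time=now()
| eval hours_since=round((this_time - last_seen) / 3600, 1)
| where hours_since > 24
| table host last_seen hours_since
```

Each remaining row is a host whose last recorded "Job Triggered" is more than 24 hours old. Note that a host which has never logged the message will not appear in the history at all, so it cannot show up here unless the list of expected hosts comes from somewhere else.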