We've seen an intermittent problem where one log file on a host will stop indexing in Splunk once the log file rotates after a JVM restart, possibly due to a timestamp change. We're working with Splunk on the root cause. Typically the other logs on the same host continue indexing as normal. I'm trying to think of an efficient way to alert when this happens while we work to resolve it, and from what I can tell SOS does not contain a ready-made solution.
The basics are simple: alert when we have 0 events in the last x minutes. What makes this complicated is that we have a large number of sources I care about, in the hundreds. In our taxonomy the sourcetype and source attributes are shared across multiple logs on multiple hosts, e.g. the sourcetype=jvm_log and source=/www/logs/server.log attributes will be common to somewhere between 2 and 12 individual files across as many hosts. Ideally I'd want an alert for any unique host+source combination where the event count=0 for the last x minutes, without writing hundreds of nearly identical searches like the one sketched below.
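For illustration, a per-file version would look something like the following (the index, host, path, and 15 minute window here are just placeholders from our taxonomy), alerting when the result count is non-zero. I'd need hundreds of near-identical copies of it, which is what I'm trying to avoid:
index=main sourcetype=jvm_log source=/www/logs/server.log host=app01 earliest=-15m latest=now
| stats count
| where count=0
(app01, /www/logs/server.log, and -15m are placeholders, not real values from our environment.)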
This should be much faster on Splunk 6:
| tstats latest(_time) as latest where index=* by host source index | where latest < relative_time(now(), "-5m")
Tune the -5m to whatever timeout you need, and adjust the where clause to only grab what you want to look at.
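For instance, to narrow it down to the JVM logs from the question and show how long each host+source has been quiet (sourcetype=jvm_log and the source path are just the examples given above, not anything special), something like:
| tstats latest(_time) as latest where index=* sourcetype=jvm_log source=/www/logs/server.log by host source
| eval minutes_silent=round((now()-latest)/60,0)
| where latest < relative_time(now(), "-5m")
| convert ctime(latest)
Alerting on "number of results > 0" then covers every host+source combination with a single saved search.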
Create an alert based on this search:
index=main earliest=-24h@h latest=now | fields host,source | dedup host,source | eval hasData=0
| append [ search index=main earliest=-5m latest=now | stats count AS hasData by host,source ]
| table host,source,hasData
| dedup host,source sortby -hasData
| search hasData=0
The idea is basically: first list all source/host combinations we know about from the last 24h, then append a search over the data from the last 5 minutes. You'll end up with duplicate records for the source/host combinations that have recent data and a single record for those that don't. Eliminate the duplicates, keeping the row with the higher hasData, and the rows left with hasData=0 are your silent sources.
You can tune the 24h and 5 minute windows according to your needs.
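For the scenario in the question it might look something like this (sourcetype=jvm_log is just the example given, and the windows are the same defaults as above), scheduled every few minutes with the alert firing when the number of results is greater than zero:
index=main sourcetype=jvm_log earliest=-24h@h latest=now
| fields host,source
| dedup host,source
| eval hasData=0
| append [ search index=main sourcetype=jvm_log earliest=-5m latest=now | stats count AS hasData by host,source ]
| dedup host,source sortby -hasData
| search hasData=0
| table host,source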
Hope it helps.
Cheers!