We've seen an intermittent problem where one log file on a host stops indexing in Splunk once the log file rotates after a JVM restart, possibly due to a timestamp change. We're working with Splunk on the root cause; typically the other logs on the same host continue indexing as normal. I'm trying to think of an efficient way to alert when this happens while we work to resolve it, and from what I can tell SOS does not contain a ready-made solution.
The basics are simple: alert when we have 0 events in the last x minutes. What makes this complicated is that we have a large number of sources I care about, in the hundreds. In our taxonomy the sourcetype and source attributes are shared across multiple logs on multiple hosts; e.g., the sourcetype=jvm_log and source=/www/logs/server.log attributes will be common to somewhere between 2 and 12 individual files across as many hosts. Ideally I'd want an alert for any unique host+source combination where the event count is 0 for the last x minutes, without writing hundreds of nearly identical searches.
The idea is basically to first list all source/host combinations you know about from the last 24 hours, then append a search over the data from the last 5 minutes. You'll end up with a duplicated record for each source/host combination that has recent data, and a single record for each one that doesn't. Collapse the duplicates and alert on the ones with a count of 0.
You can tune the 24-hour and 5-minute windows to your needs.
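As a sketch of that approach (using sourcetype=jvm_log from the question as a placeholder; substitute your own sourcetype or index restriction), a single saved search could look something like this:

```
sourcetype=jvm_log earliest=-24h
| stats count by host, source
| eval count=0
| append
    [ search sourcetype=jvm_log earliest=-5m
      | stats count by host, source ]
| stats sum(count) AS recent_count by host, source
| where recent_count=0
```

The outer search builds the baseline list of every host+source combination seen in the last 24 hours, with its count forced to 0; the appended subsearch contributes the real counts for the last 5 minutes. After summing by host and source, any combination that produced no recent events stays at 0 and survives the `where` filter. Save this as an alert that fires when the number of results is greater than 0, and each returned row is a host+source pair that has gone silent.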