I'm borrowing from Mick's answer here. I just want to point out that you can use this metadata approach to really capture two different scenarios:
Your forwarder stopped sending you data. (The original question)
Your forwarder is sending you data from the future. (time travel?)
I would like to make the argument that both are of equal importance to monitor, for the following reasons:
We want to know when a forwarder is down (obviously)
It's important to detect when the system clock on a forwarder is out of sync with the indexer.
Timestamp recognition problems need to be fixed ASAP.
Not checking for future events inhibits your ability to detect when a forwarder goes down. (For example, if an event arrives with a timestamp 10 days in the future, lastTime stays stuck at that future value and will not reflect the current time until the clock catches up. If the forwarder goes down within that window, an alert that only looks for old events will not detect it.)
Here is a search that can detect both situations:
| metadata index=_internal type=hosts | eval age=time()-lastTime | search age>60 OR age<-15 | sort age d | convert ctime(lastTime) | fields age,host,lastTime
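If you want the alert results to say which of the two situations was hit, one possible variation is to label each row. This is only a sketch: the field name status and the labels "stale" and "future" are made up for illustration, and the thresholds are the same 60 and -15 used above.

```spl
| metadata index=_internal type=hosts
| eval age=time()-lastTime
| eval status=case(age>60, "stale", age<-15, "future")
| where isnotnull(status)
| sort age d
| convert ctime(lastTime)
| fields status,age,host,lastTime
```

Hosts that match neither condition get no status value from case() and are dropped by the where clause, so the result set should be the same as the original search with one extra column.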
There are a few things to note here:
If you are forwarding internal events, then index=_internal will let you check for a down forwarder on a much shorter interval, since metrics events are generated every 30 seconds. If you are not, then pick your most active index, or use something like | metadata type=hosts | append [ | metadata index=os type=hosts ] | stats max(lastTime) as lastTime by host to query more than one index. (Try to pick an index that is less prone to timestamp configuration glitches, which could prevent this alert from working properly.)
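Put together, the multi-index variant might look roughly like the following. Treat it as a sketch: index=os is just the example index named above, and the 60-second threshold assumes your chosen indexes receive events at least that often, so you will probably need to raise it for quieter indexes. Note the leading pipe inside the subsearch, which is needed because metadata is a generating command.

```spl
| metadata type=hosts
| append [ | metadata index=os type=hosts ]
| stats max(lastTime) as lastTime by host
| eval age=time()-lastTime
| search age>60 OR age<-15
| sort age d
| convert ctime(lastTime)
| fields age,host,lastTime
```

The stats step keeps only the most recent lastTime per host across the indexes, so a host is reported as stale only if it has gone quiet in all of them.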
Notice that I'm using time() here and not now(). This is because we want the current system clock instead of the time the search was scheduled to run. This gives a more accurate value for age, which can otherwise be skewed by delays in scheduled search execution. (Note that if you are running a version prior to 4.1, you'll need to use now() instead of time(), and you will need to change the -15 threshold to something like -60 or -90, depending on your potential scheduler delays.)
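For a pre-4.1 install, the adjusted search would look something like this. The -90 is only an assumed starting point for an environment where scheduled searches can start up to a minute or so late; tune it to the delays you actually see.

```spl
| metadata index=_internal type=hosts
| eval age=now()-lastTime
| search age>60 OR age<-90
| sort age d
| convert ctime(lastTime)
| fields age,host,lastTime
```

The wider negative threshold is there because a delayed search computes age from its scheduled time, so events that arrived during the delay look like they came from the future.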