We have a large number of hosts reporting to Splunk, and sometimes (rarely) some of them stop sending events. Is there an elegant search for hosts that last reported anything more than T ago?
I'd like to make an alert for T above, say, 6 hours or so...
can't you just talk to the humans that do have access to install apps???
Much easier than re-inventing the wheel. Also, based on the question below about why a lookup is necessary, I would recommend you spare yourself the scars of learning 😉
Plus, once your alert goes nuts... you'll see why the app is so cool.
This is what I ended up using -- thanks to @gcusello for the stats ... BY host idea:
a search for normal events | fields host, _time | stats max(_time) AS most_recent by host | where most_recent < relative_time(now(), "-5h") | eval most_recent = strftime(most_recent, "%F %T")
The above performs whatever search you typically use, then looks for hosts that haven't produced any matching events within the specified time (5 hours in the example above). The search time range is set by the usual time picker, which should, obviously, include the alert interval.
(The relative_time call could probably be expressed more elegantly, but this works.)
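If the data is accelerated or you only need indexed fields, a tstats variant of the same idea may run much faster, and so is safer to schedule as a frequent alert. This is a sketch, not from the original thread, and the index name is a placeholder:

| tstats latest(_time) AS most_recent WHERE index=your_index BY host | where most_recent < relative_time(now(), "-5h") | eval most_recent = strftime(most_recent, "%F %T")

Since tstats reads index-time metadata rather than raw events, it scales to large host counts much better than an event search over the same window.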
You have to create a lookup (e.g. called perimeter.csv, with a field called host) containing all the hosts to monitor; then you have to run a search like this:
| metasearch index=_internal | eval host=lower(host) | stats count BY host | append [ | inputlookup perimeter.csv | eval host=lower(host), count=0 | fields host count ] | stats sum(count) AS total BY host | where total=0
In this way you get all the hosts from your list that didn't send logs in the monitoring period.
You can create an alert that runs e.g. every 5 minutes.
If you delete the last line (| where total=0) and add the line
| eval status=if(total=0,"Missing","Up")
you have a dashboard that displays each host's status.
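Putting those two changes together, the dashboard variant of the search would look like this (the same search as above, with only the final step changed):

| metasearch index=_internal | eval host=lower(host) | stats count BY host | append [ | inputlookup perimeter.csv | eval host=lower(host), count=0 | fields host count ] | stats sum(count) AS total BY host | eval status=if(total=0,"Missing","Up")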
Thanks for the ideas, but why do I need to create a lookup? The hosts are already known to Splunk -- all those that have reported in the last, say, 30 days but not in the last 5 hours.
A manually managed lookup is the easiest way to be sure about the monitoring perimeter: if you e.g. build the list from the hosts seen in the last 24 hours, you won't check hosts that had already stopped sending before that period!
Anyway, if this is sufficient for you, you can schedule a search every night that populates the perimeter.csv lookup, so you don't have to do anything by hand.
| metasearch index=_internal earliest=-24h | dedup host | sort host | table host | outputlookup perimeter.csv
and then run the above search e.g. every 5 minutes.
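As a variant, the metadata command can build the same host list even more cheaply, since it reads only index metadata rather than scanning events. This is a sketch of an alternative, not the original poster's search; metadata returns one row per host, so no dedup is needed:

| metadata type=hosts index=_internal | table host | outputlookup perimeter.csv

Whichever populating search you use, a nightly schedule is enough, since the lookup only needs to track which hosts belong to the perimeter, not their latest events.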
Your solution surely covers your functional need, but I think it's a very slow search if you use _internal (which means you cannot run it in an alert e.g. every five minutes!), and an unreliable search if you use a different index (because it's possible that you have nothing to receive on that index!).
In addition, you don't check servers that didn't send any logs at all within the search timeframe.
I have used the above solution for an alert (with a frequency of 5 minutes) that has been running for many years!
Ciao, and see you next time!