We have a large number of hosts reporting to Splunk, and sometimes (rarely) some of them stop sending events. Is there an elegant search for hosts that last reported anything more than T ago?
I'd like to make an alert for T above, say, 6 hours or so...
can't you just talk to the humans that do have access to install apps???
Much easier than re-inventing the wheel. Also, based on the question below about why a lookup is necessary, I would recommend you spare yourself the scars of learning 😉
Plus, once your alert goes nuts... you'll see why the app is so cool.
This is what I ended up using -- thanks to @gcusello for the stats ... BY host idea:
a search for normal events | fields host, _time | stats max(_time) AS most_recent by host | where most_recent < relative_time(now(), "-5h") | eval most_recent = strftime(most_recent, "%F %T")
The above performs whatever search you typically use, then looks for hosts that haven't produced any matching events within the specified time (5 hours in the example above). The search time range is set by the usual time picker, which should, obviously, include the alert interval.
(The relative_time call could probably be expressed more elegantly, but this works.)
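If the data is accelerated or you only need indexed fields, a tstats variant of the same idea may run much faster, and so is safer to schedule as a frequent alert. This is a sketch, not from the original thread, and the index name is a placeholder:

| tstats latest(_time) AS most_recent WHERE index=your_index BY host | where most_recent < relative_time(now(), "-5h") | eval most_recent = strftime(most_recent, "%F %T")

Since tstats reads index-time metadata rather than raw events, it scales to large host counts much better than an event search over the same window.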
You have to create a lookup (e.g. called perimeter.csv, with a field called host) containing all the hosts to monitor; then you have to run a search like this:
| metasearch index=_internal | eval host=lower(host) | stats count BY host | append [ | inputlookup perimeter.csv | eval host=lower(host), count=0 | fields host count ] | stats sum(count) AS total BY host | where total=0
In this way you get all the hosts from your list that didn't send logs in the monitoring period.
You can create an alert that runs e.g. every 5 minutes.
If you delete the last line (| where total=0) and add the line
| eval status=if(total=0,"Missing","Up")
you have a dashboard that displays each host's status.
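Putting those two changes together, the dashboard variant of the search would look like this (the same search as above, with only the final step changed):

| metasearch index=_internal | eval host=lower(host) | stats count BY host | append [ | inputlookup perimeter.csv | eval host=lower(host), count=0 | fields host count ] | stats sum(count) AS total BY host | eval status=if(total=0,"Missing","Up")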
Thanks for the ideas, but why do I need to create a lookup? The hosts are already known to Splunk -- all those that have reported in the last, say, 30 days but not in the last 5 hours.
A manually managed lookup is the easiest way to be sure about the monitoring perimeter: if you e.g. build the list from the hosts seen in the last 24 hours, you won't check hosts that had already stopped sending before that period!
Anyway, if this is sufficient for you, you can schedule a search every night that populates the perimeter.csv lookup, so you don't have to do anything by hand.
| metasearch index=_internal earliest=-24h | dedup host | sort host | table host | outputlookup perimeter.csv
and then run the above search e.g. every 5 minutes.
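As a variant, the metadata command can build the same host list even more cheaply, since it reads only index metadata rather than scanning events. This is a sketch of an alternative, not the original poster's search; metadata returns one row per host, so no dedup is needed:

| metadata type=hosts index=_internal | table host | outputlookup perimeter.csv

Whichever populating search you use, a nightly schedule is enough, since the lookup only needs to track which hosts belong to the perimeter, not their latest events.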
Your solution surely covers your functional need, but I think it's a very slow search if you use _internal (which means you cannot run it in an alert e.g. every five minutes!), and an unreliable search if you use a different index (because it's possible that you have nothing to receive on that index!).
In addition, you don't check servers that didn't send any logs at all within the search timeframe.
I have used the above solution for an alert (with a frequency of 5 minutes) that has been running for many years!
Ciao, and see you next time!