Splunk Enterprise

How to check Up/Down hosts with Splunk and/or the Telegraf Agent?

Mark90
Explorer

We are trying to verify whether a server is up or down in several different ways, but none of them is working for us.

We monitor our infrastructure with the Telegraf agent. As far as we know, Telegraf has no built-in "up" metric for the agent itself, so we ran a query on an arbitrary metric and filled nulls wherever the query returned no data:

[Screenshot: Mark90_0-1623096134768.png]

We tested this live by turning off the agent on a server, and the server would just disappear from the list instead of showing as down.
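
For illustration only: fillnull fills empty fields in rows that already exist, so it cannot create a row for a host that returned no data at all. A minimal sketch of one way to keep such hosts in the list is to start from an inventory of expected hosts and left-join the metric data onto it. Everything named here is an assumption and not taken from our environment: the lookup file expected_hosts.csv (with a host column), the metrics index telegraf, and the metric name cpu.usage_idle. The 5-minute window could also be set in the time picker instead of in the WHERE clause.

```
| inputlookup expected_hosts.csv
| eval host=lower(host)
| join type=left host
    [| mstats count(_value) AS points WHERE index=telegraf AND metric_name="cpu.usage_idle" earliest=-5m BY host
     | eval host=lower(host)
     | stats sum(points) AS points BY host]
| fillnull value=0 points
| eval status=if(points > 0, "up", "down")
| table host status points
```

With this shape, a host whose agent is stopped stays in the result with points=0 and status "down", because the row comes from the lookup and not from the metric data.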


So we tried a pure Splunk-based query to see whether it would solve the reliability problem we were having with Telegraf:


| tstats latest(_time) AS latest where index=* earliest=-24h BY host
| eval host=lower(host)
| stats max(latest) AS latest BY host
| eval recent=if(latest > relative_time(now(),"-5m"),1,0), realLatest=strftime(latest, "%c")
| where recent=0
| table host latest recent realLatest

We understand that "recent=0" means that the host has not sent any event in the last 5 minutes, so from our end it can be considered "down". The problem is that most of the "recent=0" servers were still showing up in Telegraf, sending metrics normally.
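
One possible cross-check, sketched under assumptions: if the Telegraf data is stored in a metrics index, it is queried with mstats and not as events, so the same "last seen" logic can be run against the metric data for comparison. The index name telegraf and the metric name cpu.usage_idle are hypothetical placeholders, and the sketch assumes mstats with span returns rows only for time buckets that contain data.

```
| mstats count(_value) AS points WHERE index=telegraf AND metric_name="cpu.usage_idle" earliest=-24h BY host span=1m
| eval host=lower(host)
| stats max(_time) AS latest BY host
| eval recent=if(latest > relative_time(now(),"-5m"),1,0), realLatest=strftime(latest, "%c")
| where recent=0
| table host latest recent realLatest
```

Comparing this output with the event-based query for the same host would show whether the host is silent in both places or only in the event indexes.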

Is there any reliable way to monitor up/down hosts in Splunk?

 

