Deployment Architecture

Splunk Agent Status by Host

_gkollias
Builder

Recently, the Splunk agent has been down on various servers, and I'd like to set up an alert that tells us which servers are in a "DOWN" state once Splunk stops running. I've created this search based on other questions that have been asked here:

| metadata index=esb_prf type=hosts | append [| metadata index=esb_dev type=hosts]
| eval host = replace(host, "\.gtg\.com", "")
| stats max(lastTime) AS last_time_active by host
| eval age = now() - last_time_active
| eval Status = case(age < 60, "Running", age >= 60, "DOWN")
| convert ctime(last_time_active)
| search Status="DOWN"
| table host, last_time_active , Status
| sort host

The problem I'm seeing is that not every host listed in the table is actually down; some of them are, but others are up and are being flagged incorrectly.

Is there a way to modify this search to make it more accurate, so that it only shows servers that are actually down? I've seen people use "index=_internal", but that relates to the throughput of the forwarder sending data, so if the forwarder was up but had no logs to send, you might get some false reports. That would probably be really rare, though. Another concern is that searching _internal can be really slow because of everything that gets crammed in there, especially in production.
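
For reference, the _internal approach I've seen suggested looks roughly like this (the 10-minute threshold is just an example):

index=_internal
| stats max(_time) AS last_internal by host
| eval age = now() - last_internal
| where age > 600
| convert ctime(last_internal)
| table host, last_internal, age

The idea is that every forwarder sends its own _internal logs, so a host that has gone quiet in _internal is more likely to have a stopped forwarder rather than just no application logs to send.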

Your help would be much appreciated.

Thank You


Runals
Motivator

I'm using a two-step process to various ends. Step 1 is to get the most recent timestamp for every host, sourcetype, and source, and write that to a CSV. Step 2 is to evaluate that list and alert if a host, or a particular sourcetype or source, hasn't been seen in X period of time.

Build Query

| metasearch
| rex field=host "(?<host>^[^0-9]\S[^.]+|^[0-9]\S+)"
| eval host = lower(host)
| dedup index host sourcetype source
| rex field=source "(?<path>.*?)(?:\.[^./]+?)?$"
| eval last_seen = _time
| table index host sourcetype path last_seen
| inputlookup append=t host_data_last_seen.csv
| stats max(last_seen) AS last_seen by index host sourcetype path
| eval right_now = now()
| eval time_diff = right_now - last_seen
| where time_diff < (86400 * 4)
| table index host sourcetype path last_seen
| outputlookup host_data_last_seen.csv

Couple notes:

I have some fully qualified host names and some that aren't, so I'm making them all short. I also have a lot of sources with dates at the end for log rotation; the second rex command deals with that (gracefully, for the most part). Adjust the where statement for the number of days you keep data from systems that stop sending logs; in this case I figure that if data hasn't come in after 4 days, the system is probably decommissioned (see the one-line example just below). I used dedup because it was faster than using stats to the same end (1,300 forwarders / 1.5 TB).
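
For example, keeping data from idle systems around for a week instead of four days would just mean changing the window in that where clause:

| where time_diff < (86400 * 7)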

Alert Query

| inputlookup host_data_last_seen.csv
| stats max(last_seen) AS last_seen by host
| eval right_now = now()
| eval time_diff = right_now - last_seen
| eval hours = round(time_diff/3600)
| where hours >= 8
| eval alert = "Hours since logs last seen - " . hours
| table host alert hours
| sort -hours
| fields host alert
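
If you also want to catch a missing sourcetype rather than just a whole host going quiet, the same pattern works by extending the by clause. A rough sketch, not something I've tested at scale:

| inputlookup host_data_last_seen.csv
| stats max(last_seen) AS last_seen by host sourcetype
| eval hours = round((now() - last_seen)/3600)
| where hours >= 8
| eval alert = "Hours since logs last seen - " . hours
| table host sourcetype alert hours
| sort -hours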

martin_mueller
SplunkTrust

The fastest execution is probably to query the deployment server's client list like this:

| rest /services/deployment/server/clients

You'll get a list of deployment clients back, along with a timestamp of their latest phonehome request. Based on your phonehome intervals you can then determine deployment clients that should have phoned home but didn't. Something along these lines:

| rest /services/deployment/server/clients | where lastPhoneHomeTime < relative_time(now(), "-10m")

Adjust the "-10m" accordingly. I can't test that myself right now, so just give it a shot.
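
If that returns results, a slightly fuller version with a readable timestamp might look like this (again untested; hostname and ip are fields the clients endpoint normally returns, so adjust if yours differ):

| rest /services/deployment/server/clients
| where lastPhoneHomeTime < relative_time(now(), "-10m")
| eval last_phonehome = strftime(lastPhoneHomeTime, "%Y-%m-%d %H:%M:%S")
| table hostname, ip, last_phonehome
| sort hostname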

martin_mueller
SplunkTrust

Are you running that on the deployment server?


_gkollias
Builder

Hmmm... I'm not getting any results with that. I'll keep playing around with it. If you have any other suggestions, I'm all ears. Thanks!


martin_mueller
SplunkTrust

That should do all the filtering already, based on the current time and each client's last phonehome time. You can of course append a table and sort, if you like.

_gkollias
Builder

Thanks for the response.

I tried that and it just narrowed down the results from 32 to 23. A lot of the same servers that aren't really down are still showing up in the results.


somesoni2
Revered Legend

Increase the age limit you are checking against. 60 seconds may be too low, and with some indexing latency you may get false alarms. Set it to a value like 5 minutes (300) or 30 minutes (1800).
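
In your original search that would just mean bumping the thresholds in the Status eval, for example to 30 minutes:

| eval Status = case(age < 1800, "Running", age >= 1800, "DOWN")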
