Deployment Architecture

Splunk Agent Status by Host

_gkollias
Builder

Recently, the Splunk agent has been down on various servers, and I'd like to set up an alert that tells us which servers are in a "DOWN" state once Splunk stops running. I've created this search based on other questions that have been asked here:

| metadata index=esb_prf type=hosts | append [| metadata index=esb_dev type=hosts]
| eval host = replace(host, "\.gtg\.com", "")
| stats max(lastTime) AS last_time_active by host
| eval age = now() - last_time_active
| eval Status = case(age < 60, "Running", age >= 60, "DOWN")
| convert ctime(last_time_active)
| search Status="DOWN"
| table host, last_time_active , Status
| sort host

The problem I'm seeing is that not every host listed in the table is actually down; some of them are, but others are up and are being flagged incorrectly.

Is there a way to modify this search to make it more accurate, so that it only shows servers that are actually down? I've seen people use "index=_internal", but that relates to the throughput of the forwarder sending data, so if the forwarder was up but had no logs to send, you might get some false reports. That would probably be really rare, though. Another concern is that searching _internal can be really slow because of everything that gets crammed in there, especially in production.
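
For reference, the _internal approach I've seen suggested looks roughly like this (the 10-minute threshold is just an example):

index=_internal
| stats max(_time) AS last_internal by host
| eval age = now() - last_internal
| where age > 600
| convert ctime(last_internal)
| table host, last_internal, age

The idea is that every forwarder sends its own _internal logs, so a host that has gone quiet in _internal is more likely to have a stopped forwarder rather than just no application logs to send.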

Your help would be much appreciated.

Thank You


Runals
Motivator

I'm using a two-step process to various ends. Step 1 is to get the most recent timestamp for every host, sourcetype, and source, and write that to a CSV. Step 2 is to evaluate that list and alert if a host, or a particular sourcetype or source, hasn't been seen in X period of time.

Build Query

| metasearch
| rex field=host "(?<host>^[^0-9]\S[^.]+|^[0-9]\S+)"
| eval host = lower(host)
| dedup index host sourcetype source
| rex field=source "(?<path>.*?)(?:\.[^./]+?)?$"
| eval last_seen = _time
| table index host sourcetype path last_seen
| inputlookup append=t host_data_last_seen.csv
| stats max(last_seen) AS last_seen by index host sourcetype path
| eval right_now = now()
| eval time_diff = right_now - last_seen
| where time_diff < (86400 * 4)
| table index host sourcetype path last_seen
| outputlookup host_data_last_seen.csv

Couple notes:

I have some fully qualified host names and some that aren't, so I'm making them all short. I also have a lot of sources with dates at the end for log rotation; the second rex command deals with that (gracefully, for the most part). Adjust the where statement for the number of days you keep data from systems that stop sending logs; in this case I figure that if data hasn't come in after 4 days, the system is probably decommissioned (see the one-line example just below). I used dedup because it was faster than using stats to the same end (1,300 forwarders / 1.5 TB).
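
For example, keeping data from idle systems around for a week instead of four days would just mean changing the window in that where clause:

| where time_diff < (86400 * 7)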

Alert Query

| inputlookup host_data_last_seen.csv
| stats max(last_seen) AS last_seen by host
| eval right_now = now()
| eval time_diff = right_now - last_seen
| eval hours = round(time_diff/3600)
| where hours >= 8
| eval alert = "Hours since logs last seen - " . hours
| table host alert hours
| sort -hours
| fields host alert
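
If you also want to catch a missing sourcetype rather than just a whole host going quiet, the same pattern works by extending the by clause. A rough sketch, not something I've tested at scale:

| inputlookup host_data_last_seen.csv
| stats max(last_seen) AS last_seen by host sourcetype
| eval hours = round((now() - last_seen)/3600)
| where hours >= 8
| eval alert = "Hours since logs last seen - " . hours
| table host sourcetype alert hours
| sort -hours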

martin_mueller
SplunkTrust

The fastest execution is probably to query the deployment server's client list like this:

| rest /services/deployment/server/clients

You'll get a list of deployment clients back, along with a timestamp of their latest phonehome request. Based on your phonehome intervals you can then determine deployment clients that should have phoned home but didn't. Something along these lines:

| rest /services/deployment/server/clients | where lastPhoneHomeTime < relative_time(now(), "-10m")

Adjust the "-10m" accordingly. I can't test that myself right now, so just give it a shot.
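
If that returns results, a slightly fuller version with a readable timestamp might look like this (again untested; hostname and ip are fields the clients endpoint normally returns, so adjust if yours differ):

| rest /services/deployment/server/clients
| where lastPhoneHomeTime < relative_time(now(), "-10m")
| eval last_phonehome = strftime(lastPhoneHomeTime, "%Y-%m-%d %H:%M:%S")
| table hostname, ip, last_phonehome
| sort hostname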

martin_mueller
SplunkTrust

Are you running that on the deployment server?


_gkollias
Builder

Hmmm... I'm not getting any results with that. I'll keep playing around with it. If you have any other suggestions, I'm all ears. Thanks!


martin_mueller
SplunkTrust

That should do all the filtering already, based on the current time and each client's last phonehome time. You can of course append a table and sort, if you like.

_gkollias
Builder

Thanks for the response.

I tried that and it just narrowed down the results from 32 to 23. A lot of the same servers that aren't really down are still showing up in the results.


somesoni2
Revered Legend

Increase the age limit you are checking against. 60 seconds may be too low, and with some indexing latency you may get false alarms. Set it to a value like 5 minutes (300) or 30 minutes (1800).
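
In your original search that would just mean bumping the thresholds in the Status eval, for example to 30 minutes:

| eval Status = case(age < 1800, "Running", age >= 1800, "DOWN")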
