I'm trying to build an alert that identifies inactive hosts that crashed.
Inactive host - a host that hasn't logged anything in the past hour
Host that didn't crash - logs a message like ".* Gracefully Exited"
Host that did crash - never logs a message like the one above and eventually becomes inactive
For inactive hosts, I've found this search to be useful. It searches the past 2 hours for hosts that haven't logged within the last hour:
| tstats latest(_time) as latest where index=a sourcetype=b source=c earliest=-2h by host
| eval logged_within_past_hour = if(latest > relative_time(now(),"-1h"),1,0), time_of_host_last_log = strftime(latest,"%c")
| where logged_within_past_hour=0
I'm able to use this Splunk search to find logs where the host terminated gracefully:
index=a sourcetype=b "Gracefully Exited"
Is there a way to find hosts that crashed and have become inactive? I don't want to include the hosts that terminated successfully and didn't crash.
Check out the TrackMe and Meta Woot! apps. They do that kind of thing for you.
If you want to do it yourself, be aware that finding something that is not there is not Splunk's strong suit. You'll need a list of expected hosts to compare against those seen recently. See this blog entry for a good write-up on it.
https://www.duanewaddle.com/proving-a-negative/
Hm, similar to that post, could I do this kind of set manipulation (I could also use a table with count > 1 to build a set)?
Take the hosts that have logged in [-2hr, now]
- hosts that printed the graceful exit message in [-2hr, now] (this excludes graceful exits)
------------------------------------------------------------
Now we're left with running and crashed hosts that have logged in [-2hr, now]
- hosts that have logged in [-1hr, now] (this excludes running and crashed hosts that logged within the last hour)
------------------------------------------------------------
Now we're left with hosts that ONLY logged in [-2hr, -1hr] and never exited gracefully. Some of these might technically still be running, but since they haven't logged anything in [-1hr, now], I'm going to declare them crashed.
Does this work out? I'm unsure how to implement this.
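To sanity-check the set arithmetic above outside Splunk, here is a minimal Python sketch. It only models the logic and is not an SPL implementation. The function name `crashed_hosts` and the host names are hypothetical, made up for illustration.

```python
def crashed_hosts(logged_2h, graceful_2h, logged_1h):
    """Return hosts seen in the last 2h, minus graceful exits, minus hosts seen in the last 1h."""
    # Step 1: drop every host that printed the graceful exit message.
    not_graceful = set(logged_2h) - set(graceful_2h)
    # Step 2: drop every host that is still logging in the most recent hour.
    return not_graceful - set(logged_1h)


# Hypothetical host names, for illustration only.
logged_2h = {"host-a", "host-b", "host-c", "host-d"}  # logged in [-2hr, now]
graceful_2h = {"host-b"}                              # printed "Gracefully Exited"
logged_1h = {"host-a", "host-d"}                      # logged in [-1hr, now]

print(sorted(crashed_hosts(logged_2h, graceful_2h, logged_1h)))  # ['host-c']
```

One caveat that follows from the logic as written: a host that exited gracefully, restarted, and then crashed inside the same 2-hour window would be removed in step 1 and so would not be reported.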
I think you need a couple of subsearches. Something like this:
<<Hosts that have logged>> earliest=-2h latest=now
    NOT [ <<Hosts that print the graceful exit message>> earliest=-2h latest=now | fields host | format ]
    NOT [ <<Hosts that have logged>> earliest=-1h latest=now | fields host | format ]
Be sure to test each subsearch separately to make sure they return the expected results.