I was under the impression that forwarders send a heartbeat back to the indexers. How can I create an alert for when a forwarder hasn't checked in within the last 5 minutes, for example?
OK, it's way simpler to just enable email on the DMC, clone the stock DMC Alert - Missing Forwarders, and then edit the clone via Advanced Edit. I have awarded points to both people for their efforts and for getting me on the right track.
| inputlookup dmc_forwarder_assets
| search status="missing"
| eval status = "Not Reachable"
| eval "Last Connected" = strftime(last_connected,"%m-%d-%Y %H:%M:%S")
| rename status as Status
| rename os as OS
| rename hostname as "Source Host"
| table "Source Host" "Last Connected" OS Status
Let's take apart the DMC Missing Forwarders alert in Splunk 6.3! It's called DMC Alert - Missing forwarders, with contents:
| inputlookup dmc_forwarder_assets
| search status="missing"
| rename hostname as Instance
So we need to figure out how those dmc_forwarder_assets are created: via a macro called dmc_build_forwarder_assets(1):
`dmc_set_index_internal` sourcetype=splunkd group=tcpin_connections NOT eventType=*
| stats
values(fwdType) as forwarder_type,
latest(version) as version,
values(arch) as arch,
values(os) as os,
max(_time) as last_connected,
sum(kb) as new_sum_kb,
sparkline(avg(tcp_KBps), $sparkline_span$) as new_avg_tcp_kbps_sparkline,
avg(tcp_KBps) as new_avg_tcp_kbps,
avg(tcp_eps) as new_avg_tcp_eps
by guid, hostname
which also includes a second macro, dmc_set_index_internal, which is simply:
index=_internal
Then we have one last macro, dmc_re_build_forwarder_assets(1), which is the essence of the dmc_forwarder_assets lookup:
`dmc_build_forwarder_assets($sparkline_span$)`
| rename new_sum_kb as sum_kb, new_avg_tcp_kbps_sparkline as avg_tcp_kbps_sparkline, new_avg_tcp_kbps as avg_tcp_kbps, new_avg_tcp_eps as avg_tcp_eps
| eval avg_tcp_kbps_sparkline = "N/A"
| addinfo
| eval status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 900)), "missing", "active")
| eval sum_kb = round(sum_kb, 2)
| eval avg_tcp_kbps = round(avg_tcp_kbps, 2)
| eval avg_tcp_eps = round(avg_tcp_eps, 2)
| fields guid, hostname, forwarder_type, version, arch, os, status, last_connected, sum_kb, avg_tcp_kbps_sparkline, avg_tcp_kbps, avg_tcp_eps
| outputlookup dmc_forwarder_assets
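To sanity-check what ends up in the lookup, you can inspect it directly (the field names come from the fields line above):
| inputlookup dmc_forwarder_assets
| eval last_connected = strftime(last_connected, "%m-%d-%Y %H:%M:%S")
| table hostname, forwarder_type, status, last_connected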
So what is the real take-away here? Let's pull out the relevant parts:
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=*
| stats
max(_time) as last_connected,
sum(kb) as sum_kb by guid, hostname
| addinfo
| eval status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 900)), "missing", "active")
| where status="missing"
where we can tweak the value 900 (15 minutes) to whatever we want.
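For example, to match the 5-minute threshold from the original question, a minimal version (an untested sketch) would use 300 seconds:
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=*
| stats max(_time) as last_connected, sum(kb) as sum_kb by guid, hostname
| addinfo
| eval status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 300)), "missing", "active")
| where status="missing"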
Thank you!
Is there a way for results to keep showing up if I set the relative time range to the last 10 minutes (for alerting)? A host's last connection time might have been an hour ago, which would satisfy the info_max_time - 60 condition but would fall outside a last-10-minutes search window. I would need it to keep showing up when running the search above.
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=*
| stats max(_time) as last_connected, sum(kb) as sum_kb by guid, hostname
| addinfo
| eval "Source Host" = hostname
| eval ttnow = now()
| eval Current = strftime(ttnow,"%m-%d-%Y %H:%M:%S")
| eval Status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 60)), "Not Reachable", "active")
| eval "Last Connected" = strftime(last_connected,"%m-%d-%Y %H:%M:%S")
| where Status = "Not Reachable"
| table "Source Host" "Last Connected" Current Status
Why can't you just change 900 to 600 and schedule the alert to run every ten minutes? Sorry, I'm not sure I fully understand what you mean.
Say the computer went offline at 1300 hrs: anytime between 1300 and 1310 it will show up in the result set and the alert triggers fine. But when the alert runs after 1320, it won't show up, because the last-connected event falls in the previous 10-minute window, not the current one. If I change the alert to a 60-minute window, it will work until that event falls out of the 60-minute window too. What I need is to keep getting alerts until the forwarder stops being "missing". I hope that wasn't too confusing?
@rfiscus - I'm not positive I'm following, but if I understand correctly, it should still show up after 1320 if it is still disconnected, because the Status field has multiple conditions:
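namely, the eval from the search above:
| eval Status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 60)), "Not Reachable", "active")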
So if it is really disconnected, the sum_kb field should be zero and the forwarder should show up again on the next scheduled run of the alert / search.
Ultimately, regarding your comment about the 60-minute window versus the 10-minute window: unless you want to embed a bunch of smaller versions of this search within itself and do some _time bucketing, you'll only have the granularity of the earliest and latest times of the main search. So if the search above ran every hour, and you set the logic to last_connected < (info_max_time - 3600), then you would only have the granularity of one hour. That is, if a forwarder went offline at 13:27, and your alert runs every hour on the hour, then you'd have to wait until 14:00 for your alert to run again and find out that a forwarder went offline. If you wanted to know exactly when machines were going off and on, maybe you could explore changing stats to timechart so you can see those finer time-slices.
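A rough sketch of that timechart idea (the 5-minute span is only an illustration):
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=*
| timechart span=5m sum(kb) as sum_kb by hostname
A host whose column goes empty or drops to zero in a given slice stopped sending data during that slice.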
You could use the alerting feature included in the Distributed Management Console and define an alert for the forwarders there.
Check http://docs.splunk.com/Documentation/Splunk/6.2.0/Admin/Platformalerts for details.
I agree, and that works OK. But how can I get these events into the indexers so I can query them via the search head? We do not allow emailing from our DMC.
If you need them to show up in a Splunk index, you could define a script as the alerting action and feed the script's output back into Splunk.
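As a sketch of that approach (the log path and the ALERT_LOG override are hypothetical; Splunk 6.x legacy alert scripts receive the path to the gzipped results file as the eighth argument):

```shell
#!/bin/sh
# Hypothetical alert-action script: Splunk passes the path to the
# gzipped results file as argument 8. We append a timestamped line
# to a log file that a Splunk monitor input then re-indexes.
RESULTS_FILE="${8:-unknown}"
LOG_FILE="${ALERT_LOG:-/tmp/missing_forwarders.log}"
echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') alert=missing_forwarders results=$RESULTS_FILE" >> "$LOG_FILE"
```

You would then point a [monitor://...] input at the log file so the events land in an index searchable from your search head.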
But why would you need that? You can use the alert manager to monitor this as well: http://docs.splunk.com/Documentation/Splunk/6.2.0/Alert/Setupalertactions