OK, it's way simpler just to enable email on the DMC, clone the stock DMC Alert - Missing Forwarders, and then edit the clone via Advanced Edit. I have awarded points to both people for their efforts and for getting me on the right track.
| inputlookup dmc_forwarder_assets | search status="missing" | eval status = "Not Reachable" | eval "Last Connected" = strftime(last_connected,"%m-%d-%Y %H:%M:%S") | rename status as Status | rename os as OS | rename hostname as "Source Host" | table "Source Host" "Last Connected" OS Status
Let's take apart the DMC Missing forwarders alert in Splunk 6.3!
The stock alert is named DMC Alert - Missing forwarders, with contents:
| inputlookup dmc_forwarder_assets | search status="missing" | rename hostname as Instance
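In other words, the stock alert does no real work itself: it reads the pre-built dmc_forwarder_assets lookup, keeps the rows marked missing, and renames hostname to Instance. A sketch of the equivalent logic in Python (the rows here are hypothetical, made up for illustration):

```python
# The stock alert is just a filter over the pre-built asset lookup:
# keep rows whose status is "missing", and expose hostname as Instance.
forwarder_assets = [
    {"hostname": "fwd-a", "status": "missing"},   # hypothetical rows
    {"hostname": "fwd-b", "status": "active"},
]

missing = [
    {"Instance": row["hostname"]}                 # rename hostname -> Instance
    for row in forwarder_assets
    if row["status"] == "missing"
]
print(missing)  # [{'Instance': 'fwd-a'}]
```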
So we need to figure out how the dmc_forwarder_assets lookup gets created. It is populated via a macro called dmc_build_forwarder_assets(1), whose definition is:
`dmc_set_index_internal` sourcetype=splunkd group=tcpin_connections NOT eventType=* | stats values(fwdType) as forwarder_type, latest(version) as version, values(arch) as arch, values(os) as os, max(_time) as last_connected, sum(kb) as new_sum_kb, sparkline(avg(tcp_KBps), $sparkline_span$) as new_avg_tcp_kbps_sparkline, avg(tcp_KBps) as new_avg_tcp_kbps, avg(tcp_eps) as new_avg_tcp_eps by guid, hostname
which in turn calls a second macro, dmc_set_index_internal, which is simply:
index=_internal
Then we have one last macro called dmc_re_build_forwarder_assets(1), which is the one that actually builds and writes the lookup:
`dmc_build_forwarder_assets($sparkline_span$)` | rename new_sum_kb as sum_kb, new_avg_tcp_kbps_sparkline as avg_tcp_kbps_sparkline, new_avg_tcp_kbps as avg_tcp_kbps, new_avg_tcp_eps as avg_tcp_eps | eval avg_tcp_kbps_sparkline = "N/A" | addinfo | eval status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 900)), "missing", "active") | eval sum_kb = round(sum_kb, 2) | eval avg_tcp_kbps = round(avg_tcp_kbps, 2) | eval avg_tcp_eps = round(avg_tcp_eps, 2) | fields guid, hostname, forwarder_type, version, arch, os, status, last_connected, sum_kb, avg_tcp_kbps_sparkline, avg_tcp_kbps, avg_tcp_eps | outputlookup dmc_forwarder_assets
So what is the real take-away here? Let's pull out the relevant parts:
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=* | stats max(_time) as last_connected, sum(kb) as sum_kb by guid, hostname | addinfo | eval status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 900)), "missing", "active") | where status="missing"
where we can tweak the value 900 (15 minutes) to whatever we want.
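Putting the eval into plain terms: a forwarder is flagged when it sent no data in the window, or when its last connection predates the newest event in the window by more than the threshold. A minimal Python sketch of the same rule (the function name and sample timestamps are my own; the field names and 900-second default mirror the SPL above):

```python
def forwarder_status(sum_kb, last_connected, info_max_time, threshold=900):
    """Mirror of the SPL eval: 'missing' if the forwarder sent no data in
    the window (sum_kb null/zero) or its last connection is more than
    `threshold` seconds older than the newest event time (info_max_time)."""
    if sum_kb is None or sum_kb <= 0 or last_connected < (info_max_time - threshold):
        return "missing"
    return "active"

now = 1_700_000_000  # stand-in for info_max_time
print(forwarder_status(12.5, now - 1200, now))  # last seen 20 min ago -> missing
print(forwarder_status(12.5, now - 300, now))   # last seen 5 min ago  -> active
```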
Is there a way for this search to keep showing a host if I set the relative time range to the last 10 minutes (for alerting)? A host's last connection time might have been an hour ago, which would satisfy the info_max_time - 60 test, but its events would no longer fall within the last 10 minutes of the relative search. I need it to keep showing up when running the search above.
index=_internal sourcetype=splunkd group=tcpin_connections NOT eventType=* | stats max(_time) as last_connected, sum(kb) as sum_kb by guid, hostname | addinfo | eval "Source Host" = hostname | eval ttnow = now() | eval Current = strftime(ttnow,"%m-%d-%Y %H:%M:%S") | eval Status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 60)), "Not Reachable", "active") | eval "Last Connected" = strftime(last_connected,"%m-%d-%Y %H:%M:%S") | where Status = "Not Reachable" | table "Source Host" "Last Connected" Current Status
Say the computer went offline at 1300; anytime between 1300 and 1310 it will show up in the result set and the alert triggers fine. But when the alert runs after 1320, it won't show up, because the last-connected events fall in the previous 10 minutes, not the current 10-minute window. If I change the alert to a 60-minute window, it will work until those events fall out of the 60-minute window. What I want is to keep getting the alerts until the forwarder stops being "missing". I hope that wasn't too confusing?
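Restating the problem: stats only groups events that fall inside the search window, so a host whose last tcpin_connections event is older than the window produces no row at all, rather than a "missing" row. A small sketch with made-up hosts and epoch seconds (13:00 = 46800):

```python
# Each tuple is (hostname, event_time). fwd-a goes silent at 13:00 (46800);
# fwd-b keeps reporting every 5 minutes.
events = (
    [("fwd-a", t) for t in (46200, 46500, 46800)]          # silent after 13:00
    + [("fwd-b", t) for t in range(46200, 48600, 300)]     # still reporting
)

def hosts_in_window(events, earliest, latest):
    """Mimic `stats ... by hostname` over a relative time window: only hosts
    with at least one event inside the window appear in the results."""
    return {host for host, t in events if earliest <= t <= latest}

# Window 13:00-13:10 still sees fwd-a's last event, so it can be flagged:
print(hosts_in_window(events, 46800, 47400))
# Window 13:10-13:20 has no fwd-a events, so fwd-a silently disappears:
print(hosts_in_window(events, 47400, 48000))
```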
@rfiscus - I'm not positive I'm following, but if I understand correctly, it should still show up after 1320 if it is still disconnected, because the Status field has multiple conditions:
Status = if(isnull(sum_kb) or (sum_kb <= 0) or (last_connected < (info_max_time - 60)), "Not Reachable", "active")
So if it is really disconnected, the sum_kb field should be zero and the forwarder should show up again on the next scheduled run of the alert / search.
Ultimately, regarding your comment about the 60-minute window versus the 10-minute window: unless you want to embed a bunch of smaller versions of this search within itself and do some _time bucketing, you'll only have the granularity of the earliest and latest times in the main search. So if the search above ran every hour, and you set the logic to last_connected < (info_max_time - 3600), then you would only have one hour of granularity - that is, if a forwarder went offline at 13:27, and your alert runs every hour on the hour, then you'd have to wait until 14:00 for your alert to run again and find out that a forwarder went offline. If you wanted to know exactly when machines were going off and on, maybe you could explore switching the stats to a timechart so you can see those finer time-slices.
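The worst-case detection delay follows directly from the schedule. A simplified sketch of just that scheduling arithmetic (ignoring the window/threshold logic), assuming runs fire on exact multiples of the period:

```python
import math

def detection_time(offline_at, schedule_period):
    """Epoch time of the first scheduled run at or after the forwarder goes
    offline, assuming runs fire on exact multiples of schedule_period."""
    return math.ceil(offline_at / schedule_period) * schedule_period

# Forwarder drops at 13:27; with hourly runs it is first noticed at the
# 14:00 run, i.e. up to a full hour after the fact.
offline = 13 * 3600 + 27 * 60
print(detection_time(offline, 3600) // 3600)  # 14 (i.e. the 14:00 run)
```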
You could use the alerting feature included in the Distributed Management Console and define an alert for the forwarders there.
Check http://docs.splunk.com/Documentation/Splunk/6.2.0/Admin/Platformalerts for details.
If you need the alerts to show up in a Splunk index, you could define a script as the alerting action and feed the script output back into Splunk.
But why would you need that? You can use the alert manager as well to monitor this: http://docs.splunk.com/Documentation/Splunk/6.2.0/Alert/Setupalertactions