How can you differentiate between a forwarder being down and a forwarder not having any data to send? I.e., is there a heartbeat that I can tap into?
If you have deployed a number of Splunk forwarders and they are all pushing data to Splunk, you might not notice if one of them goes out of service, because the other forwarders are still pushing data to Splunk. You can run the following search to detect forwarders that have been up in the last 24 hours but not in the last 2 minutes. It uses the forwarder heartbeat, which is a feature of Splunk versions 3.2 and later.
index=_internal sourcetype="fwd-hb" starthoursago=24 | dedup host | eval age = strftime("%s","now") - _time | search age > 120 age < 86000
You can set this search up as an alert every several minutes so that Splunk will let you know if any of your active forwarders have not responded in the last 2 minutes.
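To schedule that as an alert, a savedsearches.conf stanza along the following lines should work — the stanza name, schedule, and email address are illustrative placeholders, not values from this thread:

```
# Hypothetical savedsearches.conf stanza -- names, schedule, and address are examples only.
[Forwarder heartbeat missing]
search = index=_internal sourcetype="fwd-hb" starthoursago=24 | dedup host | eval age = strftime("%s","now") - _time | search age > 120 age < 86000
enableSched = 1
cron_schedule = */5 * * * *
counttype = number of events
relation = greater than
quantity = 0
action.email = 1
action.email.to = splunk-admin@example.com
```

This runs the search every five minutes and emails whenever it returns at least one host.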
If you're running a version of Splunk later than 3.3, the heartbeat message is no longer sent. Use the following search instead:
index=_internal "group=tcpin_connections" | stats max(_time) as latest by sourceHost | eventstats max(latest) as latest_all | eval lag = latest_all - latest | where lag > 120 | fields sourceHost lag
The following search works in 3.4.5 and finds all hosts that haven't sent a message in the last 24 hours:
| metadata type=hosts | eval age = strftime("%s","now") - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime
and in 4.0:
| metadata type=hosts | eval age = now() - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime
Another 4.0 variant
| metadata type=hosts | sort recentTime desc | convert ctime(recentTime) as Recent_Time
Caveat: Many of these methods do not account for decommissioned hosts, which you are bound to have after a length of time. These hosts will also show up in the search results, as they also fit the criteria. Incorporating a host tag ('decommissioned', etc.) into this search may help with this, but requires you to tag known hosts that are no longer valid.
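One way to handle that caveat without tagging, assuming you maintain a lookup file of retired hosts (the file name decommissioned_hosts.csv and its host column are hypothetical), is to exclude them with a subsearch:

```
| metadata type=hosts
| search NOT [| inputlookup decommissioned_hosts.csv | fields host]
| eval age = now() - lastTime
| where age > 86400
| convert ctime(lastTime)
| fields host, age, lastTime
```

The subsearch expands into a NOT (host=... OR host=...) filter, so retired hosts drop out of the results as long as the lookup is kept up to date.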
try this
| metadata type=hosts
| eval lastHour=relative_time(now(),"-1h@h")
| eval yesterday=relative_time(now(), "-1d@d")
| where ( recentTime>yesterday AND recentTime<lastHour )
Good morning,
How can I check that the forwarders are sending the logs correctly? I have the following error in my logs:
"eventType=connect_fail" in metrics.log
metrics.log:12-17-2014 09:38:48.529 +0100 INFO StatusMgr - destHost=10.26.XX.XX, destIp=10.26.XX.XX, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
This error means the logs are not being sent correctly, so I need to know whether there is any option in the program execution to check if Splunk is sending the data to the server or not.
Also, can I resolve this issue in the configuration with some parameter? The issue only appears at certain times; it is not constant.
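One way to watch for this from the search head, assuming the forwarders forward their own _internal index (the default for universal forwarders), is to count the connect_fail events per forwarder and destination — a rough monitoring sketch, not a definitive fix:

```
index=_internal sourcetype=splunkd "eventType=connect_fail"
| stats count as failures, max(_time) as latest by host, destHost, destPort
| convert ctime(latest)
| sort - failures
```

A forwarder that shows intermittent failures here but still appears in the tcpin_connections searches above is probably losing its connection occasionally rather than being down outright.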
Thanks and regards.
Hi All!
I just made another test and changed a bit the logic.
I was looking for forwarders that haven't sent data for more than, say, 2 minutes. Here's my latest version:
index=_internal "group=tcpin_connections" | stats max(_time) as latest by sourceHost | eval nowtime = now() | eval lag = (nowtime - latest)/60 | where lag > 2 | fields sourceHost latest lag
If you prefer, you can of course work in seconds instead of minutes 🙂
Marco
I'm going to test this later on 4.1.3 and let you know. I need to provide our customer a dashboard to monitor all the remote forwarders at a glance.
Marco
Matt, I've tried the following search with Splunk 4.1.3:
"| metadata type=hosts | eval age = strftime("%s","now") - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime"
and I got the following error:
"Error in 'eval' command: Typechecking failed. '-' only takes numbers."
Marco