
How do I tell if a forwarder is down?

Alan_Bradley
Path Finder

How can you differentiate between a forwarder being down and a forwarder simply having no data to send? That is, is there a heartbeat I can tap into?

1 Solution

matt
Splunk Employee

If you have deployed a number of Splunk forwarders, you might not notice when one of them goes out of service, because the other forwarders are still pushing data to Splunk. You can run the following search to detect forwarders that have reported in the last 24 hours but not in the last 2 minutes. It relies on the forwarder heartbeat, which is a feature of Splunk versions 3.2 and later.

index=_internal sourcetype="fwd-hb" starthoursago=24 | dedup host | eval age = strftime("%s","now") - _time | search age > 120 age < 86000

You can set this search up as an alert that runs every few minutes, so that Splunk notifies you when any of your active forwarders has not responded in the last 2 minutes.
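For instance, a minimal savedsearches.conf sketch for scheduling it (the stanza name and email address are placeholders; adjust the schedule and thresholds to your environment):

[Forwarder heartbeat missing]
search = index=_internal sourcetype="fwd-hb" starthoursago=24 | dedup host | eval age = strftime("%s","now") - _time | search age > 120 age < 86000
enableSched = 1
cron_schedule = */5 * * * *
counttype = number of events
relation = greater than
quantity = 0
action.email = 1
action.email.to = ops@example.com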

If you're running a version of Splunk later than 3.3, the heartbeat message is no longer sent. Use the following search instead:

index=_internal "group=tcpin_connections" | stats max(_time) as latest by sourceHost | eventstats max(latest) as latest_all | eval lag = latest_all - latest | where lag > 120 | fields sourceHost lag

The following search works in 3.4.5 and finds all hosts that haven't sent a message in the last 24 hours:

| metadata type=hosts | eval age = strftime("%s","now") - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime

and in 4.0:

| metadata type=hosts | eval age = now() - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime

Another 4.0 variant:

| metadata type=hosts | sort recentTime desc | convert ctime(recentTime) as Recent_Time
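Note that lastTime is the latest event timestamp for the host, while recentTime is the index time of the most recently received event. For liveness checks, recentTime is usually the safer field, since events carrying skewed future timestamps can inflate lastTime. A sketch along the same lines as the searches above:

| metadata type=hosts | eval age = now() - recentTime | where age > 86400 | sort - age | convert ctime(recentTime) | fields host, age, recentTime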

Caveat: many of these methods do not account for decommissioned hosts, which you are bound to have after a length of time. Those hosts also fit the criteria, so they will show up in the search results. Incorporating a host tag ('decommissioned', etc.) into the search can help, but it requires you to tag known hosts that are no longer valid; one possible approach is sketched below.
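One way to do that filtering is with a lookup rather than a tag (tags apply to events, not to metadata results). Assuming you maintain a hypothetical lookup file decommissioned_hosts.csv with a host column listing retired hosts:

| metadata type=hosts | eval age = now() - lastTime | where age > 86400 | search NOT [| inputlookup decommissioned_hosts.csv | fields host] | sort - age | convert ctime(lastTime) | fields age, host, lastTime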


rameshyedurla
Explorer

try this; it finds hosts that have reported since yesterday but not within the last hour:

| metadata type=hosts | eval lastHour=relative_time(now(),"-1h@h") | eval yesterday=relative_time(now(),"-1d@d") | where (recentTime>yesterday AND recentTime<lastHour)

joseluisrespeto
Explorer

Good morning,

How can I check that the forwarders are sending their logs correctly? I am seeing the following error in my logs:

"eventType=connect_fail" in metrics.log

metrics.log:12-17-2014 09:38:48.529 +0100 INFO StatusMgr - destHost=10.26.XX.XX, destIp=10.26.XX.XX, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor

This event means the logs are not being sent correctly, so I need to know whether there is any option I can use at run time to check whether Splunk is sending the data to the server.

Also, can I resolve this issue with some configuration parameter? It only appears at certain times; it is not constant.

Thanks and regards.
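One way to keep an eye on these failures (an illustrative search, leaning on the key=value fields visible in the metrics.log event above) is to count them per forwarder and destination:

index=_internal source=*metrics.log eventType=connect_fail | stats count, latest(_time) as latest by host, destHost, destPort | convert ctime(latest)

A steady stream of connect_fail events generally means the forwarder cannot reach the configured receiver on destPort, so checking network connectivity and that the indexer is actually listening on that port is a reasonable first step.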


marcoscala
Builder

Hi All!

I just ran another test and changed the logic a bit.

I was looking for forwarders that have not been sending data for more than, say, 2 minutes. Here's my latest version:

index=_internal "group=tcpin_connections" | stats max(_time) as latest by sourceHost | eval nowtime = now() | eval lag = (nowtime - latest)/60 | where lag > 2 | fields sourceHost latest lag

If you prefer, you can of course work in seconds instead of minutes 🙂
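If you want the result to double as a dashboard panel, a variant of the same logic that rounds the lag and shows a human-readable last-seen time might look like this:

index=_internal "group=tcpin_connections" | stats max(_time) as latest by sourceHost | eval lag = round((now() - latest)/60, 1) | where lag > 2 | convert ctime(latest) | fields sourceHost, latest, lag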

Marco


marcoscala
Builder

I'm going to test this later on 4.1.3 and let you know. I need to provide our customer with a dashboard to monitor all the remote forwarders at a glance.

Marco


marcoscala
Builder

Matt, I've tried the following search with Splunk 4.1.3:

"| metadata type=hosts | eval age = strftime("%s","now") - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime"

and I got the following error:

"Error in 'eval' command: Typechecking failed. '-' only takes numbers."

Marco
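For anyone hitting the same error: on 4.x, strftime("%s","now") no longer yields a number, so eval's '-' fails typechecking. The now()-based version of the search given earlier avoids the problem:

| metadata type=hosts | eval age = now() - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime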
