Splunk Enterprise

Correct way of testing forwarder_site_failover feature

PT_crusher
Explorer

We have a multi-site Splunk installation and would like to test whether the forwarder_site_failover feature works properly. In the forwarder's outputs.conf we have the following:

[indexer_discovery:master1]
pass4SymmKey = $secretstuff$
master_uri = https://yadayada:8089

[tcpout:group1]
indexerDiscovery = master1
useACK = false
clientCert = /opt/splunk/etc/auth/certs/s2s.pem
sslRootCAPath = /opt/splunk/etc/auth/certs/ca.crt

[tcpout]
forceTimebasedAutoLB = true
autoLBFrequency = 30
defaultGroup = group1
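(As a sanity check on our side, not from the docs: you can confirm what the forwarder actually merged for these stanzas with btool, assuming a default install path.)

/opt/splunk/bin/splunk btool outputs list tcpout --debug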

On the yadayada cluster master we have the following config:

/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     [clustering]
(...)
/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     forwarder_site_failover = site1:site2

One thing I was trying to figure out was whether I need to explicitly set site2:site1 as well, or whether the existing configuration is enough to fail over both from site1 to site2 and from site2 to site1.
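In case it clarifies the question: my reading of the server.conf spec is that each siteN:siteM pair only maps one direction, so an explicit two-way mapping would look something like this (not verified on our side yet):

[clustering]
forwarder_site_failover = site1:site2,site2:site1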

My approach was to cut the connection between the forwarder and the site1 indexers by adding iptables rules on the indexers that DROP connections from the forwarder.

#e.g. iptables rule
iptables -I INPUT 1 -s <forwarder ip> -p tcp --dport 9997 -j DROP

#forwarder splunkd.log
07-15-2021 16:20:41.729 +0000 WARN  TcpOutputProc - Cooked connection to ip=<site1 indexer ip>:9997 timed out
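(A side note, my assumption rather than anything from the Splunk docs: DROP leaves the forwarder waiting on TCP timeouts, which is why the WARN above takes a while to show up. A REJECT with a TCP reset makes the connection fail immediately, which can tighten the test loop.)

#e.g. alternative iptables rule
iptables -I INPUT 1 -s <forwarder ip> -p tcp --dport 9997 -j REJECT --reject-with tcp-reset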

The iptables rules didn't make the forwarder fail over, so I wonder if the failover process only kicks in when the cluster master loses visibility of the indexers. In a live setup that seems riskier and less flexible.

What is the recommended approach to perform this kind of testing?

 


codebuilder
SplunkTrust

I see that you have indexer discovery configured, which is great. When you have this configured, the forwarder will poll the master for a list of available indexers and it will use that list until the next polling cycle.

In order to test the failover using the iptables method you posted about, you would have to wait until the polling cycle expires. Only then would the forwarder get a new list of indexers from the master.

The polling frequency is based on an algorithm which is detailed in the documentation below (or you can set it manually).

https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_...
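If you go the manual route, the knob lives on the cluster master; a minimal sketch (stanza and setting per the indexer discovery docs):

# server.conf on the cluster master
[indexer_discovery]
polling_rate = 10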


PT_crusher
Explorer

So I followed the link and explicitly set the polling_rate to 10 on the cluster master. This should give me a polling cycle interval of roughly 4 minutes.

Regardless, I just see the logs filling up with

07-16-2021 15:55:33.324 +0100 WARN TcpOutputProc - Cooked connection to ip=<multiple site1 indexer IPs>:9997 timed out 

If I bump the TcpOutputProc logs to DEBUG, the state machine just transitions between

Connector::runCookedStateMachine in state=eInit

and

Connector::runCookedStateMachine in state=eV3ConnectTimedOut

but still, all attempts go to site1; I don't think the failover is being triggered at all.

If we change the forwarder's server.conf and set the site to site2, it is able to send the logs via the site2 indexers 😕
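For reference, that is the site affinity setting in the forwarder's server.conf, roughly (a sketch assuming multisite defaults):

[general]
site = site2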


PT_crusher
Explorer
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - Connector::runCookedStateMachine in state=eV3ConnectDone for <site2 indexer>:9997
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState Ping failover connection to idx=<site2 indexer>:9997 successful.
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState enable the failover to 4
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:07:58.945 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:04.754 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.913 +0000 WARN TcpOutputProc - Cooked connection to ip=<site1 indexer>:9997 timed out
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up

 

After setting IndexerDiscoveryHeartbeatThread to DEBUG I confirmed my suspicion: the forwarder is not failing over because it thinks site1 is up.

netstat -tunap | grep "<ip of the forwarder VM>"    # run on the site1 indexers: no ESTABLISHED connections from the forwarder


PT_crusher
Explorer

Turns out it is up to the cluster master to lose connectivity to the indexers.

The cluster master will then instruct the forwarder to fail over.

Hope it helps
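In case anyone wants to reproduce it: what worked for us, as a sketch (default ports assumed, adjust IPs as needed), was to break the indexer heartbeats on the management port rather than the data path on 9997:

#on the cluster master: drop heartbeats from the site1 indexers
iptables -I INPUT 1 -s <site1 indexer ip> -p tcp --dport 8089 -j DROP

#then, still on the cluster master, watch the site1 peers go Down
splunk show cluster-status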
