<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Correct way of testing forwarder_site_failover feature in Splunk Enterprise</title>
    <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559695#M6424</link>
    <description>&lt;P&gt;I see that you have indexer discovery configured, which is great. When you have this configured, the forwarder will poll the master for a list of available indexers and it will use that list until the next polling cycle.&lt;/P&gt;&lt;P&gt;In order to test the failover using the iptables method you posted about, you would have to wait until the polling cycle expires. Only then would the forwarder get a new list of indexers from the master.&lt;/P&gt;&lt;P&gt;The polling frequency is based on an algorithm which is detailed in the documentation below (or you can set it manually).&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_polling" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_polling&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 15 Jul 2021 20:37:57 GMT</pubDate>
    <dc:creator>codebuilder</dc:creator>
    <dc:date>2021-07-15T20:37:57Z</dc:date>
    <item>
      <title>Correct way of testing forwarder_site_failover feature</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559653#M6414</link>
      <description>&lt;P&gt;We have a multi-site installation of Splunk and would like to test whether forwarder_site_failover is working properly. In the forwarder's outputs.conf we have the following:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[indexer_discovery:master1]
pass4SymmKey = $secretstuff$
master_uri = https://yadayada:8089

[tcpout:group1]
indexerDiscovery = master1
useACK = false
clientCert = /opt/splunk/etc/auth/certs/s2s.pem
sslRootCAPath = /opt/splunk/etc/auth/certs/ca.crt

[tcpout]
forceTimebasedAutoLB = true
autoLBFrequency = 30
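# note (assumption, not shown in the original post): for site-aware indexer
# discovery in a multisite cluster, the forwarder's server.conf is also
# expected to declare its site, e.g.
#   [general]
#   site = site1
# so the master can hand back (and fail over) a site-filtered peer list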
defaultGroup = group1&lt;/LI-CODE&gt;&lt;P&gt;As far as the yadayada clustermaster server goes, we have the following config:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     [clustering]
(...)
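# note (editorial assumption based on the documented syntax): each
# site1:site2 mapping below is one-way, so a second comma-separated entry
# such as site2:site1 would presumably be needed for the reverse direction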
/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     forwarder_site_failover = site1:site2&lt;/LI-CODE&gt;&lt;P&gt;One thing I was trying to figure out was whether we need to explicitly set site2:site1, or if the existing configuration is enough for failing over both from site1 to site2 and from site2 to site1.&lt;BR /&gt;&lt;BR /&gt;My approach was to cut the connection between the forwarder and the site1 indexers by setting iptables rules on the indexers that DROP connections from the forwarder.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;#e.g. iptables rule
iptables -I INPUT 1 -s &amp;lt;forwarder ip&amp;gt; -p tcp --dport 9997 -j DROP
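#(sketch) to verify the rule is actually matching, watch its packet counters:
iptables -L INPUT -v -n --line-numbers

#and to restore connectivity after the test, delete the rule again:
iptables -D INPUT -s &amp;lt;forwarder ip&amp;gt; -p tcp --dport 9997 -j DROP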

#forwarder splunkd.log
07-15-2021 16:20:41.729 +0000 WARN  TcpOutputProc - Cooked connection to ip=&amp;lt;site1 indexer ip&amp;gt;:9997 timed out&lt;/LI-CODE&gt;&lt;P&gt;The iptables rules didn't make the forwarder fail over, so I wonder if the failover process only kicks in when the clustermaster loses visibility of the indexers. In a live setup that seems riskier and less flexible.&lt;BR /&gt;&lt;BR /&gt;What is the recommended approach to perform this kind of testing?&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jul 2021 20:04:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559653#M6414</guid>
      <dc:creator>PT_crusher</dc:creator>
      <dc:date>2021-07-15T20:04:07Z</dc:date>
    </item>
    <item>
      <title>Re: Correct way of testing forwarder_site_failover feature</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559695#M6424</link>
      <description>&lt;P&gt;I see that you have indexer discovery configured, which is great. When you have this configured, the forwarder will poll the master for a list of available indexers and it will use that list until the next polling cycle.&lt;/P&gt;&lt;P&gt;In order to test the failover using the iptables method you posted about, you would have to wait until the polling cycle expires. Only then would the forwarder get a new list of indexers from the master.&lt;/P&gt;&lt;P&gt;The polling frequency is based on an algorithm which is detailed in the documentation below (or you can set it manually).&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_polling" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_polling&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jul 2021 20:37:57 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559695#M6424</guid>
      <dc:creator>codebuilder</dc:creator>
      <dc:date>2021-07-15T20:37:57Z</dc:date>
    </item>
    <item>
      <title>Re: Correct way of testing forwarder_site_failover feature</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559824#M6434</link>
      <description>&lt;P&gt;So I followed the link and explicitly set the &lt;EM&gt;polling_rate&lt;/EM&gt; to 10 on the cluster master. This should give me a polling cycle interval of roughly 4 minutes.&lt;/P&gt;&lt;P&gt;Regardless, I just see the logs filling up with&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;07-16-2021 15:55:33.324 +0100 WARN TcpOutputProc - Cooked connection to ip=&amp;lt;multiple site1 indexer IPs&amp;gt;:9997 timed out&lt;/LI-CODE&gt;&lt;P&gt;If I bump the TcpOutputProc logs to DEBUG, the state machine is transitioning between&lt;/P&gt;&lt;PRE&gt;Connector::runCookedStateMachine in state=eInit&lt;/PRE&gt;&lt;P&gt;and&lt;/P&gt;&lt;PRE&gt;Connector::runCookedStateMachine in state=eV3ConnectTimedOut&lt;/PRE&gt;&lt;P&gt;but still, all attempts are made to site1; I don't think the failover is triggered at all.&lt;/P&gt;&lt;P&gt;If we change the forwarder's server.conf and set the site to site2, it is able to send logs via the site2 indexers &lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jul 2021 15:00:14 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559824#M6434</guid>
      <dc:creator>PT_crusher</dc:creator>
      <dc:date>2021-07-16T15:00:14Z</dc:date>
    </item>
    <item>
      <title>Re: Correct way of testing forwarder_site_failover feature</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559853#M6435</link>
      <description>&lt;LI-CODE lang="markup"&gt;07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - Connector::runCookedStateMachine in state=eV3ConnectDone for &amp;lt;site2 indexer&amp;gt;:9997
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState Ping failover connection to idx=&amp;lt;site2 indexer&amp;gt;:9997 successful.
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState enable the failover to 4
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:07:58.945 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:04.754 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.913 +0000 WARN TcpOutputProc - Cooked connection to ip=&amp;lt;site1 indexer&amp;gt;:9997 timed out
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up&lt;/LI-CODE&gt;&lt;P&gt;After setting &lt;EM&gt;IndexerDiscoveryHeartbeatThread&lt;/EM&gt; to DEBUG I confirmed my suspicion: the forwarder is not failing over because it thinks site1 is up.&lt;/P&gt;&lt;P&gt;netstat -tunap | grep "&amp;lt;ip of the forwarder VM&amp;gt;" &amp;lt;------- no ESTABLISHED connections on any of the site1 indexers&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jul 2021 17:13:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/559853#M6435</guid>
      <dc:creator>PT_crusher</dc:creator>
      <dc:date>2021-07-16T17:13:48Z</dc:date>
    </item>
    <item>
      <title>Re: Correct way of testing forwarder_site_failover feature</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/561919#M9596</link>
      <description>&lt;P&gt;Turns out it is up to the clustermaster to lose connectivity to the indexers.&lt;/P&gt;&lt;P&gt;The clustermaster will then instruct the forwarder to fail over.&lt;/P&gt;&lt;P&gt;Hope it helps&lt;/P&gt;</description>
      <pubDate>Tue, 03 Aug 2021 16:05:03 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Correct-way-of-testing-forwarder-site-failover-feature/m-p/561919#M9596</guid>
      <dc:creator>PT_crusher</dc:creator>
      <dc:date>2021-08-03T16:05:03Z</dc:date>
    </item>
  </channel>
</rss>

