Splunk Enterprise

Correct way of testing forwarder_site_failover feature

PT_crusher
Explorer

We have a multi-site Splunk installation and would like to test whether the forwarder_site_failover feature works properly. In the forwarder's outputs.conf we have the following:

[indexer_discovery:master1]
pass4SymmKey = $secretstuff$
master_uri = https://yadayada:8089

[tcpout:group1]
indexerDiscovery = master1
useACK = false
clientCert = /opt/splunk/etc/auth/certs/s2s.pem
sslRootCAPath = /opt/splunk/etc/auth/certs/ca.crt

[tcpout]
forceTimebasedAutoLB = true
autoLBFrequency = 30
defaultGroup = group1
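(As a sanity check on our side, not from the docs: you can confirm what the forwarder actually merged for these stanzas with btool, assuming a default install path.)

/opt/splunk/bin/splunk btool outputs list tcpout --debug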

On the yadayada cluster master we have the following config:

/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     [clustering]
(...)
/opt/splunk/etc/apps/clustermaster_base_conf/default/server.conf     forwarder_site_failover = site1:site2

One thing I was trying to figure out was whether I need to explicitly set site2:site1 as well, or whether the existing configuration is enough to fail over both from site1 to site2 and from site2 to site1.
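In case it clarifies the question: my reading of the server.conf spec is that each siteN:siteM pair only maps one direction, so an explicit two-way mapping would look something like this (not verified on our side yet):

[clustering]
forwarder_site_failover = site1:site2,site2:site1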

My approach was to cut the connection between the forwarder and the site1 indexers by adding iptables rules on the indexers that DROP connections from the forwarder.

#e.g. iptables rule
iptables -I INPUT 1 -s <forwarder ip> -p tcp --dport 9997 -j DROP

#forwarder splunkd.log
07-15-2021 16:20:41.729 +0000 WARN  TcpOutputProc - Cooked connection to ip=<site1 indexer ip>:9997 timed out
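(A side note, my assumption rather than anything from the Splunk docs: DROP leaves the forwarder waiting on TCP timeouts, which is why the WARN above takes a while to show up. A REJECT with a TCP reset makes the connection fail immediately, which can tighten the test loop.)

#e.g. alternative iptables rule
iptables -I INPUT 1 -s <forwarder ip> -p tcp --dport 9997 -j REJECT --reject-with tcp-reset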

The iptables rules didn't make the forwarder fail over, so I wonder if the failover process only kicks in when the cluster master loses visibility of the indexers. In a live setup that seems riskier and less flexible.

What is the recommended approach to perform this kind of testing?

 


codebuilder
SplunkTrust

I see that you have indexer discovery configured, which is great. When you have this configured, the forwarder will poll the master for a list of available indexers and it will use that list until the next polling cycle.

In order to test the failover using the iptables method you posted about, you would have to wait until the polling cycle expires. Only then would the forwarder get a new list of indexers from the master.

The polling frequency is based on an algorithm which is detailed in the documentation below (or you can set it manually).

https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/indexerdiscovery#Adjust_the_frequency_of_...
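If you go the manual route, the knob lives on the cluster master; a minimal sketch (stanza and setting per the indexer discovery docs):

# server.conf on the cluster master
[indexer_discovery]
polling_rate = 10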


PT_crusher
Explorer

So I followed the link and explicitly set the polling_rate to 10 on the cluster master. This should give me a polling cycle interval of roughly 4 minutes.

Regardless, I just see the logs filling up with

07-16-2021 15:55:33.324 +0100 WARN TcpOutputProc - Cooked connection to ip=<multiple site1 indexer IPs>:9997 timed out 

If I bump the TcpOutputProc logs to DEBUG, the state machine just transitions between

Connector::runCookedStateMachine in state=eInit

and

Connector::runCookedStateMachine in state=eV3ConnectTimedOut

but still, all attempts go to site1; I don't think the failover is being triggered at all.

If we change the forwarder's server.conf and set the site to site2, it is able to send the logs via the site2 indexers 😕
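For reference, that is the site affinity setting in the forwarder's server.conf, roughly (a sketch assuming multisite defaults):

[general]
site = site2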


PT_crusher
Explorer
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - Connector::runCookedStateMachine in state=eV3ConnectDone for <site2 indexer>:9997
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState Ping failover connection to idx=<site2 indexer>:9997 successful.
07-16-2021 17:07:58.143 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState enable the failover to 4
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:07:58.944 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:07:58.945 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:01.850 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:04.754 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:04.755 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:07.611 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up
07-16-2021 17:08:07.913 +0000 WARN TcpOutputProc - Cooked connection to ip=<site1 indexer>:9997 timed out
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is enabled to fail over
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState current connection status 10
07-16-2021 17:08:10.518 +0000 DEBUG IndexerDiscoveryHeartbeatThread - TcpOutputGroupState is not failing over to the failover site because the primary site is up

 

After setting IndexerDiscoveryHeartbeatThread to DEBUG I confirmed my suspicion: the forwarder is not failing over because it thinks site1 is up.

netstat -tunap | grep "<ip of the forwarder VM>"    # run on the site1 indexers: no ESTABLISHED connections from the forwarder


PT_crusher
Explorer

Turns out it is up to the cluster master to lose connectivity to the indexers.

The cluster master will then instruct the forwarder to fail over.

Hope it helps
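In case anyone wants to reproduce it: what worked for us, as a sketch (default ports assumed, adjust IPs as needed), was to break the indexer heartbeats on the management port rather than the data path on 9997:

#on the cluster master: drop heartbeats from the site1 indexers
iptables -I INPUT 1 -s <site1 indexer ip> -p tcp --dport 8089 -j DROP

#then, still on the cluster master, watch the site1 peers go Down
splunk show cluster-status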
