Hello,
We are running Splunk version 7.1.3.
We have 2 SHCs connected to our indexers. For one of the SHCs, the SHC members keep flickering between 'Up' and 'Down' status on the 'Indexer Clustering' page.
One of the previous posts suggested increasing 'generation_poll_interval' from 5 to 60 seconds. In our case, 'generation_poll_interval' is at its default of 5 on the members of both SHCs, yet the flickering status only happens for members of one SHC and not the other.
Any further inputs on this behavior would be appreciated.
Thanks
Based on the information you supplied, I suspect that you are running into a split-brain situation.
Search head clustering should include no fewer than 3 nodes.
The three nodes make a "decision" on who should be captain based on "votes".
When you have only two, neither node can form a majority on its own, so it becomes nearly impossible for them to agree on a captain (no quorum), and that will lead to the situation you describe.
I initially read it that way too, but I think the question means 2 separate SH clusters of x nodes each.
Given the minimums you correctly state, that means at least 6 search head members, split across 2 SHCs.
At least that's my assumption.
Yes, that is correct. Though it is technically possible to cluster two nodes, it is not good practice and leads to these types of issues. You need at least 3 nodes per SHC. Otherwise, you'll continue to have split-brain issues.
For the record, split-brain is not unique to Splunk. You'll encounter it in any type of clustering with only two nodes. Two nodes can't establish quorum successfully (more often than not).
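To make the quorum arithmetic concrete, here is a minimal Python sketch. It is not Splunk code, and the function names are made up for illustration; it only assumes that an election needs a strict majority of the configured members.

```python
def majority(members: int) -> int:
    """Smallest number of votes that is more than half of the members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be unreachable while a majority can still vote."""
    return members - majority(members)


# Compare a two-node cluster with three- and five-node clusters.
for n in (2, 3, 5):
    print(f"{n} members: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With two members the majority is also two, so losing contact with either one leaves nobody able to win a vote. With three members, one can drop out and the remaining two still form a majority.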
You should be seeing errors in _internal for the SHC members that are at fault.
Can you post some of the messages you see?
Hi Nick,
So I am seeing the following message for one of the search peers:
ERROR DistributedPeerManagerHeartbeat - Status 502 while sending public key to cluster search peer
WARN DistributedPeerManagerHeartbeat - Send failure while pushing PK to search peer, Connect Timeout
Apparently, the SHC member nodes cannot connect to just this search peer on port 8089. It seems this is the culprit which is causing the fluctuations in the status.
I will get this rectified and see if this alleviates the problem.
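For anyone who wants to check this kind of thing from an SHC member, here is a rough Python sketch that tries a plain TCP connection to port 8089 on each search peer and lists the ones that do not answer. The helper names and the host names in the usage note are hypothetical, and it only tests basic reachability; it does not reproduce the public key push or explain the 502 status.

```python
import socket

MGMT_PORT = 8089  # management port mentioned in this thread


def can_connect(host, port=MGMT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections and DNS failures.
        return False


def find_unreachable(peers, port=MGMT_PORT, timeout=3.0):
    """Return the subset of peers that could not be reached on the given port."""
    return [host for host in peers if not can_connect(host, port, timeout)]
```

Usage would look like `find_unreachable(["idx01.example.com", "idx02.example.com"])`, with your own indexer host names in place of these placeholders. Run it from each SHC member, since the problem may only exist on the path from some members.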
Thanks!
Sounds promising. Good luck
If my answer helped, please consider accepting and/or upvoting so that other members of the community can see it was useful.
As an update, there is a communication issue between the SHC nodes and just one of our 46 indexers.
This seems to be causing the fluctuation in the status.
Thanks for your responses. This has been marked as 'Accepted'.