Solved: SHC member nodes status flickering on the 'indexer...

rahul_bhatia · ‎02-17-2020

Hello,

We are running Splunk version 7.1.3.

We have 2 SHCs connected to our indexers. For one of the SHCs, the SHC members keep flickering between 'Up' and 'Down' status on the 'Indexer Clustering' page.

One of the previous posts suggested increasing 'generation_poll_interval' from 5 to 60 seconds. In our case, 'generation_poll_interval' is at its default of 5 on the members of both SHCs, yet the status flickers only for members of one SHC and not the other.

Any further inputs on this behavior would be appreciated.

Thanks


codebuilder · ‎02-17-2020

Based on the information you supplied, I suspect that you are running into a split-brain situation.
Search head clustering should include no fewer than 3 nodes.
The three nodes make a "decision" on who should be captain based on "votes".
With only two, it becomes nearly impossible for them to agree on and elect a leader (quorum), which leads to the situation you describe.

----
An upvote would be appreciated and Accept Solution if it helps!

nickhills · ‎02-17-2020

I initially read it that way too, but I think the question means 2 separate SH clusters of x nodes.
Given the minimums you correctly state, that means at least 6 search head members, split across 2 SHCs.
At least that's my assumption.

If my comment helps, please give it a thumbs up!

codebuilder · ‎02-17-2020

Yes, that is correct. Though it is technically possible to cluster two nodes, it is not good practice and leads to these types of issues. You need at least 3 nodes per SHC. Otherwise, you'll continue to have split-brain issues.


codebuilder · ‎02-17-2020

For the record, split-brain is not unique to Splunk. You'll encounter it in any type of clustering with only two nodes. Two nodes can't establish quorum successfully (more often than not).
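The arithmetic behind that point can be sketched as follows. This is a minimal illustration that assumes a simple-majority election rule (more than half of the configured members must agree); the function names are made up for this example and are not part of any Splunk API.

```python
def majority(members: int) -> int:
    """Smallest number of votes that is more than half of the members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - majority(members)


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(
            f"{n} members: majority={majority(n)}, "
            f"tolerated failures={tolerated_failures(n)}"
        )
```

Under this assumption, a two-member cluster needs both members to agree and so cannot lose either one, while a three-member cluster can lose one member and still elect a captain.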


nickhills · ‎02-17-2020

You must be seeing errors in _internal for the SHC members which are at fault.
Can you post some of the messages you see?


rahul_bhatia · ‎02-17-2020

Hi Nick,

So I am seeing the following message for one of the search peers:

ERROR DistributedPeerManagerHeartbeat - Status 502 while sending public key to cluster search peer
WARN DistributedPeerManagerHeartbeat - Send failure while pushing PK to search peer, Connect Timeout

Apparently, the SHC member nodes cannot connect to just this search peer on port 8089. It seems this is the culprit which is causing the fluctuations in the status.
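For anyone wanting to confirm the same kind of connectivity problem, a plain TCP check run from each SHC member is one way to do it. The sketch below is only an illustration: the hostnames are placeholders (not our real peers), and it only tests that a TCP connection to port 8089 can be opened, not that splunkd answers correctly.

```python
import socket


def can_connect(host: str, port: int = 8089, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical search peer hostnames; replace with the real indexers.
    peers = ["idx01.example.com", "idx02.example.com"]
    for peer in peers:
        state = "reachable" if can_connect(peer) else "NOT reachable"
        print(f"{peer}:8089 {state}")
```

Running it from every member of the affected SHC and comparing against a member of the healthy SHC should show whether the failure is specific to one peer.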

I will get this rectified and see if this alleviates the problem.

Thanks!

nickhills · ‎02-17-2020

Sounds promising. Good luck


nickhills · ‎02-26-2020

If my answer helped, please consider accepting and/or upvoting so that other members of the community can see it was useful.


rahul_bhatia · ‎02-27-2020

As an update, there is a communication issue between the SHC nodes and just one of our 46 indexers.

This seems to be causing the fluctuation in the status.

Thanks for your responses. This has been marked as 'Accepted'.
