Splunk Enterprise

Why is Search Head Cluster silently out of sync (version 8.2.3)?

NullZero
Path Finder

I'm a Splunk PS admin working at a client site and I wanted to post a challenge and resolution that we encountered.

Problem:
The client reported missing knowledge objects in a custom app's private area; they expected ~40 reports but could only see ~17. The client had last used the reports 7 days prior and asked Splunk PS to investigate.

Environment:
3-instance SHC
Version 8.2.3, Linux
>15 Indexers
>50 users across the platform

Troubleshooting Approach:

  • Verified that the given Knowledge Objects (KOs) had not been deleted: a simple SPL search in index="_audit" for the app over the last 10 days showed no suggestion or evidence of deletion (see the sketch after this list).
  • Via the CLI, changed to the custom app's directory and counted the stanzas in savedsearches.conf; the count was 17.
  • grep -P "^\[" savedsearches.conf | wc -l
  • Switched to an alternative SHC member and repeated the commands; the count was 44. Verified the third member as well, where the count was also 44.
  • Conclusion: the member with 17 saved searches was clearly out of sync and did not have all recent KOs.
  • Checked captaincy with ./splunk show shcluster-status --verbose; all appeared correct.
  • The member with the limited object count was the current captain, and out_of_sync_node : 0 was reported on all three instances in the cluster.
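
As a rough sketch, the checks above looked something like this; the app name and user are placeholders, and the "delete" keyword is only an assumption about what a removal would leave in the audit trail:

  index="_audit" earliest=-10d "<custom_app>" "delete"
  | table _time user action info

  cd $SPLUNK_HOME/etc/users/<user>/<custom_app>/local    # private (user-level) KOs sit under etc/users rather than etc/apps
  grep -P "^\[" savedsearches.conf | wc -l               # count the stanza headers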

Remediation:

  • Verified in the Monitoring Console that there were no alerts, health check issues, or other evidence of errors.
  • Created a backup of this user's savedsearches.conf (on one instance):
  • cp savedsearches.conf savedsearches.bak
  • Following the Splunk Docs (SHC: perform a manual resync), we moved the captaincy to an instance with the correct number of KOs:
  • ./splunk transfer shcluster-captain -mgmt_uri https://<server>:8089
  • Carefully issued the destructive command on the out-of-sync instance:
  • ./splunk resync shcluster-replicated-config
  • Repeated this for the second SHC member.
  • Repeated the checks; all three members were now in sync (the full sequence is sketched below).
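
Putting the remediation steps together, the sequence was roughly as follows; the host name is a placeholder for the healthy member we promoted:

  # confirm cluster state on each member
  ./splunk show shcluster-status --verbose

  # back up the affected conf before anything destructive
  cp savedsearches.conf savedsearches.bak

  # move captaincy to a member holding the full set of KOs
  ./splunk transfer shcluster-captain -mgmt_uri https://<healthy-member>:8089

  # on the out-of-sync member, pull replicated config from the new captain (destructive)
  ./splunk resync shcluster-replicated-config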

Post works:

  • We were unable to locate a release notes item suggesting this is a known bug.
  • There had previously been a period of downtime for the out-of-sync member; its Splunk daemon had stopped following a push from the Deployer.
  • There were still no alerts in the MC, nor the log message the docs say to look for, e.g.:
  • "Error pulling configurations from the search head cluster captain; consider performing a destructive configuration resync on this search head cluster member."
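
If you want to check whether that message ever appeared, a simple search along these lines (assuming default internal logging) should surface it:

  index=_internal sourcetype=splunkd "consider performing a destructive configuration resync"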

Conclusions:

  • The cluster was silently out of sync.
  • Many KOs across multiple apps would have been affected.
  • Follow the Splunk Docs.
  • Recommended that the client upgrade to the latest 9.x version.

murugansplunkin
Engager

That's an awesome explanation @NullZero. We are facing similar issues, but in a slightly different way...

We have a 2-node Search Head Cluster, in which one is a static captain and the other is a member.
Often the non-captain member drops out of the cluster (it does not show in the Search Head Clustering page). Every time, we manually restart Splunk or the member's entire EC2 instance, and then it shows up in the cluster page again.

Can I use the resync command to solve the issue instead of restarting Splunk or the EC2 instance? Will it help?

Thanks for your help 😊


splunkoptimus
Path Finder

Check the logs for connectivity issues. 
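For example, something along these lines on the member that drops out can be a starting point (the host value and keywords are only assumptions, not a definitive search):

  index=_internal sourcetype=splunkd host=<member> log_level=ERROR ("shcluster" OR "captain")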

splunkoptimus
Path Finder

Saved my day😁
