Splunk Enterprise

Why is Search Head Cluster silently out of sync (version 8.2.3)?

NullZero
Path Finder

I'm a Splunk PS admin working at a client site and I wanted to post a challenge and resolution that we encountered.

Problem:
The client reported missing knowledge objects in a custom app's private area; they expected ~40 reports but could only see ~17. The client had last used the reports 7 days prior and asked Splunk PS to investigate.

Environment:
3-instance SHC
Version 8.2.3, Linux
>15 Indexers
>50 users across the platform

Troubleshooting Approach:

  • Verified that the given Knowledge Objects (KOs) had not been deleted: a simple SPL search in index="_audit" for the app over the last 10 days showed no suggestion or evidence of deletion (see the sketch after this list).
  • Via the CLI, changed to the custom app's directory and counted the stanzas in savedsearches.conf; the count was 17.
  • grep -P "^\[" savedsearches.conf | wc -l
  • Switched to an alternative SHC member and repeated the commands; the count was 44. Verified the third member as well, where the count was also 44.
  • Conclusion: the member with 17 saved searches was clearly out of sync and did not have all recent KOs.
  • Checked captaincy with ./splunk show shcluster-status --verbose; all appeared correct.
  • The member with the limited object count was the current captain, and out_of_sync_node : 0 was reported on all three instances in the cluster.
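
As a rough sketch, the checks above looked something like this; the app name and user are placeholders, and the "delete" keyword is only an assumption about what a removal would leave in the audit trail:

  index="_audit" earliest=-10d "<custom_app>" "delete"
  | table _time user action info

  cd $SPLUNK_HOME/etc/users/<user>/<custom_app>/local    # private (user-level) KOs sit under etc/users rather than etc/apps
  grep -P "^\[" savedsearches.conf | wc -l               # count the stanza headers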

Remediation:

  • Verified in the Monitoring Console that there were no alerts, health check issues, or other evidence of errors.
  • Created a backup of this user's savedsearches.conf (on one instance):
  • cp savedsearches.conf savedsearches.bak
  • Following the Splunk Docs (SHC: perform a manual resync), we moved the captaincy to an instance with the correct number of KOs:
  • ./splunk transfer shcluster-captain -mgmt_uri https://<server>:8089
  • Carefully issued the destructive command on the out-of-sync instance:
  • ./splunk resync shcluster-replicated-config
  • Repeated this for the second SHC member.
  • Repeated the checks; all three members were now in sync (the full sequence is sketched below).
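
Putting the remediation steps together, the sequence was roughly as follows; the host name is a placeholder for the healthy member we promoted:

  # confirm cluster state on each member
  ./splunk show shcluster-status --verbose

  # back up the affected conf before anything destructive
  cp savedsearches.conf savedsearches.bak

  # move captaincy to a member holding the full set of KOs
  ./splunk transfer shcluster-captain -mgmt_uri https://<healthy-member>:8089

  # on the out-of-sync member, pull replicated config from the new captain (destructive)
  ./splunk resync shcluster-replicated-config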

Post works:

  • We were unable to locate a release notes item suggesting this is a known bug.
  • There had previously been a period of downtime for the out-of-sync member; its Splunk daemon had stopped following a push from the Deployer.
  • There were still no alerts in the MC, nor the log message the docs say to look for, e.g.:
  • "Error pulling configurations from the search head cluster captain; consider performing a destructive configuration resync on this search head cluster member."
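
If you want to check whether that message ever appeared, a simple search along these lines (assuming default internal logging) should surface it:

  index=_internal sourcetype=splunkd "consider performing a destructive configuration resync"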

Conclusions:

  • The cluster was silently out of sync.
  • Many KOs across multiple apps would have been affected.
  • Follow the Splunk Docs.
  • Recommended that the client upgrade to the latest 9.x version.

murugansplunkin
Engager

That's an awesome explanation @NullZero. We are facing similar issues, but in a slightly different way...

We have a 2-node Search Head Cluster, in which one is a static captain and the other is a member.
Often the non-captain member drops out of the cluster (it does not show in the Search Head Clustering page). Every time, we manually restart Splunk or the member's entire EC2 instance, and then it shows up in the cluster page again.

Can I use the resync command to solve the issue instead of restarting Splunk or the EC2 instance? Will it help?

Thanks for your help 😊


splunkoptimus
Path Finder

Check the logs for connectivity issues. 
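For example, something along these lines on the member that drops out can be a starting point (the host value and keywords are only assumptions, not a definitive search):

  index=_internal sourcetype=splunkd host=<member> log_level=ERROR ("shcluster" OR "captain")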

splunkoptimus
Path Finder

Saved my day😁
