I'm a Splunk PS admin working at a client site and I wanted to post a challenge and resolution that we encountered.
Problem:
Client reported missing knowledge objects in the private area of a custom app; they expected ~40 reports but could see only ~17. The reports had last been used 7 days prior. The client asked Splunk PS to investigate.
Environment:
3-instance SHC
Version 8.2.3, Linux
>15 Indexers
>50 users across the platform
Troubleshooting Approach:
- Verified that the given knowledge objects (KO's) had not been deleted: a simple SPL search of index="_audit" for the app, covering the last 10 days of logs, showed no suggestion or evidence of deletion.
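- For reference, a keyword search of roughly this shape (time range and terms illustrative, app name a placeholder) was enough to rule deletions out:
- index=_audit earliest=-10d "<custom_app>" savedsearches | table _time user action info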
- Via the CLI, changed to the app's directory and counted the stanzas in savedsearches.conf; the count was 17
- grep -c '^\[' savedsearches.conf
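- Since these were private KO's, the file in question sits in the user context, e.g. $SPLUNK_HOME/etc/users/<user>/<app>/local/savedsearches.conf (user and app are placeholders); the same count can be run against the full path:
- grep -c '^\[' $SPLUNK_HOME/etc/users/<user>/<app>/local/savedsearches.conf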
- Connected to a second SHC member and repeated the commands; the count was 44. Verified the third member as well, where the count was also 44.
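- As a side note, a quick loop from a host with SSH access to the members (hostnames and path are placeholders) makes the per-member comparison easy:
- for h in sh1 sh2 sh3; do echo -n "$h: "; ssh $h "grep -c '^\[' /opt/splunk/etc/users/<user>/<app>/local/savedsearches.conf"; done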
- Conclusion: the member with 17 saved searches was clearly out of sync and was missing recent KO's.
- Checked captaincy with ./splunk show shcluster-status --verbose; all appeared correct.
- The member with the missing objects was the current captain, and out_of_sync_node : 0 was reported on all three instances in the cluster.
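- To spot-check just that flag, the verbose output can be grepped on each member:
- ./splunk show shcluster-status --verbose | grep out_of_sync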
Remediation:
- Verified the Monitoring Console: no alerts, health-check issues, or other evidence of errors.
- Created a backup of this user's savedsearches.conf (on one instance):
- cp savedsearches.conf savedsearches.bak
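- A timestamped copy is a safer habit if the step has to be repeated:
- cp savedsearches.conf savedsearches.conf.$(date +%Y%m%d-%H%M).bak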
- Following the Splunk Docs ("SHC: perform a manual resync"), we moved the captain to an instance with the correct number of KO's:
- ./splunk transfer shcluster-captain -mgmt_uri https://<server>:8089
- Carefully issued the destructive command on the out-of-sync instance (it discards that member's local copy of the replicated configuration and pulls down the captain's copy):
- ./splunk resync shcluster-replicated-config
- Repeated this for the second SHC member
- Repeated the checks; all three members were now in sync.
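- The quickest confirmation was to re-run the stanza count on each member (path as above; the expected count comes from the healthy members):
- grep -c '^\[' $SPLUNK_HOME/etc/users/<user>/<app>/local/savedsearches.conf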
Post works:
- We were unable to locate a release-notes item suggesting this is a known bug.
- There had previously been a period of downtime for the out-of-sync member; its splunkd had stopped following a push from the Deployer.
- Still no alerts in the MC, and none of the log messages the docs say to look for, e.g.:
- "Error pulling configurations from the search head cluster captain; consider performing a destructive configuration resync on this search head cluster member."
Conclusions:
- The cluster was silently out of sync, with nothing to surface it
- Many KO's across multiple apps would have been affected
- Follow the Splunk Docs
- Recommended the client upgrade to the latest 9.x release.