I have this error:
Error pulling configurations from the search head cluster captain (https://192.168.221.101:8089); consider performing a destructive configuration resync on this search head cluster member.
On the machine generating the error (192.168.221.103), I run "splunk resync shcluster-replicated-config". I get the following error:
"ConfReplicationException: Error downloading snapshot: Network-layer error: Winsock error 10054"
In splunkd.log, I get:
"Error in RestConfRepoProxy::fetchFrom, at=: Non-200 status_code=500: refuse request without valid baseline; snapshot exists at op_id=05f70b29f3775768ee85212227c8ecd3983235c8 for repo=https://192.168.221.101:8089"
I have restarted, rebooted and reevaluated my sanity. Suggestions?
I suspect that the problem lies somewhere in the dynamic configuration files.
In search head clustering, things are not static. The search head captain can change over time, the cluster members must have a shared view of artifacts and scheduled jobs, etc.
Therefore, there is a fair amount of configuration and state information that is not stored in the traditional etc directories - etc contains only static configuration files.
In Splunk, all of the directories that can grow in size are located under var:
.../var/log/splunk contains Splunk's internal logs
.../var/lib/splunk is the default location for Splunk indexes
.../var/run contains a lot of state information and some configuration information, including search job results, search bundles, and search head cluster info.
Some of the search head cluster info is also in memory.
Someone with more knowledge can probably explain it better and give you a better set of diagnostics and remedies. But when I need to take a hammer to a search head cluster member, I usually:
1 - look at the internal logs on the SHC member in case I am missing something obvious
2 - attempt a rolling restart of the search head cluster
3 - stop the SHC member
4 - remove all the files and directories from the var/run directory (this is the hammer)
5 - restart the SHC member
6 - do a manual resync if needed with the rest of the cluster
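Steps 2 through 6 can be sketched as a shell session. This is only a sketch: $SPLUNK_HOME is assumed to point at your Splunk install, and the exact var/run path may differ on your version, so double-check before deleting anything.

```shell
# Step 2 - rolling restart; run this on the captain or any healthy member
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members

# If the member is still broken, continue on the broken member itself
# (192.168.221.103 in this thread):

# Step 3 - stop the member
$SPLUNK_HOME/bin/splunk stop

# Step 4 - the hammer: remove the dynamic state under var/run
rm -rf $SPLUNK_HOME/var/run/*

# Step 5 - bring the member back up
$SPLUNK_HOME/bin/splunk start

# Step 6 - manual resync against the captain, if still needed
$SPLUNK_HOME/bin/splunk resync shcluster-replicated-config
```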
Just remember that this is truly taking a hammer to that instance: any saved search results will be lost, as well as the results of any in-flight searches. This may not be acceptable in a production environment. But if the search head is well and truly broken...
This does not overlap with "splunk clean all", so there may be cases where you do both. (Personally, I just inserted "splunk clean all" after step 3 in my procedure.) In either case, the SHC member should rebuild all of its configuration files (static and dynamic) by syncing with the deployer and the search head captain once it comes back up.
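With the clean inserted after step 3, the middle of the procedure looks like this on the broken member. Note this wipes indexed data too, making it even more destructive than clearing var/run alone; the -f flag suppresses the confirmation prompt, so be sure you are on the right host.

```shell
# Step 3 - stop the member
$SPLUNK_HOME/bin/splunk stop

# Optional extra hammer - wipe indexes and other local data
$SPLUNK_HOME/bin/splunk clean all -f

# Step 4 - clear the dynamic state under var/run
rm -rf $SPLUNK_HOME/var/run/*

# Step 5 - restart; the member rebuilds config from the deployer and captain
$SPLUNK_HOME/bin/splunk start
```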