I have this error:
Error pulling configurations from the search head cluster captain (https://192.168.221.101:8089); consider performing a destructive configuration resync on this search head cluster member.
On the machine generating the error (192.168.221.103), I run "splunk resync shcluster-replicated-config". I get the following error:
"ConfReplicationException: Error downloading snapshot: Network-layer error: Winsock error 10054"
In splunkd.log, I get:
"Error in RestConfRepoProxy::fetchFrom, at=: Non-200 status_code=500: refuse request without valid baseline; snapshot exists at op_id=05f70b29f3775768ee85212227c8ecd3983235c8 for repo=https://192.168.221.101:8089"
I have restarted, rebooted and reevaluated my sanity. Suggestions?
I would try a rolling restart of the SHC members first (if possible), and then attempt a destructive resync only if that fails.
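For reference, a rolling restart can be kicked off from the CLI on any cluster member (the captain coordinates it). This is a sketch; `$SPLUNK_HOME` is whatever your install path is:

```shell
# Run on any SHC member; the captain restarts members one at a time
# so search availability is preserved.
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members
```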
Already tried that and it failed. I have now removed the offending node from the cluster, run a "splunk clean all", and re-added it. It appears to be functioning properly now.
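For anyone hitting the same wall, the remove/clean/re-add sequence looks roughly like the following. This is a sketch, not an exact transcript of what I ran: the replication port, deployer URL, and cluster secret are placeholders you must swap for your own values.

```shell
# All commands run on the broken member (192.168.221.103 in my case).

# 1. Remove this member from the cluster.
$SPLUNK_HOME/bin/splunk remove shcluster-member

# 2. Wipe local state. Destructive: this clears indexes, logs, and all
#    dynamic state on this instance. splunk must be stopped first.
$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean all

# 3. Re-initialize the SHC settings and rejoin via a healthy member.
#    Port, deployer URL, and secret below are placeholders.
$SPLUNK_HOME/bin/splunk start
$SPLUNK_HOME/bin/splunk init shcluster-config \
    -mgmt_uri https://192.168.221.103:8089 \
    -replication_port 34567 \
    -conf_deploy_fetch_url https://deployer.example:8089 \
    -secret <shcluster_key>
$SPLUNK_HOME/bin/splunk restart
$SPLUNK_HOME/bin/splunk add shcluster-member \
    -current_member_uri https://192.168.221.101:8089
```

After rejoining, the member pulls a fresh baseline from the captain, which is presumably why the error went away.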
Good that your issue is resolved, bad that it had to be done that way (well, my method was not that great either). I hope someone can suggest something cleaner than this.
I suspect that the problem was somewhere in the dynamic configuration files.
In search head clustering, things are not static. The search head captain can change over time, the cluster members must have a shared view of artifacts and scheduled jobs, etc.
Therefore, there is a fair amount of configuration and state information that is not stored in the traditional etc directories - etc contains only static configuration files.
In Splunk, all of the directories that can grow in size are located under var:
.../var/log/splunk contains Splunk's internal logs.
.../var/lib/splunk is the default location for Splunk indexes.
.../var/run contains a lot of state information and some configuration information, including the results of search jobs, the search bundles, and search head cluster info.
Some of the search head cluster info is also in memory.
Someone with more knowledge can probably explain it better, and give you a better set of diagnostics and remedies. But when I need to take a hammer to a search head cluster member, I usually
1 - look at the internal logs on the SHC member in case I am missing something obvious
2 - attempt a rolling restart of the search head cluster
3 - stop the SHC member
4 - remove all the files and directories from the var/run directory (this is the hammer)
5 - restart the SHC member
6 - do a manual resync if needed with the rest of the cluster
Just remember that this is truly taking a hammer to that instance. Any saved search results will be lost, as well as any searches currently in flight. This may not be an acceptable thing to do in a production environment. But if the search head is well and truly broken...
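Steps 3 through 6 above can be sketched as shell commands. This assumes a default install layout; adjust `$SPLUNK_HOME` for your environment:

```shell
# Run on the broken SHC member only.
$SPLUNK_HOME/bin/splunk stop

# The hammer: wipe the dynamic state under var/run (search job results,
# search bundles, SHC snapshot info). Saved/in-flight results are lost.
rm -rf "$SPLUNK_HOME/var/run/"*

$SPLUNK_HOME/bin/splunk start

# If the member still complains about a missing baseline, force a
# destructive resync against the captain.
$SPLUNK_HOME/bin/splunk resync shcluster-replicated-config
```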
This does not fully overlap with "splunk clean all", so there may be cases where you do both. (Personally, I just inserted "splunk clean all" after step 3 in my procedure.) Either way, the SHC member should rebuild all of its configuration files (static and dynamic) by syncing with the deployer and the search head captain once it comes back up.
Still having this problem occur randomly on all three of my clustered search heads. No clue as to the cause. It seems to come and go at its own leisure, with no rhyme or reason.