We have a search head cluster (SHC) environment and are seeing the following error:
"Gave up waiting for the captain to establish a common bundle version across all search peers; using most recent bundles on all peers instead"
After some research and looking through the Answers site, it seems this could be caused by an inconsistent distsearch.conf on some of the search heads in the cluster. So I removed all the values from the servers key in distsearch.conf on every search head in the cluster and restarted Splunk, but immediately after the restart the changes were overwritten and the old distsearch.conf was restored. We are not deploying this file with these changes through the deployer.
The following was done (multiple times) on each search head in the cluster (IPs masked for security purposes):
cat /opt/splunk/etc/system/local/distsearch.conf
[distributedSearch]
servers = https://10.xxx.36.000:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:...
Changed distsearch.conf to:
[distributedSearch]
servers =
We even tried deleting the distsearch.conf file on all the search heads in the cluster and then restarting all the members, but the distsearch.conf file gets recreated.
Below is the output of the btool command for distsearch from one of the affected search heads in the cluster. I have checked for any monitoring or configuration-management tool, but we don't have one managing the Splunk process.
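Before and after each attempt, it can help to confirm whether the servers lists really differ between members. The following is a minimal sketch, not a Splunk tool: it assumes you have copied the text of each member's distsearch.conf into a dictionary keyed by member name, and the function names and host names are hypothetical. It extracts the servers value from [distributedSearch], reports duplicate entries, and groups members by the set of peers they list. Duplicates cannot be judged from the masked listing above, so run it against the unmasked files.

```python
from collections import Counter


def parse_servers(conf_text):
    """Return the URIs listed under `servers` in the [distributedSearch] stanza."""
    in_stanza = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            # Track whether we are inside the stanza we care about.
            in_stanza = line == "[distributedSearch]"
            continue
        if in_stanza and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "servers":
                return [s.strip() for s in value.split(",") if s.strip()]
    return []


def duplicate_servers(servers):
    """Return the URIs that appear more than once, sorted."""
    counts = Counter(servers)
    return sorted(uri for uri, n in counts.items() if n > 1)


def compare_members(confs):
    """Group member names by the set of peers in their distsearch.conf.

    `confs` maps a member name (hypothetical, e.g. "sh1") to the text of
    that member's distsearch.conf. More than one group means the members
    disagree about the peer list.
    """
    groups = {}
    for member, text in confs.items():
        groups.setdefault(frozenset(parse_servers(text)), []).append(member)
    return groups
```

If compare_members returns a single group, every member lists the same peers and the inconsistency has to be somewhere else.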
[spnksvc@ep3vmnspk199 bin]$ ./splunk cmd btool distsearch list --debug
/opt/splunk/etc/system/default/distsearch.conf [bundleEnforcerBlacklist]
/opt/splunk/etc/system/default/distsearch.conf [bundleEnforcerWhitelist]
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf [distributedSearch]
/opt/splunk/etc/system/default/distsearch.conf authTokenConnectionTimeout = 5
/opt/splunk/etc/system/default/distsearch.conf authTokenReceiveTimeout = 10
/opt/splunk/etc/system/default/distsearch.conf authTokenSendTimeout = 10
/opt/splunk/etc/system/default/distsearch.conf bestEffortSearch = false
/opt/splunk/etc/system/default/distsearch.conf connectionTimeout = 10
/opt/splunk/etc/system/default/distsearch.conf defaultUriScheme = https
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf disabled = 0
/opt/splunk/etc/system/default/distsearch.conf receiveTimeout = 600
/opt/splunk/etc/system/default/distsearch.conf sendTimeout = 30
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf serverTimeout = 900
/opt/splunk/etc/system/local/distsearch.conf servers = https://10.xxx.36.000:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:...
/opt/splunk/etc/system/default/distsearch.conf shareBundles = true
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf statusTimeout = 900
/opt/splunk/etc/system/default/distsearch.conf useSHPBundleReplication = true
/opt/splunk/etc/apps/Splunk_TA_windows/default/distsearch.conf [replicationBlacklist]
/opt/splunk/etc/apps/splunk_app_windows_infrastructure/default/distsearch.conf MSAD_lookups = .../splunk_app_windows_infrastructure/lookups/(tHostInfo|tSessions).csv$
/opt/splunk/etc/system/default/distsearch.conf conf = (system|(apps/))/(default|local)/server.conf
/opt/splunk/etc/system/default/distsearch.conf framework = apps/framework/...
/opt/splunk/etc/system/default/distsearch.conf lookupindexfiles = (system|apps/|users(/reserved)?//)/lookups/.(tmp$|index($|/...))
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf noBinDir = (.../bin/)
/opt/splunk/etc/apps/Splunk_TA_windows/default/distsearch.conf nontsyslogmappings = ...ntsyslog_mappings.csv
/opt/splunk/etc/system/default/distsearch.conf sampleapp = apps/sample_app/...
/opt/splunk/etc/system/default/distsearch.conf user_specific_meta = users(/_reserved)?///metadata/local.meta
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf [replicationSettings]
/opt/splunk/etc/system/default/distsearch.conf allowDeltaUpload = true
/opt/splunk/etc/system/default/distsearch.conf allowSkipEncoding = true
/opt/splunk/etc/system/default/distsearch.conf allowStreamUpload = auto
/opt/splunk/etc/system/default/distsearch.conf concerningReplicatedFileSize = 500
/opt/splunk/etc/system/default/distsearch.conf connectionTimeout = 60
/opt/splunk/etc/system/default/distsearch.conf excludeReplicatedLookupSize = 0
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf maxBundleSize = 14438892420
/opt/splunk/etc/system/default/distsearch.conf maxMemoryBundleSize = 10
/opt/splunk/etc/apps/splunk_dist_conf/default/distsearch.conf replicationThreads = 8
/opt/splunk/etc/system/default/distsearch.conf sanitizeMetaFiles = true
/opt/splunk/etc/system/default/distsearch.conf sendRcvTimeout = 60
/opt/splunk/etc/system/default/distsearch.conf [replicationSettings:refineConf]
/opt/splunk/etc/system/default/distsearch.conf replicate.app = true
/opt/splunk/etc/system/default/distsearch.conf replicate.authorize = true
/opt/splunk/etc/system/default/distsearch.conf replicate.collections = true
/opt/splunk/etc/system/default/distsearch.conf replicate.commands = true
/opt/splunk/etc/system/default/distsearch.conf replicate.eventtypes = true
/opt/splunk/etc/system/default/distsearch.conf replicate.fields = true
/opt/splunk/etc/system/default/distsearch.conf replicate.literals = true
/opt/splunk/etc/system/default/distsearch.conf replicate.lookups = true
/opt/splunk/etc/system/default/distsearch.conf replicate.multikv = true
/opt/splunk/etc/system/default/distsearch.conf replicate.props = true
/opt/splunk/etc/system/default/distsearch.conf replicate.segmenters = true
/opt/splunk/etc/system/default/distsearch.conf replicate.tags = true
/opt/splunk/etc/system/default/distsearch.conf replicate.transactiontypes = true
/opt/splunk/etc/system/default/distsearch.conf replicate.transforms = true
/opt/splunk/etc/system/default/distsearch.conf [replicationWhitelist]
/opt/splunk/etc/system/default/distsearch.conf kvstore = kvstore/...
/opt/splunk/etc/system/default/distsearch.conf other = (system|(apps/(?!pdfserver))|users(/_reserved)?//)/(bin|lookups)/...
/opt/splunk/etc/system/default/distsearch.conf refine.conf = (system|(apps/)|users(/_reserved)?//)/(default|local)/.conf
/opt/splunk/etc/system/default/distsearch.conf refine.metadata = (system|(apps/)|users(/_reserved)?//)/metadata/.meta
/opt/splunk/etc/system/default/distsearch.conf searchscripts = searchscripts/...
/opt/splunk/etc/system/default/distsearch.conf [tokenExchKeys]
/opt/splunk/etc/system/default/distsearch.conf certDir = $SPLUNK_HOME/etc/auth/distServerKeys
/opt/splunk/etc/system/default/distsearch.conf genKeyScript = $SPLUNK_HOME/bin/splunk, createssl, audit-keys
/opt/splunk/etc/system/default/distsearch.conf privateKey = private.pem
/opt/splunk/etc/system/default/distsearch.conf publicKey = trusted.pem
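With output this long it is easy to miss which file a given setting comes from. Here is a small sketch for pulling that out of saved `btool ... list --debug` output. It assumes each line is a file path, whitespace, and then either a `[stanza]` header or a `key = value` pair, as in the listing above, and that the paths contain no spaces. The function names are made up for illustration.

```python
def parse_btool_debug(output):
    """Parse `btool list --debug` text into (source_file, stanza, key, value) tuples."""
    results = []
    stanza = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        # The first whitespace-delimited token is the contributing file.
        path, _, rest = line.partition(" ")
        rest = rest.strip()
        if rest.startswith("[") and rest.endswith("]"):
            stanza = rest[1:-1]
            continue
        key, sep, value = rest.partition("=")
        if not sep:
            continue  # not a key = value line; skip it
        results.append((path, stanza, key.strip(), value.strip()))
    return results


def source_of(output, stanza, key):
    """Return the file that supplies `key` in `stanza`, or None if absent."""
    for path, current_stanza, current_key, _ in parse_btool_debug(output):
        if current_stanza == stanza and current_key == key:
            return path
    return None
```

Called as `source_of(text, "distributedSearch", "servers")` on the output above, it should point at /opt/splunk/etc/system/local/distsearch.conf, which is the file that keeps being rewritten.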
Hi. So this couldn't be some automation, like Chef, putting the file back for you?
Hi @burwell
We don't have any automation or CM tools monitoring the file system that would restore the file.
The file is created by the user that runs Splunk on the server. We tried deleting the file and restarting Splunk, but it gets restored again.
Try the approach below.
$SPLUNK_HOME/bin/splunk resync shcluster-replicated-config
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members
Thanks @jawaharas
I don't see the file on the captain now. Should I create a file with contents on the captain and then run steps 2 and 3?
Yep. Go ahead.
Just tried the approach. I created distsearch.conf on the captain with these contents:
[distributedSearch]
servers =
Ran $SPLUNK_HOME/bin/splunk resync shcluster-replicated-config
Did a rolling restart of the SH members.
I checked a couple of members where the restart had completed and found that distsearch.conf had been overwritten again with the old contents:
[distributedSearch]
servers = https://10.xxx.36.000:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:8089,https://10.xxx.46.00:...
Update:
The older search heads in the cluster (including the captain) had their distsearch.conf overwritten with the old contents; the 4 new search heads we added this week seem to be okay.
Did you run the command below on the search head members (not on the captain) and verify the config file content before the restart?
$SPLUNK_HOME/bin/splunk resync shcluster-replicated-config
Yes. I ran it on all SH members except the captain, then verified the config file contents on all the members before the restart, but we are still seeing the issue.
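To pin down exactly which members revert and at which step (after the resync, or only after the restart), one option is to fingerprint the file on each member at every step and compare the snapshots. This is only a sketch with hypothetical function and member names. It assumes you collect the fingerprints yourself, for example by running fingerprint_file on each host, and it says nothing about what rewrites the file.

```python
import hashlib
from pathlib import Path


def fingerprint_text(text):
    """SHA-256 hex digest of config text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint_file(path):
    """SHA-256 hex digest of a file, or None if the file does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def changed_members(before, after):
    """Return the members whose fingerprint differs between two snapshots.

    `before` and `after` map a member name to a fingerprint (or None when
    the file was absent on that member).
    """
    members = set(before) | set(after)
    return sorted(m for m in members if before.get(m) != after.get(m))
```

Taking one snapshot right after the edit, one after the resync, and one after the rolling restart shows at which step each member's file changed.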
I hope you are using clustered indexers.
Can you check whether the [shclustering] stanza in '$SPLUNK_HOME/etc/system/local/server.conf' is consistent across all search head members?
Also, can you share 'shclustering' stanza content from your search-head's 'server.conf' (after masking sensitive data)?
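If comparing the stanza by eye across many members is tedious, a rough sketch like the one below can do it from copies of each member's server.conf. It leans on Python's configparser, which handles simple Splunk .conf files but is not a full Splunk parser, so treat that as an assumption. The function names and the default ignore list (mgmt_uri, which is expected to differ per member) are illustrative choices.

```python
import configparser


def read_stanza(conf_text, stanza):
    """Return the key/value pairs of one stanza from .conf text."""
    parser = configparser.RawConfigParser(
        strict=False,            # tolerate repeated stanzas or keys
        interpolation=None,      # leave characters like % and $ untouched
        allow_no_value=True,     # tolerate bare keys without "="
        delimiters=("=",),       # only "=" separates key and value
    )
    parser.optionxform = str     # keep key case as written
    parser.read_string(conf_text)
    if not parser.has_section(stanza):
        return {}
    return dict(parser.items(stanza))


def diff_stanza(confs, stanza="shclustering", ignore=("mgmt_uri",)):
    """Report keys whose values differ between members.

    `confs` maps a member name to the text of its server.conf. The result
    maps each differing key to a {member: value} dictionary; a missing key
    shows up as None for that member.
    """
    per_member = {m: read_stanza(text, stanza) for m, text in confs.items()}
    keys = set()
    for settings in per_member.values():
        keys.update(settings)
    diffs = {}
    for key in sorted(keys - set(ignore)):
        values = {m: settings.get(key) for m, settings in per_member.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs
```

An empty result means the stanza matches on every member, apart from the ignored keys.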
hi @jawaharas
Yes, we're using indexer clustering. I deleted distsearch.conf again today and restarted Splunk on the search heads, and found it was re-created on all but one search head in the cluster.
[sslConfig]
sslKeysfilePassword = $1$EDkhKG6tJRyF
sslPassword = $1$EDkhKG6tJRyF
[lmpool:auto_generated_pool_download-trial]
description = auto_generated_pool_download-trial
quota = MAX
slaves = *
stack_id = download-trial
[lmpool:auto_generated_pool_forwarder]
description = auto_generated_pool_forwarder
quota = MAX
slaves = *
stack_id = forwarder
[lmpool:auto_generated_pool_free]
description = auto_generated_pool_free
quota = MAX
slaves = *
stack_id = free
[general]
pass4SymmKey = $1$EXktLS6/MxP38oI=
serverName = eo1vmsk099.lema
[license]
master_uri = https://eo1vmsk444.lema:8089
[replication_port://8090]
[raft_statemachine]
disabled = false
[shclustering]
conf_deploy_fetch_url = https://eo1vmsk555.lema:8089
disabled = 0
mgmt_uri = https://10.XXX.XX.XXX:8089
id = 013107EC-FC15-4338-A045-75942E648CB7
[clustering]
master_uri = clustermaster:eo1vmsk555.lema:8089
mode = searchhead
[clustermaster:eo1vmsk555.lema:8089]
master_uri = https://eo1vmsk555.lema:8089
multisite = 0
site = default
pass4SymmKey = $1$EXktLS6/MxP38oI=
The search head members fetch the configuration bundle from the deployer (the host given in the 'conf_deploy_fetch_url' parameter).
Do you have connectivity between the search head (where you have issue) and the deployer (https://eo1vmsk555.lema:8089)?
The best practice is to edit this either from the GUI or by creating an app on the deployer and pushing it to the SHC. Editing config files directly does not trigger a replication task across the SHC, so when you edit or delete the file on one host, the other members are not aware of the change, and that can cause problems.
Hi. What version of Splunk is this happening on?
@burwell - it's 7.1.1