Deployment Architecture

Why is our Search Head Cluster captain generating "ERROR KVStorageProvider - Could not update replica set configuration" every 5 seconds?

splunkIT
Splunk Employee

We have a 5-node Search Head Cluster (SHC).

The SHC captain generates the following message every 5 seconds:

03-01-2016 09:12:54.909 -0800 ERROR KVStorageProvider - Could not update replica set configuration, error domain 1, err code 12, Error message: Requested PRIMARY node is not available. 
03-01-2016 09:12:54.909 -0800 ERROR KVStoreConfigurationProvider - Failed to update replica set configuration

Polling the services/server/info REST endpoint, the SHC captain returns:

    <s:key name="kvStoreStatus">failed</s:key>

And the SHC members return:

    <s:key name="kvStoreStatus">starting</s:key>

The mongod.log on three of the SHC members reports no local replica set configuration:

    2016-02-29T08:24:00.380Z I REPL [initandlisten] Did not find local replica set configuration document at startup;
    NoMatchingDocument Did not find replica set configuration document in local.system.replset
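A quick way to see which members are unhealthy is to poll services/server/info on each one and pull out the kvStoreStatus value. A minimal sketch, assuming the default management port 8089; the check_kvstore_status helper and its sed expression are my own, not a Splunk-provided command, and the hostname/credentials in the usage comment are placeholders:

```shell
# Extract the kvStoreStatus value from a services/server/info XML response.
# check_kvstore_status is a hypothetical helper for illustration.
check_kvstore_status() {
  sed -n 's/.*<s:key name="kvStoreStatus">\([^<]*\)<\/s:key>.*/\1/p'
}

# Live usage against one member (placeholder host and credentials):
#   curl -s -k -u admin:changeme https://shc-member1:8089/services/server/info \
#     | check_kvstore_status
```

Running this against every member quickly shows the failed/starting split described above.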
1 Solution

splunkIT
Splunk Employee
Splunk Employee

It is hard to tell exactly what happened with the KV store. Possible causes are:
- an IP address change on one or more members
- the etc/instance.cfg file being removed and regenerated

If you don't currently use the KV store for your own collections, the quickest solution is to clean it up by invoking:
splunk clean kvstore --local

You have two options to perform the KV store cleanup:

Option 1: Fewer steps, but requires shutting down all SHC members prior to the cleanup:
1. Stop all SHC nodes.
2. Back up the $SPLUNK_DB/var/lib/kvstore folder on each SHC instance.
3. Clean up the KV store with splunk clean kvstore --local on each node.
4. Start all SHC nodes.
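Scripted, Option 1 amounts to the loops below. This is a dry-run sketch under my own assumptions: run() only prints each planned step, the member hostnames are placeholders, and actual remote execution (ssh or your orchestration tool) is left out:

```shell
#!/bin/sh
# Dry-run sketch of Option 1: run() echoes each step instead of executing it,
# so you can review the plan before wiring in real remote execution.
MEMBERS="shc1 shc2 shc3 shc4 shc5"   # placeholder hostnames

run() { echo "+ $*"; }

# 1. Stop all SHC nodes before touching the KV store files.
for m in $MEMBERS; do run "$m": splunk stop; done

# 2-3. Back up, then clean, the KV store on every node.
for m in $MEMBERS; do
  run "$m": cp -r '$SPLUNK_DB/var/lib/kvstore' '$SPLUNK_DB/var/lib/kvstore.bak'
  run "$m": splunk clean kvstore --local
done

# 4. Start all nodes again.
for m in $MEMBERS; do run "$m": splunk start; done
```

Stopping every member before the first clean matters: it prevents a still-running member from rejoining with stale replica set state.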

Option 2: If you cannot stop all members at once (requires a slightly longer set of steps):
1. Identify the captain node.
2. One by one, on each member that is not the captain:
a) Stop Splunk.
b) Back up $SPLUNK_DB/var/lib/kvstore.
c) Clean up the KV store with splunk clean kvstore --local.
d) Disable the KV store in server.conf; see http://docs.splunk.com/Documentation/Splunk/6.3.3/Admin/Serverconf
e) Start this member.
3. Do the same operations on the captain.
4. At this point you should have a working SHC with the KV store disabled on all members.
5. Verify that the KV store port is open and can be accessed from all other members of the SHC.
6. Now enable the KV store on each member, one by one, again starting with the non-captains:
a) Stop Splunk.
b) Enable the KV store in server.conf.
c) Start Splunk.
d) Verify that the KV store reaches status ready on each member (for example, curl -s -k https://localhost:8089/services/server/info | grep kvStoreStatus), and that there are no issues in splunkd.log on the SHC captain.
7. The last step is to repeat the steps from 6 on the captain.
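For reference, the disable/enable toggle used in steps 2d and 6b is the [kvstore] stanza of server.conf, per the Serverconf docs linked in step 2d; the path below assumes the standard $SPLUNK_HOME/etc/system/local location:

```ini
# $SPLUNK_HOME/etc/system/local/server.conf
# Disable the KV store on this member (step 2d); set back to false
# (or remove the line) to re-enable it in step 6b, then restart Splunk.
[kvstore]
disabled = true
```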

At this point you should have a fully working SHC.


johankellerman
New Member

Thanks!
Option 2 was the solution to my problem with a non-working KV store on a 3-member SHC. I tried several things, including splunk clean kvstore --cluster, without any luck, but Option 2 solved it.


tawollen
Path Finder

@splunkit, what if you ~DO~ use the KV store?


splunkIT
Splunk Employee

@tawollen, you would have to use Option 2 mentioned above. You can also restore the KV store: at step 6, for the first member, after you stop it, copy the kvstore backup back into place and execute "splunk clean kvstore --cluster" (without the quotes). For all the other members, make sure they have time to resynchronize all the data from the first member (you can check the introspection logs for that).
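That restore on the first member can be sketched as the dry run below; run() only prints each step rather than executing it, and the .bak path is a placeholder for wherever you kept the backup taken in the earlier steps:

```shell
#!/bin/sh
# Dry-run sketch of restoring a KV store backup on the first member
# during step 6 of Option 2. run() echoes each step instead of executing it.
run() { echo "+ $*"; }

run splunk stop
# Put the backed-up kvstore folder back in place (placeholder paths).
run cp -r '$SPLUNK_DB/var/lib/kvstore.bak' '$SPLUNK_DB/var/lib/kvstore'
# Reset the cluster/replica set metadata so the restored data can rejoin.
run splunk clean kvstore --cluster
run splunk start
```

The remaining members then resynchronize from this first member rather than being restored individually.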
