Deployment Architecture

Why is our Search Head Cluster captain generating "ERROR KVStorageProvider - Could not update replica set configuration" every 5 seconds?

splunkIT
Splunk Employee

We have a 5-node Search Head Cluster (SHC).

The SHC captain generates the following message every 5 seconds:

03-01-2016 09:12:54.909 -0800 ERROR KVStorageProvider - Could not update replica set configuration, error domain 1, err code 12, Error message: Requested PRIMARY node is not available. 
03-01-2016 09:12:54.909 -0800 ERROR KVStoreConfigurationProvider - Failed to update replica set configuration

Polling the services/server/info REST endpoint, the SHC captain returns:

    <s:key name="kvStoreStatus">failed</s:key>

And the SHC members return:

    <s:key name="kvStoreStatus">starting</s:key>

The mongod.log on three of the SHC members reports no local replica set configuration:

    2016-02-29T08:24:00.380Z I REPL [initandlisten] Did not find local replica set configuration document at startup;
    NoMatchingDocument Did not find replica set configuration document in local.system.replset
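A quick way to see which members are unhealthy is to poll services/server/info on each one and pull out the kvStoreStatus value. A minimal sketch, assuming the default management port 8089; the check_kvstore_status helper and its sed expression are my own, not a Splunk-provided command, and the hostname/credentials in the usage comment are placeholders:

```shell
# Extract the kvStoreStatus value from a services/server/info XML response.
# check_kvstore_status is a hypothetical helper for illustration.
check_kvstore_status() {
  sed -n 's/.*<s:key name="kvStoreStatus">\([^<]*\)<\/s:key>.*/\1/p'
}

# Live usage against one member (placeholder host and credentials):
#   curl -s -k -u admin:changeme https://shc-member1:8089/services/server/info \
#     | check_kvstore_status
```

Running this against every member quickly shows the failed/starting split described above.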
1 Solution

splunkIT
Splunk Employee
Splunk Employee

It is hard to tell exactly what happened with the KV store. Possible causes are:
- an IP address change on one or more members
- the etc/instance.cfg file being removed and regenerated

If you don't currently use the KV store for your own collections, the quickest solution is to clean it up by invoking:
splunk clean kvstore --local

You have two options to perform the KV store cleanup:

Option 1: Fewer steps, but requires shutting down all SHC members prior to the cleanup:
1. Stop all SHC nodes.
2. Back up the $SPLUNK_DB/var/lib/kvstore folder on each SHC instance.
3. Clean up the KV store with splunk clean kvstore --local on each node.
4. Start all SHC nodes.
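Scripted, Option 1 amounts to the loops below. This is a dry-run sketch under my own assumptions: run() only prints each planned step, the member hostnames are placeholders, and actual remote execution (ssh or your orchestration tool) is left out:

```shell
#!/bin/sh
# Dry-run sketch of Option 1: run() echoes each step instead of executing it,
# so you can review the plan before wiring in real remote execution.
MEMBERS="shc1 shc2 shc3 shc4 shc5"   # placeholder hostnames

run() { echo "+ $*"; }

# 1. Stop all SHC nodes before touching the KV store files.
for m in $MEMBERS; do run "$m": splunk stop; done

# 2-3. Back up, then clean, the KV store on every node.
for m in $MEMBERS; do
  run "$m": cp -r '$SPLUNK_DB/var/lib/kvstore' '$SPLUNK_DB/var/lib/kvstore.bak'
  run "$m": splunk clean kvstore --local
done

# 4. Start all nodes again.
for m in $MEMBERS; do run "$m": splunk start; done
```

Stopping every member before the first clean matters: it prevents a still-running member from rejoining with stale replica set state.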

Option 2: If you cannot stop all members at once (requires a slightly longer set of steps):
1. Identify the captain node.
2. One by one, on each member that is not the captain:
a) Stop Splunk.
b) Back up $SPLUNK_DB/var/lib/kvstore.
c) Clean up the KV store with splunk clean kvstore --local.
d) Disable the KV store in server.conf; see http://docs.splunk.com/Documentation/Splunk/6.3.3/Admin/Serverconf
e) Start this member.
3. Do the same operations on the captain.
4. At this point you should have a working SHC with the KV store disabled on all members.
5. Verify that the KV store port is open and can be accessed from all other members of the SHC.
6. Now enable the KV store on each member, one by one, again starting with the non-captains:
a) Stop Splunk.
b) Enable the KV store in server.conf.
c) Start Splunk.
d) Verify that the KV store reaches status ready on each member (for example, curl -s -k https://localhost:8089/services/server/info | grep kvStoreStatus), and that there are no issues in splunkd.log on the SHC captain.
7. The last step is to repeat the steps from 6 on the captain.
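For reference, the disable/enable toggle used in steps 2d and 6b is the [kvstore] stanza of server.conf, per the Serverconf docs linked in step 2d; the path below assumes the standard $SPLUNK_HOME/etc/system/local location:

```ini
# $SPLUNK_HOME/etc/system/local/server.conf
# Disable the KV store on this member (step 2d); set back to false
# (or remove the line) to re-enable it in step 6b, then restart Splunk.
[kvstore]
disabled = true
```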

At this point you should have a fully working SHC.


johankellerman
New Member

Thanks!
Option 2 was the solution to my problem with a non-working KV store on a 3-member SHC. I tried several things, including splunk clean kvstore --cluster, without any luck, but Option 2 solved it.


tawollen
Path Finder

@splunkit, what if you ~DO~ use the KV store?


splunkIT
Splunk Employee

@tawollen, you would have to use Option 2 mentioned above. You can also restore the KV store: at step 6, for the first member, after you stop it, copy the kvstore backup back into place and execute "splunk clean kvstore --cluster" (without the quotes). For all the other members, make sure they have time to resynchronize all the data from the first member (you can check the introspection logs for that).
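That restore on the first member can be sketched as the dry run below; run() only prints each step rather than executing it, and the .bak path is a placeholder for wherever you kept the backup taken in the earlier steps:

```shell
#!/bin/sh
# Dry-run sketch of restoring a KV store backup on the first member
# during step 6 of Option 2. run() echoes each step instead of executing it.
run() { echo "+ $*"; }

run splunk stop
# Put the backed-up kvstore folder back in place (placeholder paths).
run cp -r '$SPLUNK_DB/var/lib/kvstore.bak' '$SPLUNK_DB/var/lib/kvstore'
# Reset the cluster/replica set metadata so the restored data can rejoin.
run splunk clean kvstore --cluster
run splunk start
```

The remaining members then resynchronize from this first member rather than being restored individually.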
