Deployment Architecture

Why is our Search Head Cluster captain generating "ERROR KVStorageProvider - Could not update replica set configuration" every 5 seconds?

splunkIT
Splunk Employee

We have a 5 node Search Head Cluster.

The SHC captain generates the following message every 5 seconds:

03-01-2016 09:12:54.909 -0800 ERROR KVStorageProvider - Could not update replica set configuration, error domain 1, err code 12, Error message: Requested PRIMARY node is not available. 
03-01-2016 09:12:54.909 -0800 ERROR KVStoreConfigurationProvider - Failed to update replica set configuration

When we poll the services/server/info REST endpoint, the SHC captain returns:

<s:key name="kvStoreStatus">failed</s:key>

And the SHC members return:

    <s:key name="kvStoreStatus">starting</s:key>

On 3 of the SHC members, mongod.log reports that no local replica set configuration was found:

    2016-02-29T08:24:00.380Z I REPL [initandlisten] Did not find local replica set configuration document at startup;
    NoMatchingDocument Did not find replica set configuration document in local.system.replset
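For reference, a quick way to check each member's KVStore status is to query that endpoint directly (a sketch assuming the default management port 8089; admin:changeme is a placeholder, substitute your own credentials):

    curl -s -k -u admin:changeme https://localhost:8089/services/server/info | grep kvStoreStatus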
1 Solution

splunkIT
Splunk Employee

It is hard to tell exactly what happened with the KVStore. Possible causes include:
- an IP address change on one or more members
- the etc/instance.cfg file being removed and regenerated

If you don't currently use the KVStore for your own collections, the quickest solution is to clean up the KVStore by running:
splunk clean kvstore --local

You have two options to perform the kvstore cleanup:

Option 1: Fewer steps, but requires shutting down all the SHC members prior to the cleanup:
1. Stop all SHC nodes.
2. Back up the $SPLUNK_DB/var/lib/kvstore folder on each SHC instance.
3. Clean up the KVStore with splunk clean kvstore --local on each node (see the example command sequence after this list).
4. Start all SHC nodes.
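The per-member commands for Option 1 might look roughly like this (a sketch assuming a Linux install with $SPLUNK_HOME and $SPLUNK_DB set, and /backup as an example backup destination; adjust paths for your environment):

    # 1. stop every SHC member first
    $SPLUNK_HOME/bin/splunk stop
    # 2-3. on each member: back up, then clean the local KVStore
    cp -rp "$SPLUNK_DB/var/lib/kvstore" "/backup/kvstore.$(hostname)"
    $SPLUNK_HOME/bin/splunk clean kvstore --local
    # 4. once every member has been cleaned, start them all again
    $SPLUNK_HOME/bin/splunk start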

Option 2: If you cannot stop all members at once (a slightly longer set of steps):
1. Identify the captain node.
2. One by one, on each member that is not the captain:
a) Stop Splunk.
b) Back up $SPLUNK_DB/var/lib/kvstore.
c) Clean up the KVStore with splunk clean kvstore --local.
d) Disable the KVStore in server.conf (see the example stanza after this list, and http://docs.splunk.com/Documentation/Splunk/6.3.3/Admin/Serverconf).
e) Start this member.
3. Repeat the same operations on the captain.
4. At this point you should have a working SHC with the KVStore disabled on all members.
5. Verify that the KVStore port is open and can be accessed from all other members of the SHC.
6. Now re-enable the KVStore on each member one by one, again starting with the non-captains:
a) Stop Splunk.
b) Enable the KVStore again in server.conf.
c) Start Splunk.
d) Verify that the KVStore reaches status ready on each member (for example: curl -s -k https://localhost:8089/services/server/info | grep kvStoreStatus) and that there are no KVStore errors in splunkd.log on the SHC captain.
7. Finally, repeat the steps from 6 on the captain.

At this point you should have a fully working SHC.
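For steps 2d and 6b, the KVStore toggle lives in the [kvstore] stanza of server.conf (see the Serverconf docs linked above). A minimal sketch, assuming you edit $SPLUNK_HOME/etc/system/local/server.conf on each member:

    # $SPLUNK_HOME/etc/system/local/server.conf
    [kvstore]
    disabled = true     # step 2d: disable; set to false (or remove the line) for step 6b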



johankellerman
New Member

Thanks!
Option 2 was the solution to my problem with a non-working KVStore on a 3-member SHC. I had tried several things, including splunk clean kvstore --cluster, without any luck.


tawollen
Path Finder

@splunkit, what if you DO use the KVStore?


splunkIT
Splunk Employee

@tawollen, you would have to use Option 2 mentioned above. You can also restore the KVStore data: at step 6, on the first member, after you stop it, copy your kvstore backup back into place and run splunk clean kvstore --cluster. For all other members, make sure they have enough time to resynchronize all data from that first member (you can check the introspection logs for that).
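A rough sketch of that restore flow on the first member you re-enable (assuming /backup/kvstore.<host> is wherever you stored the earlier backup; treat this as an outline rather than an exact recipe):

    # on the first member being re-enabled, while Splunk is stopped
    cp -rp /backup/kvstore.<host>/* "$SPLUNK_DB/var/lib/kvstore/"
    # re-enable the KVStore in server.conf ([kvstore] disabled = false), then:
    $SPLUNK_HOME/bin/splunk clean kvstore --cluster
    $SPLUNK_HOME/bin/splunk start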
