Deployment Architecture

Why is the KV Store status showing as "starting" in a Search Head Cluster?

TGanga
Explorer

We created a KV Store on a search head in a clustered architecture by adding the files collections.conf and transforms.conf.
But we can't access the KV Store using the inputlookup command and are getting the error below.

"Error in 'inputlookup' command: External command based lookup 'kvstorecoll_lookup' is not available because KV Store initialization has not completed yet. Please try again later.
The search job has failed due to an error. You may be able to view the job in the Job Inspector."
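For reference, a KV Store lookup like this is typically defined with a pair of stanzas along the following lines; the collection name and field names here are placeholders for illustration, not the exact configuration in use:

     # collections.conf -- defines the KV Store collection
     [kvstorecoll]
     field.CustID = string
     field.CustName = string

     # transforms.conf -- exposes the collection as a lookup for inputlookup/outputlookup
     [kvstorecoll_lookup]
     external_type = kvstore
     collection = kvstorecoll
     fields_list = _key, CustID, CustName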

When we checked the status of the KV Store with the curl command below, it showed as "starting".

"curl -k -s https://localhost:8089/services/server/info | grep kvStore"

We also checked mongod.log, deleted the log and retried, but with no success.
We checked the SSL certificate validity and found it to be "notAfter=Dec 9 19:01:45 2019 GMT".
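One way to read that notAfter date, assuming the certificate is at the default Splunk server certificate path (adjust the path if a custom certificate is configured), is:

     $ openssl x509 -enddate -noout -in $SPLUNK_HOME/etc/auth/server.pem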

We could not trace the exact reason for the problem. Could you please suggest other options that we should try?

1 Solution

sylim_splunk
Splunk Employee

- If you still have a primary or secondary and only some members went out of sync, then use the method in this doc:

https://docs.splunk.com/Documentation/Splunk/latest/Admin/ResyncKVstore
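As a rough summary of that doc (please verify the exact steps against the linked page for your Splunk version), the per-member resync usually amounts to stopping the stale member, clearing its local KV Store data, and restarting it so it resyncs from the rest of the cluster:

     $ ./splunk stop
     $ ./splunk clean kvstore --local
     $ ./splunk start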

- If you think the KV Store cluster is broken, for example there is no primary or secondary at all and it needs recovery, then follow the steps below.

This can happen because there was no SHC captain when the search head was started. That is why the KV Store stays in "starting" and never makes it to "ready": the SHC captain is the one that tells the KV Store which members are available for the replica set.

Follow the steps below to correct the situation:
1. Back up $SPLUNK_HOME on all members!
2. Stop all SHC instances.
3. Run these commands on all members:
     $ rm -rf $SPLUNK_HOME/var/run/splunk/_raft/*

     $ splunk clean kvstore --cluster
     > This does not delete the database; it only deletes the cluster info.
4. Choose one member where you think the KV Store worked before, and edit as below
    in "$SPLUNK_HOME/etc/system/local/server.conf"
    [shclustering]
    replication_factor=1
5. Start this member
6. Bootstrap SHC with just this member
     $ ./bin/splunk bootstrap shcluster-captain -servers_list "https://THIS_MEMBER_URL:8089" -auth admin:password
7. Verify SHC status
     $ ./splunk show shcluster-status
8. Verify the KV Store status; you should see 'ready':
     $  curl -k -s https://localhost:8089/services/server/info | grep kvStore

    OR: $ ./splunk show kvstore-status
9. Stop this instance and change replication_factor to whatever it was before.
      in "$SPLUNK_HOME/etc/system/local/server.conf"
      [shclustering]
      replication_factor=<value it was before>
10. Clean folder $SPLUNK_HOME/var/run/splunk/_raft/ on this instance.
11. Start ALL instances.
12. Bootstrap SHC with all members.

      $ ./bin/splunk bootstrap shcluster-captain -servers_list "https://server1:8089,https://server2:8089,https://server3:8089,https://server4:8089" -auth admin:changeme
       > Replace server1~4 with your member info.
13. Verify SHC status
       $ ./splunk show shcluster-status
14. Verify the KV Store status from all members (a scripted check is sketched after this list):
       $ curl -k -s https://localhost:8089/services/server/info | grep kvStore
       OR $./splunk show kvstore-status

 

15. If any data or collections are found to be missing, restore them from the latest good backup using the restore command; details are in the doc below:

https://docs.splunk.com/Documentation/Splunk/8.0.5/Admin/BackupKVstore  
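If it helps, step 14 can be run from one host in a single pass. The sketch below assumes SSH access to each member and a Splunk install under /opt/splunk; adjust the host list, path, and credentials for your environment:

      $ for host in server1 server2 server3 server4; do
            ssh "$host" "/opt/splunk/bin/splunk show kvstore-status -auth admin:changeme"
        done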


shankern
Explorer

We use a private cloud, and in fact for the 'KV Store status is showing as "starting"' issue, allowing ingress to port 8191 fixed the problem. There was no need for me to run the steps mentioned by sylim_splunk in my case.
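A quick way to confirm that the ingress is really open, assuming nc is available, is to test port 8191 from each of the other members; the host name below is a placeholder:

     $ nc -vz other-member.example.com 8191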


trydman
Engager

If "$ curl -k -s https://localhost:8089/services/server/info | grep kvStore" doesn't work for you in sylim response, then you can use the following command to determine the status.

splunk show shcluster-status -auth username:password

sherm77
Path Finder

I've just found this to be true while following this process to clear out the KV Store issue. curl didn't work, but show shcluster-status worked fine.



kundeng
Path Finder

stale product, hasn't improved much in state management after 10 years. 

 


yomesky2000
Observer

After applying the changes from steps 9-12 (stopping the Splunk service, changing the replication factor back to what it was, and starting all SH members), the KV Store status goes back to "starting". The KV Store status was initially "ready" after applying steps 1-8. Is there any reason why this is happening?


mcederhage_splu
Splunk Employee

Thanks for writing this up. Not only did it fix my problem, but I also learnt some nice tricks and a good workflow.


esalesapns2
Communicator

I followed the steps, but my status is still "starting" on the captain, replicationStatus is "Startup" on one member and "Down" on the other, even though it was "ready" when I brought up the captain solo. How long should it take to replicate?


sherm77
Path Finder

An issue I've found is that port 8191 for mongod has to be open so the search heads in the cluster can replicate the KV Store data. I ran through the helpful procedure above by sylim, and when everything came up, I got a notice that the non-captain cluster members were getting "failed with No route to host".

sudo /usr/bin/firewall-cmd --add-port=8191/tcp
sudo /usr/bin/firewall-cmd --runtime-to-permanent
sudo /usr/bin/firewall-cmd --list-ports --permanent

These commands fixed the port issue and all is well.

jkat54
SplunkTrust

Opened 8191 to resolve as well.

Happened on initial build.


liliana_colman
Engager

I had the same problem and found that the KV Store was waiting for a response from peers on port 8191, which was not allowed on the security group (SG). I did not need to follow the process to reassign the captain. Thanks sherm77
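For what it's worth, if the members run in AWS, adding that rule to the security group can look like the sketch below; the group ID and CIDR are placeholders and should be replaced with your own values:

     $ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8191 --cidr 10.0.0.0/24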
