Deployment Architecture

Why is the KV Store status showing as "starting" in a Search Head Cluster?

Explorer

We created a KV Store on a search head in a clustered architecture by adding the files collections.conf and transforms.conf.
However, we can't access the KV Store using the inputlookup command and get the error below.

"Error in 'inputlookup' command: External command based lookup 'kvstorecoll_lookup' is not available because KV Store initialization has not completed yet. Please try again later.
The search job has failed due to an error. You may be able to view the job in the Job Inspector."
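
For reference, a KV Store lookup like this is usually defined with a pair of stanzas along the lines below; the collection name and field names are only illustrative, inferred from the lookup name in the error:

in "collections.conf"
[kvstorecoll]

in "transforms.conf"
[kvstorecoll_lookup]
external_type = kvstore
collection = kvstorecoll
fields_list = _key, field1, field2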

When we checked the status of the KV Store using the curl command below, it showed as "starting".

"curl -k -s https://localhost:8089/services/server/info | grep kvStore"

We also checked mongod.log, deleted the log, and retried, but with no success.
We checked the SSL certificate validity and found it to be "notAfter=Dec 9 19:01:45 2019 GMT".

We could not trace the exact reason for the problem. Could you please suggest other options we should try?

1 Solution

Splunk Employee

This could happen because you didn't have an SHC captain when the search was started. That's why the KV Store is stuck in "starting": it can't get to "ready" because the SHC captain is the one that tells the KV Store which members are available for the replica set.
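
A quick way to confirm whether a captain is currently elected is to check the cluster status from any member (the credentials below are placeholders):
$ ./bin/splunk show shcluster-status -auth admin:changeme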

Follow the steps below to correct the situation:
1. Back up $SPLUNK_HOME on all members!
2. Stop all SHC instances.
3. Run the following commands on all members:
$ rm -rf $SPLUNK_HOME/var/run/splunk/_raft/*

$ splunk clean kvstore --cluster
> This does not delete the database, only the cluster info.
4. Choose one member where you think the KV Store worked before. Edit as below,
in "$SPLUNK_HOME/etc/system/local/server.conf"
[shclustering]
replication_factor=1
5. Start this member.
6. Bootstrap the SHC with just this member:
$ ./bin/splunk bootstrap shcluster-captain -servers_list "https://THIS_MEMBER_URL:8089" -auth admin:changeme
7. Verify the SHC status:
$ ./splunk show shcluster-status
8. Verify the KV Store status; you should see "ready":
$ curl -k -s https://localhost:8089/services/server/info | grep kvStore
9. Stop this instance and change replication_factor to whatever it was before.
in "$SPLUNK_HOME/etc/system/local/server.conf"
[shclustering]
replication_factor=what it was before
10. Clean out the folder $SPLUNK_HOME/var/run/splunk/_raft/ on this instance.
11. Start ALL instances.
12. Bootstrap the SHC with all members:

$ ./bin/splunk bootstrap shcluster-captain -servers_list "https://server1:8089,https://server2:8089,https://server3:8089,https://server4:8089" -auth admin:changeme
> Replace server1 through server4 with your own member info.
13. Verify the SHC status:
$ ./splunk show shcluster-status
14. Verify the KV Store status from all members (see the loop sketch after these steps):
$ curl -k -s https://localhost:8089/services/server/info | grep kvStore
or:
$ ./splunk show kvstore-status
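
If it helps, here is a minimal shell sketch for running that check against every member in one pass; the member hostnames and credentials below are placeholders, so adjust them to your environment:
$ for m in server1 server2 server3 server4; do
    echo "== $m =="
    curl -k -s -u admin:changeme "https://$m:8089/services/server/info" | grep kvStore
  done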


Engager

If "$ curl -k -s https://localhost:8089/services/server/info | grep kvStore" doesn't work for you in sylim response, then you can use the following command to determine the status.

splunk show shcluster-status -auth username:password

Path Finder

I've just found this to be true while following this process to clear out the KV Store issue. Curl didn't work, but show shcluster-status worked fine.


Splunk Employee

Thanks for writing this up. Not only did it fix my problem, I also learned some nice tricks and a useful workflow.


Path Finder

I followed the steps, but my status is still "starting" on the captain, replicationStatus is "Startup" on one member and "Down" on the other, even though it was "ready" when I brought up the captain solo. How long should it take to replicate?


Path Finder

An issue I've found is that port 8191 for mongod has to be open so the search heads in the cluster can replicate the KV Store data. I ran through the helpful procedure above by sylim, and when everything came up, I got a notice that the non-captain cluster members were getting "failed with No route to host".

sudo /usr/bin/firewall-cmd --add-port=8191/tcp
sudo /usr/bin/firewall-cmd --runtime-to-permanent
sudo /usr/bin/firewall-cmd --list-ports --permanent

Running the above fixed the port issue and all is well.
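
If you hit the same "No route to host" symptom, a quick way to confirm whether port 8191 is reachable from another member before changing anything else (the hostname below is a placeholder, and this assumes nc is installed):

$ nc -zv other-member.example.com 8191

and to confirm that mongod is listening locally on that port:

$ ss -tlnp | grep 8191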

I had the same problem and found that the KV Store was waiting for a response from peers on port 8191, which was not allowed in the security group. I did not need to follow the process to reassign the captain. Thanks sherm77
