We created a KV Store collection on a search head in a clustered architecture by adding collections.conf and transforms.conf.
But we can't access the KV Store using the inputlookup command, and we get the error below.
"Error in 'inputlookup' command: External command based lookup 'kvstorecoll_lookup' is not available because KV Store initialization has not completed yet. Please try again later.
The search job has failed due to an error. You may be able to view the job in the Job Inspector."
When we checked the status of the KV Store with the curl command below, it was reported as "starting":
"curl -k -s https://localhost:8089/services/server/info | grep kvStore"
We also checked mongod.log, deleted the log, and retried, but with no success.
We checked the SSL certificate validity; the certificate shows "notAfter=Dec 9 19:01:45 2019 GMT".
We could not trace the exact cause of the problem. Could you suggest other options we should try?
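For reference, the checks described above can be reproduced with commands along these lines; the log and certificate paths are the stock defaults and only an assumption, so adjust them if your deployment overrides them:
# KV Store / mongod log on the affected search head
$ tail -n 50 $SPLUNK_HOME/var/log/splunk/mongod.log
# KV Store related messages in splunkd.log
$ grep -i kvstore $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -n 20
# Expiry date of the default server certificate (the KV Store may be configured to use a different one in server.conf)
$ openssl x509 -enddate -noout -in $SPLUNK_HOME/etc/auth/server.pem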
- If you still have a primary or secondary and only some members went out of sync, use the resync method in this doc:
https://docs.splunk.com/Documentation/Splunk/latest/Admin/ResyncKVstore
- If you think the KV Store cluster is broken, such as no primary or secondary at all, and it needs recovery, then follow the procedure below.
This can happen when there was no SHC captain at the time the search head started. That is why the KV Store stays in "starting" and never reaches "ready": the SHC captain is the one that tells the KV Store which members are available for the replica set.
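As a quick sanity check before any recovery, it is worth confirming whether the cluster currently has an elected captain at all; a minimal sketch using the CLI commands referenced elsewhere in this thread (credentials are placeholders):
# Run on any SHC member; look for the captain information in the output.
$ splunk show shcluster-status -auth admin:password
# Local KV Store status; "starting" here together with no elected captain above points at captain election rather than at mongod itself.
$ splunk show kvstore-status -auth admin:password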
Follow the steps below to correct the situation:
1. Back up $SPLUNK_HOME on all members!
2. Stop all SHC instances.
3. Run the following commands on all members:
$ rm -rf $SPLUNK_HOME/var/run/splunk/_raft/*
$ splunk clean kvstore --cluster
> This does not delete the database; it only deletes the cluster information.
4. Choose one member where you think the KV Store worked before, and edit as below
in "$SPLUNK_HOME/etc/system/local/server.conf":
[shclustering]
replication_factor=1
5. Start this member.
6. Bootstrap the SHC with just this member:
$ ./bin/splunk bootstrap shcluster-captain -servers_list "https://THIS_MEMBER_URL:8089" -auth admin:password
7. Verify the SHC status:
$ ./splunk show shcluster-status
8. Verify the KV Store status; you should see "ready":
$ curl -k -s https://localhost:8089/services/server/info | grep kvStore
OR $ ./splunk show kvstore-status
9. Stop this instance and change replication_factor to whatever it was before.
in "$SPLUNK_HOME/etc/system/local/server.conf"
[shclustering]
replication_factor=what it was before
10. Clean folder $SPLUNK_HOME/var/run/splunk/_raft/ on this instance.
11. Start ALL instances.
12. Bootstrap the SHC with all members:
$ ./bin/splunk bootstrap shcluster-captain -servers_list "https://server1:8089,https://server2:8089,https://server3:8089,https://server4:8089" -auth admin:changeme
> Replace server1~4 with your own member information.
13. Verify the SHC status:
$ ./splunk show shcluster-status
14. Verify the KV Store status on all members (see the loop sketch after this list):
$ curl -k -s https://localhost:8089/services/server/info | grep kvStore
OR $ ./splunk show kvstore-status
15. If any data or collections are missing, restore them from your latest good backup with the restore command; details are in the doc below:
https://docs.splunk.com/Documentation/Splunk/8.0.5/Admin/BackupKVstore
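For step 14, a minimal sketch that checks the KV Store status on every member in one pass; the member names, credentials, and management port are placeholders, and the -u option is there in case your endpoint requires authentication:
$ for m in server1 server2 server3 server4; do echo "== $m =="; curl -k -s -u admin:changeme "https://$m:8089/services/server/info" | grep kvStore; done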
We use a private cloud, and in fact for this "KV Store status is showing as starting" issue, allowing ingress to port 8191 fixed the problem. There was no need to run the steps mentioned by sylim_splunk in my case.
If "$ curl -k -s https://localhost:8089/services/server/info | grep kvStore" doesn't work for you in sylim response, then you can use the following command to determine the status.
splunk show shcluster-status -auth username:password
I've just found this to be true in following this process to clear out the kvstore issue. Curl didn't work, but show shcluster-status worked fine.
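If the unauthenticated curl call fails, one possible reason is simply that the management endpoint wants credentials; an authenticated variant looks like this (username and password are placeholders):
$ curl -k -s -u admin:password https://localhost:8089/services/server/info | grep kvStore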
Stale product; it hasn't improved much in state management after 10 years.
After applying the changes from steps 9-12 (stopping the Splunk service, changing the replication factor back to what it was, and starting all SH members), the KV Store status goes back to "starting". The KV Store status was initially "ready" after applying steps 1-8. Is there any reason why this is happening?
Thanks for writing this up. Not only did it fix my problem, but I also learnt nice tricks and workflow.
I followed the steps, but my status is still "starting" on the captain, replicationStatus is "Startup" on one member and "Down" on the other, even though it was "ready" when I brought up the captain solo. How long should it take to replicate?
An issue I've found is that port 8191 for mongod has to be open so the search heads in the cluster can replicate the KV Store data. I ran through the helpful procedure above by sylim, and when everything came up, I got a notice that the non-captain cluster members were failing with "No route to host".
sudo /usr/bin/firewall-cmd --add-port=8191/tcp
sudo /usr/bin/firewall-cmd --runtime-to-permanent
sudo /usr/bin/firewall-cmd --list-ports --permanent
This fixed the port issue, and all is well.
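If you suspect the same thing, a quick reachability check before rebuilding anything; the host name below is a placeholder for another cluster member:
$ nc -zv other-member 8191
$ sudo ss -ltnp | grep 8191
> The first command checks that a peer's mongod port is reachable; the second confirms mongod is listening locally on 8191.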
Opened 8191 to resolve as well.
Happened on initial build.
I had the same problem and found that the KV Store was waiting for a response from peers on port 8191, which was not allowed on the security group. I did not need to follow the process to reassign the captain. Thanks, sherm77.