Fixing Splunk SH cluster where one member left (Disk Failure)

ramesh_babu71
Path Finder

Hi,
We had four members in a SH cluster (all on VMs) and the setup worked properly until yesterday. Today one of the VMs showed an error that it cannot power on one of the SH cluster members because its disk has been corrupted beyond repair.

As of now we can work with 3 SH cluster members, as they can still elect a captain and support our requirements well enough. However, while pushing updates via the deployer we get an error that it cannot reach this (down) SH, and app deployment fails:

./splunk apply shcluster-bundle --answer-yes -target https://splunkSH:8089 -auth username:Password

Error while deploying apps to first member: Error while fetching apps baseline on target=https://192.x.x.x:8089: Network-layer error: No route to host

Please let me know how I can remove the entry for this obsolete SH member from the cluster list.

1 Solution

mayurr98
SplunkTrust

Hey, follow these steps to remove a member from the cluster.
Go to /opt/splunk/bin.
1. Remove the member.

To run the splunk remove command from another member, use this version:

./splunk remove shcluster-member -mgmt_uri <URI>:<management_port>

Note the following:

mgmt_uri is the management URI of the member being removed from the cluster.

By removing the instance from the search head cluster, you automatically remove it from the KV store. To confirm that this instance has been removed from the KV store, run splunk show kvstore-status on any remaining cluster member. The instance should not appear in the set of results. If it does appear, there might be problems with the health of your search head cluster.
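To script that verification, one could grep the kvstore-status output for the removed member's address. Below is a minimal sketch; the 192.0.2.x addresses and the sample output are placeholders (assumptions), not taken from a live cluster:

```shell
#!/bin/sh
# Hedged sketch: check that the removed member no longer appears in
# KV store status. On a real cluster you would capture live output:
#   KVSTATUS=$(/opt/splunk/bin/splunk show kvstore-status -auth admin:changeme)
# Here we simulate it with placeholder addresses.
KVSTATUS='Enabled KV store members:
    192.0.2.11:8191
    192.0.2.12:8191
    192.0.2.13:8191'

REMOVED_IP="192.0.2.14"   # placeholder: management IP of the removed member

if printf '%s\n' "$KVSTATUS" | grep -q "$REMOVED_IP"; then
    echo "WARNING: $REMOVED_IP still listed -- cluster may be unhealthy"
else
    echo "OK: $REMOVED_IP no longer listed in KV store"
fi
```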

Let me know if this helps!


ramesh_babu71
Path Finder

@mayurr98
I read this document, but it specifically asks to keep the instance we are removing (its splunk service) running.

Remove the member
Caution: Do not stop the member before removing it from the cluster.

However, in my case that can't be done, as the server is already down and has been deleted (from the VM console).

Is it possible to run this command on another cluster member or the captain even if the target server is down?


mayurr98
SplunkTrust

Hey Ramesh, try this out.

The solution is to run the splunk resync kvstore command; thereafter run ./splunk remove shcluster-member -mgmt_uri <URI>:<management_port> on another SH member, passing the management URI of the one you want to remove.

As long as the current SHC is stable, in your situation you could potentially rebuild the SHC by following the doc below:
http://docs.splunk.com/Documentation/Splunk/6.5.2/DistSearch/Handleraftissues#Fix_the_entire_cluster

If only the KV store is complaining and the SHC itself is no longer looking for the removed SH node, a KV store resync will remove the node from the list. Please follow the doc below:
http://docs.splunk.com/Documentation/Splunk/6.5.2/Admin/ResyncKVstore
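Putting the advice above together, the recovery sequence might look like the sketch below. All values (paths, IP, credentials, host names) are placeholders, and the splunk commands are commented out so the script is a safe dry run without a live cluster:

```shell
#!/bin/sh
# Hedged sketch of the recovery sequence for a dead SHC member.
# Everything below is a placeholder assumption, not from this thread's cluster.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DEAD_MGMT_URI="https://192.0.2.14:8089"   # failed member's management URI

# 1. On any surviving member, drop the dead member from the cluster:
#    "$SPLUNK_HOME/bin/splunk" remove shcluster-member -mgmt_uri "$DEAD_MGMT_URI"
# 2. If the KV store still lists the dead member, resync it:
#    "$SPLUNK_HOME/bin/splunk" resync kvstore
# 3. Re-run the deployer push that failed earlier:
#    "$SPLUNK_HOME/bin/splunk" apply shcluster-bundle --answer-yes \
#        -target https://aliveSH:8089 -auth admin:changeme

echo "dry run: would remove $DEAD_MGMT_URI via $SPLUNK_HOME/bin/splunk"
```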


ramesh_babu71
Path Finder

Thanks @harsmarvania57 & @mayurr98
The step below worked even with the target Splunk instance down.

./splunk remove shcluster-member -mgmt_uri <URI>:<management_port>

Now the console and the Splunk bundle update no longer show the message about the instance being down. I didn't have to run the KV store resync; if I do it later on (to fix related issues) I will update this thread.

I believe Splunk should update the documentation to note that this works even when you need to remove the entry of an SHC member that left abruptly. 🙂


mayurr98
SplunkTrust

Hey, thanks a lot! I am glad that my solution helped you!


pyro_wood
SplunkTrust

Hi,

I'm not even sure how your setup with 4 SHs even worked, as I believe you need an odd number of SHs (3, 5, 7, ...) at all times.


ramesh_babu71
Path Finder

@harsmarvania57
I read that document, but it says to keep the instance we are removing (its splunk service) running.

Remove the member
Caution: Do not stop the member before removing it from the cluster.

However, in my case that can't be done, as the server is already down and has been deleted (from the VM console).


harsmarvania57
SplunkTrust

If it is a test environment, then I'd try running that command anyway; I know the docs say Splunk should be running on the member you are trying to remove.


ramesh_babu71
Path Finder

Hmm... it was working fine. We use this environment for testing Splunk apps, and it worked fine until now. We had 3 CentOS servers and 1 Windows server in this setup. We even upgraded from 6.6 to 7.0 and it still worked fine, until the disk crash on the Windows server.

The other instances in the cluster are still working fine, apart from the issue that it constantly shows the message about its missing Windows-server amigo 😞


harsmarvania57
SplunkTrust

Hi @ramesh_babu71,

Please follow this document https://docs.splunk.com/Documentation/Splunk/7.0.1/DistSearch/Removeaclustermember to remove the member from the SH cluster, and then try deploying apps from the deployer again.
