Solved: 6.5.2 unable to decommission an indexer peer node. Keeps coming back online after offline command is issued.

acamilo_2 · ‎01-11-2018

What I've tried:

On the indexer:
splunk offline --enforce-counts
On the master, observing splunk_monitoring_console/indexer_clustering_status:
the indexer goes to decommissioning but goes back to on after a few seconds.

On the indexer:
splunk offline
On the master, observing splunk_monitoring_console/indexer_clustering_status:
the indexer goes away, but after a few seconds it returns to on.

On the master:
splunk edit cluster-config -restart_timeout 1800
then restart Splunk.
On the indexer:
splunk offline --enforce-counts
On the master, observing splunk_monitoring_console/indexer_clustering_status:
the indexer goes to decommissioning but goes back to on after a few seconds.

Thanks in advance.
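
To pin down exactly when the peer flips back, the status can be polled from the master instead of watching the console page. The following is a minimal sketch, not part of the original post: watch_status is a hypothetical helper that runs whatever status command you give it and prints a timestamped line each time the output changes.

```shell
#!/bin/sh
# Poll a status command and print a timestamped line whenever its output changes.
# Usage: watch_status <interval_seconds> <max_polls> <command> [args...]
# (watch_status is a hypothetical helper, not a Splunk command.)
watch_status() {
    interval=$1
    max_polls=$2
    shift 2
    last=""
    i=0
    while [ "$i" -lt "$max_polls" ]; do
        # Run the caller-supplied command and capture its output.
        current=$("$@")
        if [ "$current" != "$last" ]; then
            printf '%s %s\n' "$(date '+%H:%M:%S')" "$current"
            last=$current
        fi
        i=$((i + 1))
        # Sleep between polls, but not after the final one.
        if [ "$i" -lt "$max_polls" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

On the master this could be driven by something like watch_status 5 60 sh -c 'splunk show cluster-status | grep -A 3 idx01', where idx01 is a placeholder peer name and the grep context is a guess that needs adjusting to the output of your version.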

nickhills · ‎01-12-2018

You might want to consider opening a ticket with support for this.

But..
Depending on your data volumes, if you have good replication and search factors, you could just fail the node (i.e. pull the plug on it).
Your cluster will rebuild the rep/search factor, but that's all offline --enforce-counts is doing (albeit more gracefully, and without warning you that your cluster is inconsistent).

I had a peer which never finished decom after being left for 14 days. In the end we just turned it off, and apart from about 30 seconds of fixup on its internal logs, the cluster was totally happy and SF/RF was met.
Not saying this is the correct approach, but the reason you have a cluster is to tolerate failures like this.

If my comment helps, please give it a thumbs up!

risfehani · ‎08-20-2019

Had this issue with a 7.04 indexer/peer. I restarted Splunk (I had to kill -9 the old restart processes, as they were causing the restart to hang too). Once it had restarted, I re-ran 'splunk offline --enforce-counts' and it worked fine.

acamilo_2 · ‎03-19-2018

I inherited this deployment and eventually someone who worked on the original project told me to just shut it down and remove it from the master once it lost connectivity. Seriously.
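
For anyone following the same route, the sequence can be written down before touching anything. This is a sketch under assumptions, not the poster's actual procedure: plan_forced_removal is a hypothetical helper that only prints commands, the GUID is a placeholder, and splunk remove cluster-peers should be checked against the documentation for your version (as far as I know, the master only accepts it once the peer is already down).

```shell
#!/bin/sh
# Print (do not run) the commands for a forced removal of one peer.
# $1 = GUID of the peer; look it up on the master first.
# (plan_forced_removal is a hypothetical helper, not a Splunk command.)
plan_forced_removal() {
    guid=$1
    if [ -z "$guid" ]; then
        echo "usage: plan_forced_removal <peer-guid>" >&2
        return 1
    fi
    echo "# on the peer being retired:"
    echo "splunk stop"
    echo "# on the master, once the peer shows as down:"
    echo "splunk remove cluster-peers -peers $guid"
}
```

Printing the plan rather than executing it keeps the destructive step in the operator's hands, which matters here because bucket fixup starts as soon as the peer drops out.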

pfender · ‎06-26-2018

When you just pull the plug, be aware that searches won't get the full set of data as long as bucket fixing is going on. That might be a consideration depending on the data you deal with. But I had the same experience with 6.5.5: all the indexers to be decom'ed ran out of disk space in their ../run/searchpeers folder and ended up with a log message like:

ERROR SearchProcessRunner - launcher_thread=0 runSearch exception: PreforkedSearchProcessException: can't create preforked search process: Cannot send after transport endpoint shutdown

which in the end caused more trouble than just pulling the plug.
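
Since the failure above came from a full disk, it may be worth checking free space on the peers before starting a long decommission. A minimal sketch, assuming a POSIX df; avail_kb and has_free_kb are hypothetical helper names and the threshold is up to you.

```shell
#!/bin/sh
# Print the available space, in KiB, on the filesystem holding directory $1.
# df -P guarantees one line per filesystem; column 4 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 if at least $2 KiB are free under directory $1, non-zero otherwise.
has_free_kb() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}
```

For example, has_free_kb /path/to/run/searchpeers 1048576 || echo "less than 1 GiB free" would flag a peer that is close to filling up; substitute the real searchpeers path for the placeholder.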

davidda · ‎06-25-2018

I just had the same issue. I found that you can click on the greyed-out bucket, which takes you to the "bucket Status" menu.
On that menu you will see all the buckets that are waiting to be replicated; click on Action and choose the "Roll" option.
This forces the bucket to be replicated.
