What I've tried:
On the indexer:
splunk offline --enforce-counts
On the master, observe splunk_monitoring_console/indexer_clustering_status:
The indexer goes to decommissioning, but goes back to on after a few seconds.
On the indexer:
splunk offline
On the master, observe splunk_monitoring_console/indexer_clustering_status:
The indexer goes away, but after a few seconds it returns to on.
On the master:
splunk edit cluster-config -restart_timeout 1800
restart splunk
On the indexer:
splunk offline --enforce-counts
On the master, observe splunk_monitoring_console/indexer_clustering_status:
The indexer goes to decommissioning, but again goes back to on after a few seconds.
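For reference, the same peer state can also be checked from the master's CLI rather than the monitoring console; a minimal check, run on the master, is:
splunk show cluster-status
which lists each peer's status and whether the replication and search factors are met.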
Thanks in advance.
You might want to consider opening a ticket with support for this.
But..
Depending on your data volumes, if you have good replication and search factors, you could just fail the node (i.e. pull the plug on it).
Your cluster will rebuild the replication/search factor, but that's all offline --enforce-counts
is doing anyway (albeit more gracefully, and without warning you that your cluster is inconsistent).
I had a peer which never finished decom after being left for 14 days. In the end we just turned it off, and apart from about 30 seconds of fixup on its internal logs, the cluster was totally happy and SF/RF was met.
Not saying this is the correct approach, but the reason you have a cluster is to tolerate failures like this.
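If you do go the pull-the-plug route, a rough sketch of the sequence (run the commands on the hosts indicated; <peer_guid> is a placeholder for whatever your master reports for that peer):
On the peer:
splunk stop
On the master, confirm the peer now shows as Down and note its GUID:
splunk list cluster-peers
On the master, remove the downed peer from the master's list:
splunk remove cluster-peers -peers <peer_guid>
The cluster will run its normal fixup tasks in the background to get back to the configured replication and search factors.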
Had this issue with a 7.0.4 indexer/peer; restarted Splunk (had to kill -9 the old restart processes, as they were causing the restart to hang too). Once restarted, I re-ran 'splunk offline --enforce-counts' and it worked fine.
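For anyone hitting the same hang, a minimal sketch of that sequence, run on the stuck peer (the process matching is an assumption; check what is actually hung on your host before killing anything):
ps aux | grep splunkd
kill -9 <pid_of_hung_process>
splunk restart
splunk offline --enforce-counts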
I inherited this deployment and eventually someone who worked on the original project told me to just shut it down and remove it from the master once it lost connectivity. Seriously.
When you just pull the plug, be aware that searches won't get the full set of data for as long as the bucket fixing goes on. That might be a consideration depending on the data you deal with. But I had the same experience with 6.5.5: all the indexers to be decommissioned ran out of disk space in their ../run/searchpeers folder and ended up with a log message like:
ERROR SearchProcessRunner - launcher_thread=0 runSearch exception: PreforkedSearchProcessException: can't create preforked search process: Cannot send after transport endpoint shutdown
which in the end caused more trouble than just pulling the plug.
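If you suspect the same disk-space problem, a quick check on the peer before and during decommissioning is worthwhile; a minimal sketch, assuming a default $SPLUNK_HOME layout:
du -sh $SPLUNK_HOME/var/run/searchpeers
df -h $SPLUNK_HOME
The first shows how much space the replicated search bundles are taking, the second how much is left on the volume.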
I just had the same issue. I found that if you click on the greyed-out bucket, you get to the "Bucket Status" menu.
Once you are in that menu, you will see all the buckets that are waiting to be replicated; click on Action and choose the "Roll" option.
It will force the bucket to be replicated.
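If the UI route isn't available, a related (though coarser) alternative is to roll all hot buckets of the affected index on the peer via the REST API; a sketch with the index name and credentials as placeholders:
curl -k -u admin:<password> -X POST https://localhost:8089/services/data/indexes/<index_name>/roll-hot-buckets
This rolls every hot bucket of that index to warm, rather than just the single bucket the Bucket Status menu lets you target.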