Deployment Architecture

6.5.2: unable to decommission an indexer cluster peer node; it keeps coming back online after the offline command is issued.

New Member

What I've tried:

  1. On the indexer:
    splunk offline --enforce-counts
    On the master, observe splunk_monitoring_console/indexer_clustering_status.
    The indexer's status changes to Decommissioning but goes back to Up after a few seconds.

  2. On the indexer:
    splunk offline
    On the master, observe splunk_monitoring_console/indexer_clustering_status.
    The indexer disappears but, after a few seconds, returns to Up.

  3. On the master:
    splunk edit cluster-config -restart_timeout 1800
    Restart Splunk.
    On the indexer:
    splunk offline --enforce-counts
    On the master, observe splunk_monitoring_console/indexer_clustering_status.
    The indexer's status changes to Decommissioning but goes back to Up after a few seconds.
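The status bouncing described in the steps above can also be watched from the cluster master's REST API instead of the monitoring console dashboard. A minimal sketch, using the standard `/services/cluster/master/peers` endpoint; the manager URI, auth header, and peer GUID are placeholders you must fill in for your environment:

```python
# Sketch: query the cluster master for one peer's status
# ("Up", "Decommissioning", "Down", ...). URI/auth/GUID are placeholders.
import json
import urllib.request


def peer_status(manager_uri, auth_header, peer_guid,
                opener=urllib.request.urlopen):
    """Return the status string the cluster master reports for one peer GUID."""
    req = urllib.request.Request(
        f"{manager_uri}/services/cluster/master/peers/{peer_guid}"
        "?output_mode=json",
        headers={"Authorization": auth_header},
    )
    with opener(req) as resp:
        entry = json.load(resp)["entry"][0]
    return entry["content"]["status"]
```

Polling this in a loop (e.g. every few seconds with `time.sleep`) makes it easy to log exactly when the peer flips from Decommissioning back to Up.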

Thanks in advance.

1 Solution

Ultra Champion

You might want to consider opening a ticket with support for this.

But...
Depending on your data volumes, if you have good replication and search factors, you could just fail the node (i.e. pull the plug on it).
Your cluster will rebuild the replication/search factor, but that's all offline --enforce-counts is doing anyway (albeit more gracefully, and without the cluster ever warning you that it is inconsistent).

I had a peer which never finished decommissioning after being left for 14 days. In the end we just turned it off and, apart from about 30 seconds of fixup on its internal logs, the cluster was totally happy and SF/RF were met.
Not saying this is the correct approach, but the reason you have a cluster is to tolerate failures like this.
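Before pulling the plug, it is worth confirming that the cluster is currently meeting both factors, so it can absorb the loss. A sketch against the master's `/services/cluster/master/generation` endpoint; the field names are taken from the standard Splunk REST API (treat them as assumptions and verify against your version), and the URI and auth header are placeholders:

```python
# Sketch: ask the cluster master whether search factor and replication
# factor are currently met. Endpoint field names are assumptions from the
# standard REST API; URI and auth header are placeholders.
import json
import urllib.request


def factors_met(manager_uri, auth_header, opener=urllib.request.urlopen):
    """Return (search_factor_met, replication_factor_met) as booleans."""
    req = urllib.request.Request(
        f"{manager_uri}/services/cluster/master/generation?output_mode=json",
        headers={"Authorization": auth_header},
    )
    with opener(req) as resp:
        content = json.load(resp)["entry"][0]["content"]
    return (bool(int(content["search_factor_met"])),
            bool(int(content["replication_factor_met"])))
```

If either flag is false, killing the peer widens an existing gap rather than creating a recoverable one.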


New Member

Had this issue with a 7.0.4 indexer/peer; restarted Splunk (I had to kill -9 the old restart processes, as they were causing the restart to hang too). Once restarted, I re-ran 'splunk offline --enforce-counts' and it worked fine.



New Member

I inherited this deployment and eventually someone who worked on the original project told me to just shut it down and remove it from the master once it lost connectivity. Seriously.


Explorer

When you just pull the plug, be aware that searches won't return the full set of data for as long as bucket fixing is going on. That might be a consideration depending on the data you deal with. But I had the same experience with 6.5.5. All the indexers to be decommissioned ran out of disk space in their ../run/searchpeers folder and ended up with a log message like:

ERROR SearchProcessRunner - launcher_thread=0 runSearch exception: PreforkedSearchProcessException: can't create preforked search process: Cannot send after transport endpoint shutdown

which in the end caused more trouble than just pulling the plug.

Explorer

I just had the same issue. I found that if you click on the greyed-out bucket, you get to the "Bucket Status" menu.
Once you are in that menu you will see all the buckets that are waiting to be replicated; click Action and choose the "Roll" option.
It will force the bucket to roll so it can be replicated.
