6.5.2: Unable to decommission an indexer peer node. It keeps coming back online after the offline command is issued.

acamilo_2
New Member

What I've tried:

  1. On the indexer:
    splunk offline --enforce-counts
    On the master, observing splunk_monitoring_console/indexer_clustering_status
    indexer goes to decommissioning but goes back to on after a few seconds.

  2. On the indexer:
    splunk offline
    On the master, observe splunk_monitoring_console/indexer_clustering_status
    indexer goes away but after a few seconds, it returns to on.

  3. On the master:
    splunk edit cluster-config -restart_timeout 1800
    restart splunk
    On the indexer:
    splunk offline --enforce-counts
    On the master, observe splunk_monitoring_console/indexer_clustering_status
    indexer goes to decommissioning but goes back to on after a few seconds.

Thanks in advance.


nickhills
Ultra Champion

You might want to consider opening a ticket with support for this.

But..
Depending on your data volumes, if you have good replication and search factors, you could just fail the node (i.e. pull the plug on it).
Your cluster will rebuild its replication and search factors, which is all offline --enforce-counts is doing anyway (albeit more gracefully, and without warning you that your cluster is inconsistent).

I had a peer which never finished decom after being left for 14 days. In the end we just turned it off, and apart from about 30 seconds of fixup on its internal logs, the cluster was totally happy and SF/RF was met.
Not saying this is the correct approach, but the reason you have a cluster is to tolerate failures like this.

If my comment helps, please give it a thumbs up!


risfehani
New Member

Had this issue with a 7.04 indexer/peer. I restarted Splunk (I had to kill -9 the old restart processes, as they were causing the restart to hang too). Once it had restarted, I re-ran 'splunk offline --enforce-counts' and it worked fine.
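
For anyone who wants to script that, here is a minimal sketch of the restart-then-retry sequence. SPLUNK_BIN and the function name are my own placeholders, not anything from Splunk, and it assumes the splunk binary is reachable on the indexer. It does not attempt the kill -9 cleanup, which needs to be judged by hand.

```shell
#!/bin/sh
# Hypothetical wrapper; SPLUNK_BIN is an assumption, point it at your splunk binary.
SPLUNK_BIN="${SPLUNK_BIN:-splunk}"

# Run on the indexer peer: restart splunkd first, then retry the decommission.
# The offline command is only attempted if the restart exits successfully.
restart_then_offline() {
    "$SPLUNK_BIN" restart && "$SPLUNK_BIN" offline --enforce-counts
}
```

Call restart_then_offline on the peer, then watch the clustering status page on the master as before.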

acamilo_2
New Member

I inherited this deployment and eventually someone who worked on the original project told me to just shut it down and remove it from the master once it lost connectivity. Seriously.
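
In CLI terms, that roughly corresponds to the sketch below. The function names and SPLUNK_BIN are hypothetical placeholders, and the peer GUID is something you have to look up yourself (for example from the output of show cluster-status on the master). Treat it as an outline under those assumptions, not a supported procedure.

```shell
#!/bin/sh
# Hypothetical helpers; SPLUNK_BIN and the function names are assumptions.
SPLUNK_BIN="${SPLUNK_BIN:-splunk}"

# Step 1, run on the indexer peer: stop splunkd so the master sees the peer go down.
stop_peer() {
    "$SPLUNK_BIN" stop
}

# Step 2, run on the master once the peer shows as down:
# print the cluster status, then drop the peer from the master's list by GUID.
remove_peer_from_master() {
    peer_guid="${1:-}"
    if [ -z "$peer_guid" ]; then
        echo "usage: remove_peer_from_master <peer-guid>" >&2
        return 1
    fi
    "$SPLUNK_BIN" show cluster-status || return 1
    "$SPLUNK_BIN" remove cluster-peers -peers "$peer_guid"
}
```

The status call comes first so you can confirm the peer really is down before removing it.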

pfender
Explorer

When you just pull the plug, be aware that searches won't return the full set of data while bucket fixing is in progress. That might be a consideration depending on the data you deal with. But I had the same experience with 6.5.5: all the indexers being decommissioned ran out of disk space in their ../run/searchpeers folder and ended up logging a message like:

ERROR SearchProcessRunner - launcher_thread=0 runSearch exception: PreforkedSearchProcessException: can't create preforked search process: Cannot send after transport endpoint shutdown

which in the end caused more trouble than just pulling the plug.
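
If you want to keep an eye on that folder while a decommission is running, something like the following might help. It is only a sketch: SEARCHPEERS_DIR is a placeholder you must set to the real directory on your indexer (the full path is not spelled out above), and the 90% default threshold is an arbitrary choice of mine.

```shell
#!/bin/sh
# SEARCHPEERS_DIR is a placeholder: set it to the real searchpeers directory.
# It defaults to the current directory only so the script runs as-is.
SEARCHPEERS_DIR="${SEARCHPEERS_DIR:-.}"

# Print the usage percentage (0-100) of the filesystem holding the given directory.
disk_use_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 with a warning on stderr if usage is at or above the threshold
# (first argument, default 90); otherwise print an OK line and return 0.
check_searchpeers_space() {
    threshold="${1:-90}"
    used="$(disk_use_percent "$SEARCHPEERS_DIR")"
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $SEARCHPEERS_DIR filesystem is ${used}% full" >&2
        return 1
    fi
    echo "OK: $SEARCHPEERS_DIR filesystem is ${used}% full"
}
```

Running check_searchpeers_space from cron or a watch loop on each peer being decommissioned would at least flag the problem before the disk fills up.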

davidda
Explorer

I just had the same issue. I found that you can click on the greyed-out bucket to get to the "Bucket Status" menu.
In that menu you will see all the buckets that are waiting to be replicated. Click on Action and choose the "Roll" option.
This forces the bucket to be replicated.
