Indexer fails to join back cluster due to standalone buckets?

keio_splunk
Splunk Employee

An indexer in the cluster was shut down abruptly and has since failed to rejoin the cluster. Please provide the steps to clean up the standalone buckets so that the indexer can rejoin the cluster.

Warning message in splunkd.log:
xx-xx-xxxx xx:xx:xx.xxx -0500 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=xxx.xxx.xxx:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line=“Internal Server Error” socket_error=“No error” remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: bucket already added as clustered, peer attempted to add again as standalone. guid=C199873F-6E72-43D8-B54F-554750ACE904 bid= mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=403F2E7869E35F5BB8C945D993035AA2 add_type=Initial-Add base_generation_id=0 batch_serialno=7 batch_size=18 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=403F2E7869E35F5BB8C945D993035AA2 mgmt_port=8089 name=C199873F-6E72-43D8-B54F-554750ACE904 register_forwarder_address= register_replication_address= register_search_address= replication_port=8003 replication_use_ssl=0 replications= server_name=xxx.xxx.xxx site=default splunk_version=7.2.0 splunkd_build_number=8c86330ac18 status=Up } ].
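The two values that matter in this warning are the `guid=` and `bid=` fields in the rejection reason. As a rough sketch, they can be pulled out of splunkd.log with `grep` and `sed`. The function names below are hypothetical helpers, not Splunk commands, and the patterns assume the message layout shown above.

```shell
# Hypothetical helpers (not part of Splunk): read splunkd.log lines on stdin.

# Print the unique guid= values from "added again as standalone" rejections.
extract_peer_guid() {
  grep 'peer attempted to add again as standalone' \
    | sed -n 's/.*guid=\([0-9A-Fa-f-]\{36\}\) bid=.*/\1/p' \
    | sort -u
}

# Print the bucket IDs (index~localid~guid) named in those rejections.
# Assumes the bid value ends at the closing parenthesis, as in the message above.
extract_rejected_bids() {
  grep 'peer attempted to add again as standalone' \
    | sed -n 's/.*bid= *\([^ )]*\)).*/\1/p' \
    | sort -u
}
```

Usage, assuming the default log location: `extract_peer_guid < "$SPLUNK_HOME/var/log/splunk/splunkd.log"`.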

1 Solution

keio_splunk
Splunk Employee

When an indexer is disabled as a search peer, its hot buckets roll to warm using the standalone bucket naming convention. When the peer is later re-enabled, the cluster master still remembers those buckets as clustered and expects them to follow the clustered bucket naming convention. Because they do not, the master rejects the peer's request to rejoin the cluster. More details in Unable to disable and re-enable a peer.

Here are the steps to rename the standalone buckets to the clustered bucket convention:

  1. Search for the offending standalone buckets in the bucket directory (default location: $SPLUNK_HOME/var/lib/splunk/*/db/).
  2. Scan through the db folders of every index to find the standalone buckets. The problematic standalone buckets are named db_<newest_time>_<oldest_time>_<bucketid>, e.g. db_1550812574_1550720467_53
  3. Append the cluster master GUID to the standalone bucket names: rename db_<newest_time>_<oldest_time>_<bucketid> to db_<newest_time>_<oldest_time>_<bucketid>_<guid>, e.g. db_1550812574_1550720467_53_C199873F-6E72-43D8-B54F-554750ACE904 Note: guid=C199873F-6E72-43D8-B54F-554750ACE904 (the guid= value in the warning message).
  4. Restart the indexer; it will then rejoin the cluster.
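Before renaming anything, it may help to list what steps 1 to 3 would change. The following is a minimal dry-run sketch: `plan_bucket_renames` is a hypothetical helper name, the base directory is assumed to be the default $SPLUNK_HOME/var/lib/splunk, and the GUID is whatever value you have decided to append (a later reply in this thread discusses which GUID that is).

```shell
# Hypothetical helper: print "old -> new" for each standalone bucket directory.
# Dry run only; nothing on disk is renamed.
plan_bucket_renames() {
  base_dir=$1   # assumed default: "$SPLUNK_HOME/var/lib/splunk"
  guid=$2       # GUID to append, e.g. the guid= value from the warning
  for bucket in "$base_dir"/*/db/db_*; do
    [ -d "$bucket" ] || continue
    name=$(basename "$bucket")
    # Standalone names have exactly three numeric fields after "db_";
    # clustered names carry a fourth field, the GUID, and are skipped.
    if printf '%s\n' "$name" | grep -Eq '^db_[0-9]+_[0-9]+_[0-9]+$'; then
      printf '%s -> %s_%s\n' "$bucket" "$bucket" "$guid"
    fi
  done
}
```

Usage: `plan_bucket_renames "$SPLUNK_HOME/var/lib/splunk" C199873F-6E72-43D8-B54F-554750ACE904`. Review the printed list before doing the actual renames.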

abhirupS
Observer

To solve this problem, you need to find and rename the offending buckets. If there are many such buckets, renaming them manually is not practical.

How to find and rename the offending standalone buckets?

find . -type d -regextype posix-extended -regex '^.*db_[0-9]+_[0-9]+_[0-9]+$' -exec mv {} {}_C199873F-6E72-43D8-B54F-554750ACE904 \;

master guid=C199873F-6E72-43D8-B54F-554750ACE904

dm1
Contributor

How did you manage to find the standalone buckets using that naming convention? Can you please give an example?

edoardo_vicendo
Builder

Thank you, I had exactly the same issue. During an upgrade, with the Cluster Master in maintenance mode, the affected indexer had a storage-level outage and was then unable to rejoin the cluster.

I solved it with the proposed steps. I just wanted to add that not all buckets have to be renamed, only the ones that are replicated (for instance, in our environment, metrics and some other specific Splunk indexes are not).
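One way to see which indexes are replicated is to look at the effective `repFactor` setting per index. This is a sketch under assumptions: it assumes replication is controlled by `repFactor = auto` in indexes.conf, and `list_replicated_indexes` is a hypothetical helper that parses `btool`-style output from stdin.

```shell
# Hypothetical helper: read "[stanza]" / "key = value" lines on stdin and
# print the stanza names whose repFactor is "auto".
list_replicated_indexes() {
  awk '
    /^\[.*\]$/ { stanza = substr($0, 2, length($0) - 2); next }
    $1 == "repFactor" && $3 == "auto" { print stanza }
  '
}
```

Usage: `"$SPLUNK_HOME/bin/splunk" btool indexes list | list_replicated_indexes`. Note that an index's directory under var/lib/splunk can differ from its name (another reply in this thread mentions defaultdb), so check each index's homePath when mapping names to folders.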

rwsisson
Explorer

One correction: per the Splunk docs (and observation), the GUID is the GUID of the local indexer:

How the indexer stores indexes - Splunk Documentation

Look at the bucket naming convention section

  • <guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.
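To read that value on the peer, something like the following sketch could be used. `read_instance_guid` is a hypothetical helper, and it assumes instance.cfg contains a line of the form `guid = <value>`.

```shell
# Hypothetical helper: print the first guid value found in an instance.cfg file.
read_instance_guid() {
  cfg=$1   # e.g. "$SPLUNK_HOME/etc/instance.cfg"
  sed -n 's/^[[:space:]]*guid[[:space:]]*=[[:space:]]*\([0-9A-Fa-f-]*\).*/\1/p' "$cfg" \
    | head -n 1
}
```

Usage: `read_instance_guid "$SPLUNK_HOME/etc/instance.cfg"`. Comparing the result with the `guid=` value in the warning message is a quick way to confirm which GUID to append.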

cfcvendorsuppor
Explorer

Thanks! It helped me recover 2 failed nodes in my cluster.

esalesapns2
Communicator

Thanks, Keio! Clarification: in step #2, "Scan through the indexes db-folders" means var/lib/splunk/*/db/, not just var/lib/splunk/defaultdb/db/.

keio_splunk
Splunk Employee

Thanks for the clarification. I have revised the path to the indexes db-folders to $SPLUNK_HOME/var/lib/splunk/*/db/.
