Solved: Indexer fails to join back cluster due to standalone buckets?

keio_splunk · ‎03-12-2019

An indexer in the cluster was abruptly shut down and subsequently failed to rejoin the cluster. Please provide the steps to clean up the standalone buckets so that the indexer can rejoin the cluster.

Warning message in splunkd.log:
xx-xx-xxxx xx:xx:xx.xxx -0500 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=xxx.xxx.xxx:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: bucket already added as clustered, peer attempted to add again as standalone. guid=C199873F-6E72-43D8-B54F-554750ACE904 bid= mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=403F2E7869E35F5BB8C945D993035AA2 add_type=Initial-Add base_generation_id=0 batch_serialno=7 batch_size=18 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=403F2E7869E35F5BB8C945D993035AA2 mgmt_port=8089 name=C199873F-6E72-43D8-B54F-554750ACE904 register_forwarder_address= register_replication_address= register_search_address= replication_port=8003 replication_use_ssl=0 replications= server_name=xxx.xxx.xxx site=default splunk_version=7.2.0 splunkd_build_number=8c86330ac18 status=Up } ].
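
The remote_error part of this message names the bucket that the master rejected. As a rough sketch, the bid field can be pulled out and split with sed and cut. This assumes the bid has the form <index>~<bucket id>~<guid>, as it appears to in the message above; the variable names are only illustrative.

```shell
#!/bin/sh
# Sample text copied from the remote_error part of the warning above.
msg='(reason: bucket already added as clustered, peer attempted to add again as standalone. guid=C199873F-6E72-43D8-B54F-554750ACE904 bid= mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904).'

# Capture the value after "bid=", tolerating the space that follows it
# and stopping at the next space or closing parenthesis.
bid=$(printf '%s\n' "$msg" | sed -n 's/.*bid= *\([^ )]*\).*/\1/p')

# Split the bid on "~" (assumed layout: index, bucket id, guid).
index=$(printf '%s\n' "$bid" | cut -d'~' -f1)
bucket_id=$(printf '%s\n' "$bid" | cut -d'~' -f2)
guid=$(printf '%s\n' "$bid" | cut -d'~' -f3)

echo "index=$index bucket_id=$bucket_id guid=$guid"
```

With these values, the rejected bucket would be expected to be a directory ending in _314 under the db folder of the mi_batch index. Note that an index's directory name on disk does not always match the index name.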

keio_splunk · ‎03-12-2019

When an indexer is disabled as a search peer, its hot buckets are rolled to warm using the standalone bucket naming convention. When the peer is subsequently re-enabled, the cluster master still remembers those buckets as clustered and expects them to follow the clustered bucket naming convention. Because they do not, the master rejects the peer's request to rejoin the cluster. More details are in "Unable to disable and re-enable a peer".

Here are the steps to rename the standalone buckets to the clustered bucket convention:

1. Search for the offending standalone buckets in the bucket directory (default location: $SPLUNK_HOME/var/lib/splunk/*/db/).
2. Scan through the indexes db-folders to find the standalone buckets. Problematic standalone buckets are named db_<newest_time>_<oldest_time>_<bucketid>, e.g. db_1550812574_1550720467_53.
3. Append the cluster master GUID to the standalone bucket names: rename db_<newest_time>_<oldest_time>_<bucketid> to db_<newest_time>_<oldest_time>_<bucketid>_<guid>, e.g. db_1550812574_1550720467_53_C199873F-6E72-43D8-B54F-554750ACE904. Note: guid=C199873F-6E72-43D8-B54F-554750ACE904
4. Restart the indexer, and it will rejoin the cluster.
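
For the search-and-scan steps, a minimal sketch that only lists candidate buckets and changes nothing could look like this. It assumes the default layout, with buckets sitting directly under <index>/db/. The function name list_standalone_buckets is made up for illustration, and -mindepth/-maxdepth are find extensions rather than POSIX options.

```shell
#!/bin/sh
# List bucket directories whose names lack the trailing _<guid>.
# $1: directory that holds the per-index folders,
#     e.g. $SPLUNK_HOME/var/lib/splunk (default layout assumed).
list_standalone_buckets() {
    # Look only at directories directly under <index>/db/, then keep the
    # names that are exactly db_<digits>_<digits>_<digits>. Clustered
    # names carry an extra _<guid> suffix and are filtered out.
    find "$1" -mindepth 3 -maxdepth 3 -type d -path '*/db/db_*' |
        grep -E '/db_[0-9]+_[0-9]+_[0-9]+$'
}

# Example call (read-only):
#   list_standalone_buckets "$SPLUNK_HOME/var/lib/splunk"
```

Reviewing this list before renaming anything makes it easier to spot buckets that should be left alone.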

abhirupS · ‎01-09-2023

To solve this problem, you need to find and rename the offending buckets. If there are many such buckets, renaming them manually is not practical.

How to find and rename the offending standalone buckets?

find . -depth -type d -regextype posix-extended -regex '^.*db_[0-9]+_[0-9]+_[0-9]+$' -exec mv {} {}_C199873F-6E72-43D8-B54F-554750ACE904 \;

master guid=C199873F-6E72-43D8-B54F-554750ACE904
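
Because a bulk rename is hard to undo, it may be worth previewing it first. The sketch below is one possible variant, not the command above: it prints the planned renames and only performs them when a third argument "apply" is given. The function name rename_standalone_buckets and the "apply" switch are hypothetical. Pass whichever GUID applies in your case; a later reply in this thread points to the peer's own GUID.

```shell
#!/bin/sh
# Append _<guid> to standalone bucket directories found under a root.
# $1: directory to scan   $2: GUID to append   $3: "apply" to really rename
rename_standalone_buckets() (
    root=$1 guid=$2 mode=${3:-dry-run}

    # Snapshot the candidate list first so that renaming cannot disturb
    # the directory walk. Only db_<digits>_<digits>_<digits> names match.
    list=$(find "$root" -type d -name 'db_*' |
        grep -E '/db_[0-9]+_[0-9]+_[0-9]+$' || true)

    printf '%s\n' "$list" | while IFS= read -r bucket; do
        [ -n "$bucket" ] || continue
        echo "mv $bucket ${bucket}_$guid"
        if [ "$mode" = "apply" ]; then
            mv -- "$bucket" "${bucket}_$guid"
        fi
    done
)

# Preview only:
#   rename_standalone_buckets "$SPLUNK_HOME/var/lib/splunk" "<guid>"
# Perform the renames:
#   rename_standalone_buckets "$SPLUNK_HOME/var/lib/splunk" "<guid>" apply
```

Pointing the first argument at a single index directory instead of the whole tree limits the rename to that index.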

dm1 · ‎11-16-2021

How did you manage to find the standalone buckets using that naming convention? Can you please give an example?

edoardo_vicendo · ‎10-22-2021

Thank you, I had exactly the same issue. During the upgrade, with the Cluster Master in maintenance mode, the affected indexer had a storage-level outage and was then unable to rejoin the cluster.

I solved it with the proposed steps. I just wanted to add that not all buckets have to be renamed, only those in replicated indexes (for instance, in our environment the metrics indexes and certain other Splunk indexes are not replicated).

rwsisson · ‎06-04-2021

One correction: per the Splunk docs (and observation), the GUID is the GUID of the local indexer:

How the indexer stores indexes - Splunk Documentation

Look at the bucket naming convention section

<guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.
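
As a small sketch, the peer's GUID can be read from that file with sed. This assumes the file contains a line of the form "guid = <value>"; the function name read_instance_guid is only illustrative.

```shell
#!/bin/sh
# Print the first guid value found in an instance.cfg-style file.
# $1: path to the file, e.g. $SPLUNK_HOME/etc/instance.cfg
read_instance_guid() {
    # Match "guid = <hex and dashes>", allowing optional whitespace
    # around the key and the equals sign.
    sed -n 's/^[[:space:]]*guid[[:space:]]*=[[:space:]]*\([0-9A-Fa-f-]*\).*$/\1/p' "$1" |
        head -n 1
}

# Example call:
#   read_instance_guid "$SPLUNK_HOME/etc/instance.cfg"
```

Comparing this value with the guid in the splunkd.log warning is a quick way to confirm which GUID belongs in the bucket names.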

cfcvendorsuppor · ‎12-10-2019

Thanks! It helped me recover 2 failed nodes in my cluster.

esalesapns2 · ‎04-11-2019

Thanks, Keio! Clarification: in step #2, "Scan through the indexes db-folders" means var/lib/splunk/*/db/, not just var/lib/splunk/defaultdb/db/.

keio_splunk · ‎04-11-2019

Thanks for the clarification. I have revised the path to the indexes db-folders to $SPLUNK_HOME/var/lib/splunk/*/db/.
