An indexer in the cluster was abruptly shut down and subsequently failed to rejoin the cluster. Please help provide the steps to clean up the standalone buckets so that the indexer can rejoin the cluster.
warning message in splunkd.log:
xx-xx-xxxx xx:xx:xx.xxx -0500 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=xxx.xxx.xxx:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: bucket already added as clustered, peer attempted to add again as standalone. guid=C199873F-6E72-43D8-B54F-554750ACE904 bid= mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=403F2E7869E35F5BB8C945D993035AA2 add_type=Initial-Add base_generation_id=0 batch_serialno=7 batch_size=18 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=403F2E7869E35F5BB8C945D993035AA2 mgmt_port=8089 name=C199873F-6E72-43D8-B54F-554750ACE904 register_forwarder_address= register_replication_address= register_search_address= replication_port=8003 replication_use_ssl=0 replications= server_name=xxx.xxx.xxx site=default splunk_version=7.2.0 splunkd_build_number=8c86330ac18 status=Up } ].
When the indexer is disabled as a search peer, its hot buckets are rolled to warm using the standalone bucket naming convention. When the peer is subsequently re-enabled, the cluster master still remembers those buckets as clustered and expects them to be named with the clustered bucket convention; since they are not, it rejects the peer's request to rejoin the cluster. More details in Unable to disable and re-enable a peer.
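For illustration, the two naming conventions differ only by the trailing GUID. The directory names below are hypothetical examples (made-up epoch times and local ID, with the GUID taken from the warning message above):

Standalone bucket: db_<newest_time>_<oldest_time>_<localid>, e.g. db_1549227600_1549141200_314
Clustered bucket: db_<newest_time>_<oldest_time>_<localid>_<guid>, e.g. db_1549227600_1549141200_314_C199873F-6E72-43D8-B54F-554750ACE904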
Here are the steps to rename the standalone buckets to the clustered bucket convention:
To solve this problem you need to find and rename the offending buckets. If there are many such buckets, renaming them manually is not practical; use the find command below instead.
How to find and rename the offending standalone buckets?
find . -regextype posix-extended -regex '^.*db_[0-9]+_[0-9]+_[0-9]+$' -exec mv {} {}_C199873F-6E72-43D8-B54F-554750ACE904 \;
master guid=C199873F-6E72-43D8-B54F-554750ACE904
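To preview which directories would be renamed before actually moving anything, a dry-run variation of the command above (my own addition, run from inside the affected index's db directory) is:

find . -regextype posix-extended -regex '^.*db_[0-9]+_[0-9]+_[0-9]+$' -type d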
How did you manage to find the standalone buckets using that naming convention? Can you please give an example?
Thank you, I had exactly the same issue. During an upgrade, with the Cluster Master in maintenance mode, the affected indexer had an outage at the storage level and was then unable to rejoin the cluster.
I solved it with the proposed steps. I just wanted to add that not all the buckets have to be renamed, only the ones that are replicated (for instance, in our environment the metrics index and some other specific Splunk indexes are not).
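One way to check which indexes are replicated at all (a sketch using btool; only index stanzas with repFactor = auto participate in clustering) is:

$SPLUNK_HOME/bin/splunk btool indexes list | grep -E '^\[|repFactor'

This prints each index stanza header followed by its repFactor setting, so you can skip the non-replicated ones.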
One correction, per Splunk docs (and observation): the GUID is the GUID of the local indexer:
How the indexer stores indexes - Splunk Documentation
Look at the bucket naming convention section
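If you are unsure of the local indexer's GUID, it can be read from the peer's own instance.cfg (assuming a default installation layout):

grep guid $SPLUNK_HOME/etc/instance.cfg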
Thanks! It helped me recover 2 failed nodes in my cluster.
Thanks, Keio! Clarification: in step #2, "Scan through the indexes db-folders" means var/lib/splunk/*/db/, not just var/lib/splunk/defaultdb/db/.
Thanks for the clarification; I have revised the path to the indexes db-folders to $SPLUNK_HOME/var/lib/splunk/*/db/.
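Putting the pieces together, a combined sketch that walks all index db-folders under that path (my own adaptation, not part of the original steps; test on one index first and skip indexes that are not replicated) could look like:

GUID=C199873F-6E72-43D8-B54F-554750ACE904   # use the GUID discussed above for your own peer
find $SPLUNK_HOME/var/lib/splunk/*/db -maxdepth 1 -type d -regextype posix-extended -regex '.*/db_[0-9]+_[0-9]+_[0-9]+$' -exec mv {} {}_$GUID \;

It is safest to run this with splunkd stopped on the peer.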