Indexer fails to join back cluster due to standalone buckets

keio_splunk
Splunk Employee

An indexer in the cluster was abruptly shut down and subsequently failed to rejoin the cluster. Please help provide the steps to clean up the standalone buckets so that the indexer can rejoin the cluster.

Warning message in splunkd.log:
xx-xx-xxxx xx:xx:xx.xxx -0500 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=xxx.xxx.xxx:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: bucket already added as clustered, peer attempted to add again as standalone. guid=C199873F-6E72-43D8-B54F-554750ACE904 bid=mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=403F2E7869E35F5BB8C945D993035AA2 add_type=Initial-Add base_generation_id=0 batch_serialno=7 batch_size=18 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=403F2E7869E35F5BB8C945D993035AA2 mgmt_port=8089 name=C199873F-6E72-43D8-B54F-554750ACE904 register_forwarder_address= register_replication_address= register_search_address= replication_port=8003 replication_use_ssl=0 replications= server_name=xxx.xxx.xxx site=default splunk_version=7.2.0 splunkd_build_number=8c86330ac18 status=Up } ].
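
The bid value in the error appears to identify the offending bucket in the clustered id format <index>~<bucketid>~<guid>, so bid=mi_batch~314~C199873F-6E72-43D8-B54F-554750ACE904 points at bucket 314 of the mi_batch index. To surface the most recent rejections on the affected peer, a minimal grep sketch (assuming $SPLUNK_HOME is set and the default log location):

    # Show the latest cluster-registration failures on this indexer
    grep "Failed to register with cluster master" \
        "$SPLUNK_HOME"/var/log/splunk/splunkd.log | tail -n 5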

keio_splunk
Splunk Employee

When the indexer is disabled as a search peer, its hot buckets are rolled to warm using the standalone bucket naming convention. When the peer is subsequently re-enabled, the cluster master still remembers those buckets as clustered and expects them to follow the clustered bucket naming convention; since they do not, it rejects the peer's request to rejoin the cluster. More details in Unable to disable and re-enable a peer.

Here are the steps to rename the standalone buckets to the clustered bucket convention:

  1. Search for the offending standalone buckets in the bucket directory (default location: $SPLUNK_HOME/var/lib/splunk/*/db/).
  2. Scan through the indexes db-folders to find the standalone buckets. Problematic standalone buckets use the naming convention db_<newest_time>_<oldest_time>_<bucketid>, e.g. db_1550812574_1550720467_53.
  3. Append the cluster master GUID to each standalone bucket: rename db_<newest_time>_<oldest_time>_<bucketid> to db_<newest_time>_<oldest_time>_<bucketid>_<guid>, e.g. db_1550812574_1550720467_53_C199873F-6E72-43D8-B54F-554750ACE904 (here guid=C199873F-6E72-43D8-B54F-554750ACE904, taken from the splunkd.log error). A scripted version of this rename is sketched after this list.
  4. Restart the indexer and it will rejoin the cluster.
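
If many buckets are affected, step 3 can be scripted. This is a minimal sketch, not a vetted tool: it assumes a bash shell with GNU find, that $SPLUNK_HOME is set, that splunkd is stopped while the renames run, and that GUID holds the guid value reported in the splunkd.log error. Comment out the mv line for a dry run, and test on a single bucket first.

    # GUID reported in the splunkd.log rejection (example value from above)
    GUID="C199873F-6E72-43D8-B54F-554750ACE904"

    # Standalone buckets look like db_<newest>_<oldest>_<id> with nothing
    # after the numeric bucket id; clustered buckets already end in _<guid>.
    # find -regex matches the full path, so only standalone buckets match.
    find "$SPLUNK_HOME"/var/lib/splunk/*/db -maxdepth 1 -type d \
         -regextype posix-extended -regex '.*/db_[0-9]+_[0-9]+_[0-9]+' |
    while read -r bucket; do
        echo "renaming: $bucket -> ${bucket}_${GUID}"
        mv "$bucket" "${bucket}_${GUID}"
    done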

rwsisson
Explorer

One correction: per Splunk docs (and observation), the GUID is the GUID of the local indexer:

How the indexer stores indexes - Splunk Documentation

Look at the bucket naming convention section

  • <guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.
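
For reference, a quick way to read that GUID on the peer itself (assuming $SPLUNK_HOME is set):

    # Print the local indexer's GUID from instance.cfg
    grep guid "$SPLUNK_HOME"/etc/instance.cfg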

cfcvendorsuppor
Explorer

Thanks! It helped me recover 2 failed nodes in my cluster.


esalesapns2
Path Finder

Thanks, Keio! Clarification: in step #2, "Scan through the indexes db-folders" means var/lib/splunk/*/db/, not just var/lib/splunk/defaultdb/db/.
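
To check whether any index (not only defaultdb) still holds standalone buckets, something like this one-liner works, assuming clustered bucket names end in the 36-character GUID (hex digits and hyphens):

    # List warm buckets in every index's db folder, keeping only those
    # without a trailing _<guid>
    ls -d "$SPLUNK_HOME"/var/lib/splunk/*/db/db_* | grep -Ev '_[0-9A-Fa-f-]{36}$'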


keio_splunk
Splunk Employee

Thanks for the clarification; I have revised the path to the indexes db-folders to $SPLUNK_HOME/var/lib/splunk/*/db/.
