Splunk Enterprise

Unable to Add Cluster Peer Back to Cluster Master - Error: "Adding a non-standalone bucket as standalone!"

rbal_splunk
Splunk Employee
Splunk Employee

My cluster peers went down after I made some changes to indexes.conf on the master, and now I am unable to add a peer back in our two-peer environment. Our Splunk instances are all 5.0.4 running on RHEL.

In splunkd.log, I see the following types of errors on the cluster master:

12-16-2013 18:25:19.050 -0800 ERROR CMMaster - event=addPeer guid=XXXXXXXX-XXX-XXX-XXX-XXXXXXXXXXXX status=failed err="Adding a non-standalone bucket as standalone!"

12-16-2013 18:25:19.052 -0800 ERROR ClusterMasterPeerHandler - Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: Adding a non-standalone bucket as standalone!)
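
For reference, the full set of failing addPeer events can be pulled out of the master's splunkd.log with something along these lines (a sketch; the log path assumes a default install):

grep "event=addPeer" $SPLUNK_HOME/var/log/splunk/splunkd.log | grep "status=failed"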

We stopped the cluster master and both peers, removed the bucket being complained about, then brought up both peers and the master. After that, the peer which was previously reported as down was listed as up. However, the peer which had been up was now listed as down, and the logs showed similar errors about this peer.

Has anyone else seen this, and what can be done to resolve this error?

1 Solution

jbsplunk
Splunk Employee
Splunk Employee

We saw this problem after editing indexes.conf, which was (by mistake) being done in $SPLUNK_HOME/etc/master-apps/_cluster/default/indexes.conf. Since new indexes were being added, our theory is that something went wrong during this process which caused the peers to believe they weren't members of the cluster and to start indexing buckets as standalone. We saw a number of buckets created without the instance GUID appended to them, which is the naming format of standalone buckets.

To resolve the issue so that we could add the peers, we needed to remove all of the buckets that threw messages identifying them as standalone on each peer.

The messages we used looked like this:

12-16-2013 18:25:19.050 -0800 INFO CMMaster - Adding bid=<index_name>~<bucketid>~<instanceGUID> (status='Complete' search_status='Searchable' mask=xxxxxxxxxxxxxxxxxxxxx checksum= standalone=yes size=1007447 genid=0) to peer=<peerGUID>

We found about 14 buckets across both peers that were created without the GUID and/or were listed as standalone buckets.
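
As a rough cross-check on each peer, standalone buckets can usually be spotted by name alone, since clustered buckets carry the instance GUID as a suffix. Something along these lines can help (the paths and the GUID regex are assumptions based on default bucket naming and a default $SPLUNK_DB of $SPLUNK_HOME/var/lib/splunk, so treat it as a sketch):

# list warm/cold bucket directories whose names do NOT end in a GUID
find $SPLUNK_DB/*/db $SPLUNK_DB/*/colddb -maxdepth 1 -type d -name 'db_*' \
  | grep -Ev '_[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$'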

----Steps Taken to restore the peers to cluster----

  1. Stopped the cluster master and both peers, in that order. It was important that things be done in this order.
  2. Identified the list of buckets on Peer1 and Peer2 that were missing GUIDs and/or otherwise identified as standalone, by reviewing splunkd.log as explained above.
  3. Moved each of these buckets to a /temp folder created for each index, in order to keep track of their source for restoration purposes (see the sketch after this list).
  4. Moved the custom indexes.conf settings that had mistakenly been introduced into the default directory over to $SPLUNK_HOME/etc/master-apps/_cluster/local/indexes.conf.
  5. Restarted each of the peers, then the master, in that order. As in step 1, the order in which instances are restarted is important; for details, see the documentation.
  6. Checked on the master to validate that all peers had successfully been added, and confirmed that our new indexes.conf existed on the peers in $SPLUNK_HOME/etc/slave-apps/_cluster/local/indexes.conf.
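
For a single bucket, step 3 looked roughly like this (all names are placeholders, and $SPLUNK_DB is assumed to be the index store; adjust for your layout):

# on the peer, park the standalone bucket outside the index's db directory
mkdir -p $SPLUNK_DB/<IndexName>/temp
mv $SPLUNK_DB/<IndexName>/db/db_<newest_time>_<oldest_time>_<bucketid> $SPLUNK_DB/<IndexName>/temp/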

At this point, the cluster was back up and working. Now we needed to reintroduce the standalone buckets back into the cluster. Here is the process we followed:

  1. In /temp/ we renamed the buckets, appending the GUID, to force compliance with the naming convention of non-standalone buckets (see the sketch after these steps). This GUID can be found in $SPLUNK_HOME/etc/instance.cfg. Here is an example: mv db_<newest_time>_<oldest_time>_<bucketid> db_<newest_time>_<oldest_time>_<bucketid>_<guid>

  2. Copied the buckets db_<newest_time>_<oldest_time>_<bucketid>_<guid> to the $SPLUNK_DB/IndexName/db folder on the local instance.

  3. Created the replicated bucket by copying db_<newest_time>_<oldest_time>_<bucketid>_<guid> to rb_<newest_time>_<oldest_time>_<bucketid>_<guid>

  4. Copied the buckets rb_<newest_time>_<oldest_time>_<bucketid>_<guid> to the $SPLUNK_DB/IndexName/colddb folder on the remote peer, in the proper index.

  5. Stopped both peers, then the cluster master (again, order matters).

  6. Started both indexer peers, then cluster master.

  7. Verified that the buckets were visible to the cluster master by navigating to the SOS App > Indexing > Index Replication > Cluster Master view. Reviewing 'Bucket information' showed our copied buckets as searchable.
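
Put together, the rename and copy steps above amount to something like this for a single bucket (a sketch only: <guid> is the originating peer's GUID from $SPLUNK_HOME/etc/instance.cfg, and every other name is a placeholder):

# steps 1-2: rename with the GUID suffix and put the db_ copy back on the local peer
cd $SPLUNK_DB/<IndexName>/temp
mv db_<newest_time>_<oldest_time>_<bucketid> db_<newest_time>_<oldest_time>_<bucketid>_<guid>
cp -rp db_<newest_time>_<oldest_time>_<bucketid>_<guid> $SPLUNK_DB/<IndexName>/db/

# steps 3-4: make an rb_ copy and transfer it (scp/rsync/etc.) to the other peer's
# $SPLUNK_DB/<IndexName>/colddb/ directory for that index
cp -rp db_<newest_time>_<oldest_time>_<bucketid>_<guid> rb_<newest_time>_<oldest_time>_<bucketid>_<guid>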


shivank_rana
Engager

I got a similar issue:
04-22-2020 23:08:28.867 +0800 ERROR ClusterMasterPeerHandler - Cannot add peer=xxx.xx.xx.xxx mgmtport=8089 (reason: bucket already added as clustered, peer attempted to add again as standalone. guid=2A40ED04-E90B-4771-BD5A-F523865808B6 bid= ~2~2A40ED04-E90B-4771-BD5A-F523865808B6).
Thanks to the cluster, my data was safe on another peer node and it was still searchable.

  1. Stopped my indexer peer which was misbehaving: "./splunk stop"
  2. Moved the whole index directory to <index>_old (mine was in the default dir, $SPLUNK_HOME/var/lib/splunk/)
  3. Started the indexer again: "./splunk start"
  4. Voila! The error was gone and both peers showed "Up" in the indexer cluster, which is a good sign (durability and searchability were still red)
  5. Finally, started the data rebalance

There is a "data rebalance" feature:
Settings -> Indexer clustering -> Edit -> Data Rebalance -> select the index which you moved (I kept the threshold at 1, and it worked for me) -> Start

My index was re-created and the data came back to the affected peer automatically; everything was green in the UI, availability was 100%, and all data was searchable.
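
If you prefer the CLI, the same rebalance can be started from the cluster master with something like the following (a sketch; the command exists in recent versions that have the data rebalance feature, so check the CLI help for your release):

splunk rebalance cluster-data -action start -index <index_name>
# check progress with:
splunk rebalance cluster-data -action status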

-Rana

ehollima
Path Finder

Have to say Raj Pal ROCKS!

So do Masa and John Welch, in no particular order!!!!!

Arion Holliman - AIG Lead SIEM Engineer

ejenson_splunk
Splunk Employee
Splunk Employee

Adding additional detail: after finding the offending bucket as suggested by Rajpal, we simply did the following.

  1. Ensured our cluster had recovered and was meeting the search and replication factors.
  2. Stopped Splunk on the bad indexer.
  3. Backed up the offending bucket directory.
  4. Deleted the entire bucket directory.
  5. Restarted Splunk.

Since the cluster had essentially recovered, simply removing the directory completely did not cause any issues.
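
In shell terms that sequence is roughly the following (a sketch; the bucket path is a placeholder, and you should only do this once the cluster is meeting its search and replication factors):

$SPLUNK_HOME/bin/splunk stop
cp -rp $SPLUNK_DB/<index>/db/db_<newest>_<oldest>_<id> /some/backup/location/
rm -rf $SPLUNK_DB/<index>/db/db_<newest>_<oldest>_<id>
$SPLUNK_HOME/bin/splunk start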

Eric.

ehollima
Path Finder

We are using version 6.5.2, and I discovered something similar. Our search heads were showing an error: Failed to add peer 'guid=B193F763-99BC-41B5-89D0-CDEF1F1BF36E server name=splunkindexer14 ip=192.168.12.24:8089' to the master.
Error=bucket already added as clustered, peer attempted to add again as standalone. guid=B193F763-99BC-41B5-89D0-CDEF1F1BF36E bid= myindex~38~B193F763-99BC-41B5-89D0-CDEF1F1BF36E

1) To fix this, I needed to find the bucket:

Log into the indexer, become the splunk user, get into the path of the offending bucket, and find it:
$> su - splunk
$> cd /x/x1/db/<index>
$> ls -lh

2) The problem bucket will stand out like a sore thumb and be easy to spot because its name will be shorter than all the others.

Actual standalone bucket that caused the problem:   db_1487878740_1487878740_38

3) Rename it using mv as shown below, adding the instance GUID from the error message:

$> mv db_1487878740_1487878740_38 db_1487878740_1487878740_38_B193F763-99BC-41B5-89D0-CDEF1F1BF36E

4) Now run the listing again (ls | grep ...) and verify you were successful; you should see the renamed bucket:

$> ls | grep db_1487878740_1487878740_38

db_1487878740_1487878740_38_B193F763-99BC-41B5-89D0-CDEF1F1BF36E

5) Reboot the indexer
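
After the restart, one way to confirm the peer was accepted back is to check from the cluster master (not part of the original steps, just a suggestion; substitute your own host and credentials):

$> $SPLUNK_HOME/bin/splunk show cluster-status -auth admin:<password>
$> curl -k -u admin:<password> https://<cluster_master>:8089/services/cluster/master/peers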

Many thanks Rajpal!

shivank_rana
Engager

What if there is more than one folder whose name is shorter than all the others?


chandanghoshCTL
Explorer

These steps worked on version 7.0.2. I am new to Splunk and am implementing it in AWS.
Excellent help: 'The problem bucket will stand out like a sore thumb.'


baselahmad
New Member

This worked perfectly on 7.3.0
Thank you Rajpal!



salem34
Path Finder

I know this is an old post, but I wonder whether it is not simpler to just move the primary (db) buckets and let the master make the copies using the regular replication process, instead of copying the rb buckets manually to the target peers? Or am I missing something?

Cheers
Claudio
