We are having issues with bucket replication, as described in SPL-52901, which prevents the indexer from reconnecting to the cluster master.
Does anyone have a workaround until we can upgrade from 6.1.3 to 6.1.4?
Index replication issues
Disabling clustering on a peer node and then attempting to re-enable it later causes hot buckets to be handled incorrectly, with the consequence that the peer cannot be added back into the cluster. This scenario occurs when you take an existing peer node and disable clustering on it (turning it into a standalone indexer), and then you subsequently re-enable clustering to turn it back into a peer on its original cluster. In this situation, any hot buckets that were created on the peer but not rolled when clustering was still enabled will get rolled after you disable clustering and restart the indexer. At that point, they get marked as standalone buckets, since the indexer is no longer a peer. Those buckets, however, also exist on the remaining cluster as replicated buckets, since they were streamed to other peers while the indexer in question was still a peer. If you then re-enable clustering on the peer and restart it, the bucket conflict causes the peer to fail to register with the master. (SPL-52901)
I haven't seen or tried this, but it seems that a workaround would be to locate the conflicting buckets on the cluster, delete them, and then re-add the peer to the cluster.
OK, so my boss has explained it to me now.
Buckets with a db_ prefix and no peer GUID suffix are created on a single-site indexer (no clustering).
Buckets with a db_ prefix and a peer GUID suffix are created on a clustered indexer with a replication factor of 1.
Buckets with an rb_ prefix and a peer GUID suffix are created on a clustered indexer with a replication factor of 1 or more (a rough sketch of these naming rules follows below).
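Putting that into something concrete, here is a small sketch of how I'm reading those directory names. The split into newest/oldest epoch, local bucket id and optional peer GUID is my assumption about the layout, so sanity-check it against your own directories before relying on it.

# Sketch only: classify a bucket directory name using the rules above.
# Assumed layout: (db|rb)_<newestTime>_<oldestTime>_<localId>[_<peerGUID>]
import re

BUCKET_RE = re.compile(
    r"^(?P<prefix>db|rb)_\d+_\d+_(?P<local_id>\d+)"
    r"(?:_(?P<guid>[0-9A-Fa-f-]{36}))?$"
)

def classify(dirname):
    m = BUCKET_RE.match(dirname)
    if m is None:
        return "not a bucket directory name I recognise"
    if m.group("prefix") == "db" and m.group("guid") is None:
        return "db_ with no peer GUID: standalone bucket (no clustering)"
    if m.group("prefix") == "db":
        return "db_ with peer GUID: created on a clustered indexer (origin copy)"
    return "rb_ with peer GUID: replicated copy streamed from another peer"

for name in ("db_1393537610_1315806661_1019",
             "db_1393537610_1315806661_1019_63E6F99E-906C-4AB3-BD04-72C16B9816BE",
             "rb_1393537610_1315806661_1019_63E6F99E-906C-4AB3-BD04-72C16B9816BE"):
    print(name, "->", classify(name))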
So we've removed the buckets referenced as invalid in splunkd.log, and the peer can reconnect now.
rb_ is specifically a replicated bucket, meaning it is there because another peer replicated it to this peer. db_ buckets are the primary buckets when the repfactor is greater than 1.
Hi Gerald, thanks for the pointers. We did attempt this over the weekend, but, whether due to caffeine deprivation or just too much time staring at the bucket directories, I can't find the bucket that is causing the issue.
My error looks like this:
10-11-2014 13:40:09.391 +1100 ERROR ClusterMasterPeerHandler - Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 (reason: standalone bid=os_secure~1019~63E6F99E-906C-4AB3-BD04-72C16B9816BE is in an invalid state on peer=63E6F99E-906C-4AB3-BD04-72C16B9816BE bf: mask=0x6 status=Complete searchstate=Unsearchable cksum= cksumstate=StableCksum). Make sure pass4SymmKey is matching if the peer is running well.
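For what it's worth, this is how I'm pulling the pieces out of the bid in that message; the index~localId~originGUID reading is my assumption from the message format, so treat the sketch as illustrative only.

# Sketch only: extract the bid fields from the ClusterMasterPeerHandler error.
import re

ERROR_LINE = ("ERROR ClusterMasterPeerHandler - Cannot add peer=xxx.xxx.xxx.xxx mgmtport=8089 "
              "(reason: standalone bid=os_secure~1019~63E6F99E-906C-4AB3-BD04-72C16B9816BE "
              "is in an invalid state on peer=63E6F99E-906C-4AB3-BD04-72C16B9816BE)")

BID_RE = re.compile(r"bid=(?P<index>[^~\s]+)~(?P<local_id>\d+)~(?P<guid>[0-9A-Fa-f-]{36})")

m = BID_RE.search(ERROR_LINE)
if m:
    print("index       =", m.group("index"))     # os_secure
    print("local id    =", m.group("local_id"))  # 1019
    print("origin GUID =", m.group("guid"))      # 63E6F99E-906C-4AB3-BD04-72C16B9816BE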
So if I understand correctly, I should be looking in:
/opt/splunk/var/lib/splunk/volume_cold/os_secure/db_Timestamp_Timestamp_1019_63E6F99E_906C_4AB3_BD04_72C16B9816BE/
(or volume_hotwarm)
In volume_cold/os_secure I have the following:
drwx--x--x 3 splunk splunk 12288 Oct 9 14:48 db_1393537610_1315806661_1019
Would that be the same bucket even if it doesn't have the peer GUID reference?
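In the meantime, here is roughly what I'm using to look for anything with local id 1019 under either volume, with or without the GUID suffix. The paths and the assumption that the third numeric field in the name is the local bucket id are mine, so please treat it as a starting point only.

# Rough sketch: list bucket directories whose local id matches the one from the error.
import os
import re

LOCAL_ID = "1019"  # from bid=os_secure~1019~63E6F99E-906C-4AB3-BD04-72C16B9816BE
SEARCH_ROOTS = [
    "/opt/splunk/var/lib/splunk/volume_cold/os_secure",
    "/opt/splunk/var/lib/splunk/volume_hotwarm/os_secure",
]

BUCKET_RE = re.compile(r"^(db|rb)_\d+_\d+_(?P<local_id>\d+)(?:_(?P<guid>[0-9A-Fa-f-]{36}))?$")

for root in SEARCH_ROOTS:
    if not os.path.isdir(root):
        continue
    for name in sorted(os.listdir(root)):
        m = BUCKET_RE.match(name)
        if m and m.group("local_id") == LOCAL_ID:
            kind = "has peer GUID suffix" if m.group("guid") else "no peer GUID suffix"
            print(os.path.join(root, name), "->", kind)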