Deployment Architecture

Why am I getting these failed bucket replication errors on each indexer in a cluster?

Explorer

I have two indexers in a cluster with a replication factor of 2 and a search factor of 2. All was fine until a couple of weeks ago, when an error crept in. The problem began before I upgraded the cluster to 6.4.0 on Sunday, and I've been trying to sort it out since. The log snippets that keep repeating in splunkd.log on each indexer are below:

Indexer 1

04-19-2016 12:26:32.815 -0500 INFO  S2SFileReceiver - event=onFileAborted bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E remoteError=false
04-19-2016 12:26:32.815 -0500 INFO  CMReplicationRegistry - Finished replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:32.815 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgt=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failing=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 queued replication error job
04-19-2016 12:26:32.815 -0500 INFO  S2SFileReceiver - event=onFileAborted replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold remoteError=false status='success'
04-19-2016 12:26:32.847 -0500 INFO  CMRepJob - job=CMReplicationErrorJob bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failingGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 srcGuid=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgtGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 succeeded
04-19-2016 12:26:35.428 -0500 INFO  CMSlave - event=addBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 status=NonStreamingTarget ss=Unsearchable mask=0 earliest=0 latest=0 standalone=0
04-19-2016 12:26:35.428 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 addTargetInProgress=true
04-19-2016 12:26:35.428 -0500 INFO  CMReplicationRegistry - Starting replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:35.428 -0500 INFO  S2SFileReceiver - event=onFileOpened replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold path=/splunkdata/var/lib/splunk/mu-syslog/colddb/441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6/rawdata/journal.gz searchable=false
04-19-2016 12:26:35.447 -0500 INFO  CMSlave - addTargetDone bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 status=success addTargetInProgress=false
04-19-2016 12:26:36.750 -0500 INFO  S2SFileReceiver - event=onDoneReceived replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:36.751 -0500 INFO  S2SFileReceiver - event=rename bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 from=/splunkdata/var/lib/splunk/mu-syslog/colddb/441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to=/splunkdata/var/lib/splunk/mu-syslog/colddb/rb_1459818423_1459812147_441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:36.751 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 Transitioning status from=NonStreamingTarget to=Complete for reason="cold success (target)"
04-19-2016 12:26:36.751 -0500 INFO  DatabaseDirectoryManager - addReplicatedBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 dstPath='/splunkdata/var/lib/splunk/mu-syslog/colddb/rb_1459818423_1459812147_441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6'
04-19-2016 12:26:36.751 -0500 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 state=eComplete src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
04-19-2016 12:26:36.751 -0500 WARN  S2SFileReceiver - event=processFileSlice bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 msg='aborting on local error'

Indexer 2

04-19-2016 12:26:30.571 -0500 INFO  CMReplicationRegistry - Finished replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:30.571 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgt=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failing=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 queued replication error job
04-19-2016 12:26:30.580 -0500 INFO  CMRepJob - job=CMReplicationErrorJob bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failingGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 srcGuid=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgtGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 succeeded
04-19-2016 12:26:31.341 -0500 INFO  CMReplicationRegistry - Starting replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.341 -0500 INFO  BucketReplicator - event=asyncReplicateBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090
04-19-2016 12:26:31.341 -0500 INFO  BucketReplicator - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 earliest=1459812147 latest=1459818423 type=3
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Created asyncReplication task to replicate bucket mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090 bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=startBucketReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Starting replication of bucket=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to 128.206.15.196:8090;
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Replicating warm bucket=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 node=guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090 bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=finishBucketReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 [et=1459812147 lt=1459818423 type=3]
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=localReplicationFinished type=cold bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.354 -0500 INFO  BucketReplicator - Connection for idx=xxx.xxx.xx.xxx:8090:mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 successful
04-19-2016 12:26:32.818 -0500 WARN  BucketReplicator - Failed to replicate warm bucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090. Connection closed.

Suggestions please!

Thanks.

Path Finder

I saw a similar issue.
Key errors:

[-failing tgt-]# tail -f /opt/splunk/var/log/splunk/splunkd.log
04-28-2019 17:18:50.639 +0000 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E state=eComplete src=B78B7685-1AEF-477F-B50C-BB65C1633777 bucketType=warm status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
04-28-2019 17:18:50.639 +0000 WARN S2SFileReceiver - event=processFileSlice bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E msg='aborting on local error'
04-28-2019 17:18:50.699 +0000 WARN CMSlave - event=addTargetDone bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E but we no longer have the bucket lets remove it from the master as well
04-28-2019 17:18:50.699 +0000 WARN CMSlave - deleting bucket=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E but failed to delete, reason=unable to find bucket
04-28-2019 17:18:50.882 +0000 INFO CMRepJob - job=CMReplicationErrorJob bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E failingGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 srcGuid=B78B7685-1AEF-477F-B50C-BB65C1633777 tgtGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 succeeded
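
To pull the failing bucket ID and source GUID out of a long splunkd.log, something like the following could help. It is a minimal Python sketch, not part of the original troubleshooting: the names `parse_line` and `replication_errors` are hypothetical, and the line layout is assumed from the excerpts in this thread only.

```python
import re

# Prefix of a splunkd.log line as seen in this thread:
# "MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL Component - message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+) - (?P<message>.*)$"
)
# key=value pairs; a value may be double-quoted, single-quoted, or bare.
KV_RE = re.compile(r"(\w+)=(\"[^\"]*\"|'[^']*'|\S+)")
# Bug references such as "SPL-90606" (a bug number, not the search language).
BUG_RE = re.compile(r"\bSPL-\d+\b")


def parse_line(line):
    """Split one log line into timestamp, level, component, and fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["fields"] = {
        key: value.strip("\"'") for key, value in KV_RE.findall(record["message"])
    }
    bug = BUG_RE.search(record["message"])
    record["bug_id"] = bug.group(0) if bug else None
    return record


def replication_errors(lines):
    """Yield parsed ERROR records logged by S2SFileReceiver."""
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR" and record["component"] == "S2SFileReceiver":
            yield record


# Two lines copied from the excerpt above.
SAMPLE_LINES = [
    '04-28-2019 17:18:50.639 +0000 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E state=eComplete src=B78B7685-1AEF-477F-B50C-BB65C1633777 bucketType=warm status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"',
    "04-28-2019 17:18:50.639 +0000 WARN S2SFileReceiver - event=processFileSlice bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E msg='aborting on local error'",
]

for rec in replication_errors(SAMPLE_LINES):
    print(rec["ts"], rec["fields"]["bid"], rec["fields"]["src"], rec["bug_id"])
```

In practice you would pass an open file handle for splunkd.log instead of `SAMPLE_LINES`; lines that do not match the assumed prefix are skipped.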

[Attempted fixes, from least to most disruptive]
First try:
Enable maintenance mode on the master, restart splunkd on the affected IDXs, then disable maintenance mode.
You can correlate the affected GUIDs by running this on the master:
/opt/splunk/bin/splunk show cluster-status
[It lists the hostname, GUID, and site of each peer]

Second try:
Master > Bucket Status
Resync the failing bucket on a peer other than the failing tgt.

Third try:
Delete the failing bucket on a peer other than the failing tgt.

Fourth try:
Delete a copy [I felt OK deleting a "copy" because the cluster is RF3].
The failing tgt was not offered as an option, so:
~ first I did the failing source [bid GUID];
~ then there was another option for another IDX at another site [srcGuid]. I did that one too, and finally the bucket stopped bouncing and the Bucket Status error/fix-up finished.
In review: I was tailing splunkd.log on all the peers and initially only paid attention to the ERRORs for the [tgtGuid] and [bid GUID]. Looking at it again, there is an INFO log line that identifies the [srcGuid].
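
The INFO line in question is the CMReplicationErrorJob one, which carries srcGuid, tgtGuid, and failingGuid together. A rough Python sketch for tallying those lines per bucket follows. The helper names are hypothetical, and it assumes that the last `~`-separated part of a bid is the GUID of the peer that originally created the bucket.

```python
import re
from collections import Counter

# Bare key=value pairs are enough for the CMRepJob lines shown in this thread.
KV_RE = re.compile(r"(\w+)=(\S+)")


def origin_guid(bid):
    """Return the last '~'-separated part of a bucket id (assumed origin GUID)."""
    return bid.rsplit("~", 1)[-1]


def error_jobs(lines):
    """Yield (bid, srcGuid, tgtGuid, failingGuid) for each CMReplicationErrorJob line."""
    for line in lines:
        if "job=CMReplicationErrorJob" not in line:
            continue
        fields = dict(KV_RE.findall(line))
        yield (
            fields.get("bid"),
            fields.get("srcGuid"),
            fields.get("tgtGuid"),
            fields.get("failingGuid"),
        )


def summarize(lines):
    """Count how often each (bid, src, tgt, failing) combination repeats."""
    return Counter(error_jobs(lines))


# One line from each of the two posts in this thread.
SAMPLE_LINES = [
    "04-19-2016 12:26:32.847 -0500 INFO  CMRepJob - job=CMReplicationErrorJob bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failingGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 srcGuid=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgtGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 succeeded",
    "04-28-2019 17:18:50.882 +0000 INFO CMRepJob - job=CMReplicationErrorJob bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E failingGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 srcGuid=B78B7685-1AEF-477F-B50C-BB65C1633777 tgtGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 succeeded",
]

for (bid, src, tgt, failing), count in summarize(SAMPLE_LINES).items():
    note = " (tgt is the bucket's origin peer)" if tgt == origin_guid(bid) else ""
    print(f"{count}x bid={bid} src={src} tgt={tgt} failing={failing}{note}")
```

A combination that keeps repeating points at the bucket that is bouncing, and the GUIDs can then be matched to hostnames with the show cluster-status output.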

[I have seen this many times when IT pulls the plug on my precious IDX peers, but that's why we have 3:2.] [Usually another restart of the single components in maintenance mode will bring 'em back in the game.]

This was on a 4-peer cluster with RF3/SF2 running 7.0.3.
In my situation, after IT Support applied some updates and rebooted my IDXs, the affected tgt was bouncing in
/opt/splunk/bin/splunk show cluster-status and in Master GUI > Settings > Indexer Clustering, showing:
Searchable NO
Status Stopped

Explorer

I don't exactly recall how I fixed it, but I did manage to clear the errors, so this can be closed now.

SplunkTrust

To close, just accept your answer...

Motivator

Is this still a mystery?

Splunk Employee

@lycollicott -- I asked the search cluster team and they said this post only has a single error referenced and that it's due to the search processing language (SPL), while the rest are info messages. If you are still getting issues you can either comment/elaborate or create a new question.

Motivator

I'm sorry, but that makes no sense. 😕

Splunk Employee

Hey! They clarified further and said, "In the single ERROR message at the end, it contains 'SPL-90606' -- this references a JIRA bug," and suggested a support ticket.

Contributor

Just to clarify: the single ERROR line contains the cause "SPL-90606". This references a bug number in our database.

Path Finder

I'm confused. How can a bucket replication error on an indexer cluster be caused by SPL?