Deployment Architecture

Why am I getting these failed bucket replication errors on each indexer in a cluster?

klutzen
Explorer

I have two indexers in a 2:2 configuration for replication/search factor. All was fine until a couple of weeks ago, when an error crept in. The problem began before I upgraded the cluster to 6.4.0 on Sunday, and I've been trying to sort it out since. The log snippets that keep repeating in splunkd.log on each indexer are below:

Indexer 1

04-19-2016 12:26:32.815 -0500 INFO  S2SFileReceiver - event=onFileAborted bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E remoteError=false
04-19-2016 12:26:32.815 -0500 INFO  CMReplicationRegistry - Finished replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:32.815 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgt=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failing=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 queued replication error job
04-19-2016 12:26:32.815 -0500 INFO  S2SFileReceiver - event=onFileAborted replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold remoteError=false status='success'
04-19-2016 12:26:32.847 -0500 INFO  CMRepJob - job=CMReplicationErrorJob bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failingGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 srcGuid=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgtGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 succeeded
04-19-2016 12:26:35.428 -0500 INFO  CMSlave - event=addBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 status=NonStreamingTarget ss=Unsearchable mask=0 earliest=0 latest=0 standalone=0
04-19-2016 12:26:35.428 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 addTargetInProgress=true
04-19-2016 12:26:35.428 -0500 INFO  CMReplicationRegistry - Starting replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:35.428 -0500 INFO  S2SFileReceiver - event=onFileOpened replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold path=/splunkdata/var/lib/splunk/mu-syslog/colddb/441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6/rawdata/journal.gz searchable=false
04-19-2016 12:26:35.447 -0500 INFO  CMSlave - addTargetDone bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 status=success addTargetInProgress=false
04-19-2016 12:26:36.750 -0500 INFO  S2SFileReceiver - event=onDoneReceived replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:36.751 -0500 INFO  S2SFileReceiver - event=rename bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 from=/splunkdata/var/lib/splunk/mu-syslog/colddb/441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to=/splunkdata/var/lib/splunk/mu-syslog/colddb/rb_1459818423_1459812147_441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:36.751 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 Transitioning status from=NonStreamingTarget to=Complete for reason="cold success (target)"
04-19-2016 12:26:36.751 -0500 INFO  DatabaseDirectoryManager - addReplicatedBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 dstPath='/splunkdata/var/lib/splunk/mu-syslog/colddb/rb_1459818423_1459812147_441_BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6'
04-19-2016 12:26:36.751 -0500 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 state=eComplete src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E bucketType=cold status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
04-19-2016 12:26:36.751 -0500 WARN  S2SFileReceiver - event=processFileSlice bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 msg='aborting on local error'

Indexer 2

04-19-2016 12:26:30.571 -0500 INFO  CMReplicationRegistry - Finished replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:30.571 -0500 INFO  CMSlave - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgt=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failing=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 queued replication error job
04-19-2016 12:26:30.580 -0500 INFO  CMRepJob - job=CMReplicationErrorJob bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 failingGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 srcGuid=E6B3EBCE-6024-4A1E-9CC6-3237336E287E tgtGuid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 succeeded
04-19-2016 12:26:31.341 -0500 INFO  CMReplicationRegistry - Starting replication: bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 src=E6B3EBCE-6024-4A1E-9CC6-3237336E287E target=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.341 -0500 INFO  BucketReplicator - event=asyncReplicateBucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090
04-19-2016 12:26:31.341 -0500 INFO  BucketReplicator - bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 earliest=1459812147 latest=1459818423 type=3
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Created asyncReplication task to replicate bucket mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090 bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=startBucketReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Starting replication of bucket=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to xxx.xxx.xx.xxx:8090;
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - Replicating warm bucket=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 node=guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090 bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=finishBucketReplication bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 [et=1459812147 lt=1459818423 type=3]
04-19-2016 12:26:31.342 -0500 INFO  BucketReplicator - event=localReplicationFinished type=cold bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6
04-19-2016 12:26:31.354 -0500 INFO  BucketReplicator - Connection for idx=xxx.xxx.xx.xxx:8090:mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 successful
04-19-2016 12:26:32.818 -0500 WARN  BucketReplicator - Failed to replicate warm bucket bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 to guid=BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 host=xxx.xxx.xx.xxx s2sport=8090. Connection closed.
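A quick way to gauge how often these failures recur per bucket is to count the ERROR lines in splunkd.log. This is just a sketch: two ERROR lines quoted from this thread stand in for the live log, and in practice you would point the first grep at /opt/splunk/var/log/splunk/splunkd.log instead of the here-document.

```shell
# Count replication-failure events per bucket ID (bid).
# The two sample lines below are quoted from this thread; against a live
# system, replace the here-document with the path to splunkd.log.
grep 'ERROR S2SFileReceiver' <<'EOF' | grep -o 'bid=[^ ]*' | sort | uniq -c | sort -rn
04-19-2016 12:26:36.751 -0500 ERROR S2SFileReceiver - event=onFileClosed bid=mu-syslog~441~BBF7C0FC-BC6B-48FE-8E54-DD93348F29F6 status=failed
04-28-2019 17:18:50.639 +0000 ERROR S2SFileReceiver - event=onFileClosed bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E status=failed
EOF
```

A single bid dominating the counts (as in the logs above) points at one stuck bucket rather than a cluster-wide replication problem.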

Suggestions please!

Thanks.

GDustin
Path Finder

I saw a similar issue. Key errors:

[-failing tgt-]# tail /opt/splunk/var/log/splunk/splunkd.log -f
04-28-2019 17:18:50.639 +0000 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E state=eComplete src=B78B7685-1AEF-477F-B50C-BB65C1633777 bucketType=warm status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
04-28-2019 17:18:50.639 +0000 WARN S2SFileReceiver - event=processFileSlice bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E msg='aborting on local error'
04-28-2019 17:18:50.699 +0000 WARN CMSlave - event=addTargetDone bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E but we no longer have the bucket lets remove it from the master as well
04-28-2019 17:18:50.699 +0000 WARN CMSlave - deleting bucket=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E but failed to delete, reason=unable to find bucket
04-28-2019 17:18:50.882 +0000 INFO CMRepJob - job=CMReplicationErrorJob bid=unix~564~3980C11F-0463-420B-8584-F58CF055EC0E failingGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 srcGuid=B78B7685-1AEF-477F-B50C-BB65C1633777 tgtGuid=14E1138E-A7E7-499E-A7AD-84BC5797B164 succeeded

[Attempting fixes from least to most disruptive.]
First try:
Enable maintenance mode, then restart splunkd on the affected IDXs.
You can correlate the affected GUIDs to hostnames by running this on the master:
/opt/splunk/bin/splunk show cluster-status
[It outputs hostname/GUID/site.]
Then disable maintenance mode.
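The first-try steps above can be sketched as a command sequence. This is a sketch, not a prescription: it assumes a default /opt/splunk install, and the maintenance-mode and cluster-status commands run on the master while the restart runs on each affected peer.

```shell
# On the cluster master: pause bucket fix-up activity while peers restart.
/opt/splunk/bin/splunk enable maintenance-mode

# On the master: list peers with hostname/GUID/site, to map the failing
# GUIDs from the error messages to actual indexers.
/opt/splunk/bin/splunk show cluster-status

# On each affected indexer peer: restart splunkd.
/opt/splunk/bin/splunk restart

# Back on the master: resume normal fix-up once the peers rejoin.
/opt/splunk/bin/splunk disable maintenance-mode
```

Maintenance mode matters here because it stops the master from launching new fix-up (and hence replication) jobs while the peers bounce.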

Second try:
Master > Bucket Status: resync the non-tgt failing bucket.

Third try:
Delete the non-tgt failing bucket.

Fourth try:
Delete a copy. [I felt OK deleting a "copy" since RF=3.]
The failing tgt was not an option, so:
~ First I did the failing source [bid_GUID].
~ Second, there was an option for another IDX at another site; I did that, and finally it stopped bouncing and the bucket status error/fix-up finished [srcGuid].
Reviewing now, tailing splunkd on all the peers: initially I was only paying attention to ERRORs WRT [tgtGuid] and [bid_GUID]; looking at it again in review, there is also an INFO log identifying the [srcGuid].

[I have seen this many times, when IT pulls the plug on my precious IDX peers; but that's why we have 3:2.] [Usually another restart of single components in maintenance mode will bring 'em back in the game.]

This was on a 4-peer cluster with RF=3, SF=2, running 7.0.3.
In my situation the affected tgt was bouncing in
/opt/splunk/bin/splunk show cluster-status and Master GUI > Settings > Indexer Clustering:
Searchable: NO
Status: Stopped
This started after IT support applied some updates and rebooted my IDXs.


klutzen
Explorer

I don't exactly recall how I fixed it, but I did manage to clear the errors. So this can be closed now.



lycollicott
Motivator

Is this still a mystery?


lfedak_splunk
Splunk Employee

@lycollicott -- I asked the search cluster team and they said this post only has a single error referenced and that it's due to the search processing language (SPL), while the rest are info messages. If you are still getting issues you can either comment/elaborate or create a new question.


lycollicott
Motivator

I'm sorry, but that makes no sense. 😕


lfedak_splunk
Splunk Employee

Hey! They clarified further and said "In the single ERROR message at the end, it contains "SPL-90606" -- this references a JIRA bug" and suggested a support ticket.


nnmiller
Contributor

Just to clarify: the single ERROR line contains the cause, "SPL-90606". That references a bug number in our internal database, not the search processing language.


baldwintm
Path Finder

I'm confused. How can a bucket replication error on an indexer cluster be caused by SPL?
