Hi,
I'm currently running Splunk 7.3.0 and have 32 indexes running in a single cluster with 2 peers.
Indexes are being replicated across both peers.
Everything was working fine until we experienced a network blip 12 days ago. Since then I've noticed that the Replication Factor is not being met, because a handful of buckets from that time period (about 3 on average) don't match across the peers.
I've tried to Roll, Resync and Delete these buckets via the GUI, but each step fails. When I check splunkd.log, it looks as if Splunk is automatically retrying these Fix Up tasks, but it keeps reporting that the bucket is still in flight, so the tasks can't complete.
04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.619 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.619 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
Because of this, the Generation ID is also increasing quite rapidly. The status for all the buckets in question is stuck on 'PendingDiscard'.
The same messages are appearing on the second node but with different bucket IDs, and the same IDs keep repeating every few seconds on both peers.
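A minimal way to confirm the loop is to grep splunkd.log on each peer for one of the affected bucket IDs and, assuming the default management port 8089 and that the cluster master still exposes the usual cluster/master/buckets REST endpoint on 7.x, ask the master for that bucket's state (the hostname and paths below are placeholders):

grep "bucket already in flight" $SPLUNK_HOME/var/log/splunk/splunkd.log | grep "bel1_qa_apps~19028" | tail -5
curl -k -u admin https://<cluster-master>:8089/services/cluster/master/buckets/bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863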
Should I restart each peer one at a time in the hope that the bucket status is released and the fix-up jobs can run as normal?
Do I need to restart the cluster master?
Any advice is appreciated.
Thank you
You must be sure that those buckets are not the only copies inside the cluster. Make sure there is another copy of the same bucket (an "rb" or "db" folder with the same bucket ID) before deleting. Yes, the Cluster Master will have them replicated again after you restart the particular peer.
Since the db-named copy has the wider time range, it seems safe to delete the rb bucket. It will be re-created on peer restart.
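A minimal sketch of that procedure, assuming $SPLUNK_HOME is your install directory and <homePath> stands for whatever indexes.conf defines for the affected index on that peer (move the folder aside rather than deleting it outright, so you can roll back):

# on the affected peer: stop Splunk before touching anything on disk
$SPLUNK_HOME/bin/splunk stop
# (optionally enable maintenance mode on the cluster master first; commands further down the thread)
mv <homePath>/rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863 /path/to/backup/
# bring the peer back and let the cluster master schedule replication again
$SPLUNK_HOME/bin/splunk start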
That worked perfectly, thank you...
Thank you greatly, I will be performing the peer restarts in around 7 hours' time.
I'll let you know if it works and upvote accordingly.
The bucket ID is the same on both peers, but the time range at the start of the folder name is not.
Example
Bucket ID: _audit~110~25359C10-2544-436D-893A-657C950D7863
Peer 1 Folder Name: rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863
Peer 2 Folder Name: db_1614134587_1612300619_110_25359C10-2544-436D-893A-657C950D7863
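For anyone comparing the two, the clustered bucket folder name breaks down as <db|rb>_<latest event time (epoch)>_<earliest event time (epoch)>_<local bucket ID>_<originating peer GUID>, so both copies here share local ID 110, the origin GUID and the earliest time (1612300619); only the latest times differ (1614134403 vs 1614134587).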
All neighbouring folder names match perfectly; it's only the buckets in question that don't. If I remove the rb folder, will it get re-created to match its db equivalent?
It seems the only buckets affected are the replicated copies (rb instead of db).
If I manually remove these before restarting the Cluster Master (and peers), will they just be re-created?
Hi @richardgosnay,
You should rename those buckets by removing the inflight- prefix from the front of the bucket folder name.
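In case that prefix does exist on disk, a minimal sketch for finding and renaming such folders on a peer (with the peer stopped; <homePath> is a placeholder for the index's homePath from indexes.conf, and the folder name below is just the example from this thread):

$SPLUNK_HOME/bin/splunk stop
# list any copies left over from an interrupted replication
find <homePath> -maxdepth 1 -type d -name 'inflight-*'
# strip the prefix from one such folder
mv <homePath>/inflight-rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863 <homePath>/rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863
$SPLUNK_HOME/bin/splunk start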
If I manually locate the buckets in question, they don't have inflight- in the folder name; they appear as normal buckets. But every time I try to run a fix-up task like Roll, Resync or Delete, the log file states the bucket is in flight (see the log snippet above).
Should I try running the fix-up tasks in maintenance mode?
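If it helps, maintenance mode is toggled on the cluster master; a minimal sketch (it pauses most bucket fix-up activity while enabled, so remember to turn it off afterwards):

# on the cluster master
$SPLUNK_HOME/bin/splunk enable maintenance-mode
$SPLUNK_HOME/bin/splunk show maintenance-mode
# ... retry the Roll/Resync/Delete actions from the master's bucket status dashboard ...
$SPLUNK_HOME/bin/splunk disable maintenance-mode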
Are you seeing any errors from the Splunk platform?
If yes, can you show me the error?
Last question: are you sure the bucket IDs are not duplicated?
The only errors in Splunk are the same as the ones in splunkd.log; you can see the snippet in the original post.