Splunk Enterprise

Cluster Index Bucket Stuck as "In Flight" - Roll, Resync and Delete Fails (Status=PendingDiscard)

richardgosnay
Explorer

Hi,

 

I'm currently running Splunk 7.3.0 and have 32 indexes running in a single cluster with 2 peers.

Indexes are being replicated across both peers.

 

Everything was working fine until we experienced a network blip 12 days ago. Since then, I've noticed that the Replication Factor is not being met because some buckets from that time period don't match, about 3 buckets on average.

 

I've tried to Roll, Resync and Delete these buckets via the GUI, but each step fails.  When I check splunkd.log, it appears as if Splunk is automatically trying to recover with these Fix Up tasks, but it keeps reporting that the bucket is still in flight, so the tasks can't complete.

04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.619 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.619 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
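To see at a glance which buckets keep failing, the "bucket already in flight" errors can be tallied per bucket ID. The following is only a sketch in Python: the regular expression is inferred from the log lines above rather than from any documented splunkd.log format, and the function name `count_in_flight_errors` is made up for illustration.

```python
import re
from collections import Counter

# Bucket ID as it appears in the snippet: bid=<index>~<local id>~<GUID>.
# This pattern is an assumption derived from the lines above.
BID_RE = re.compile(r"bid=(?P<bid>[^\s~]+~\d+~[0-9A-Fa-f-]+)")


def count_in_flight_errors(lines):
    """Count ERROR lines reporting 'bucket already in flight', per bucket ID."""
    counts = Counter()
    for line in lines:
        if " ERROR " not in line or "bucket already in flight" not in line:
            continue
        match = BID_RE.search(line)
        if match:
            counts[match.group("bid")] += 1
    return counts


# Four lines copied from the snippet above (one WARN, three ERROR).
SAMPLE = [
    "04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight",
    "04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight",
    "04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'",
    "04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight",
]

counts = count_in_flight_errors(SAMPLE)
for bid, n in counts.most_common():
    print(n, bid)
```

Feeding the function the lines of the real splunkd.log from each peer would show whether the same handful of bucket IDs accounts for all of the errors.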

Because of this, the Generation ID is also increasing quite rapidly.  The status for all the buckets in question is stuck on 'PendingDiscard'.

The same messages are appearing on the second node but with different bucket IDs.  The same IDs keep repeating every few seconds on both peers.

Should I restart each peer one at a time in the hope that the bucket status is released and the fix up jobs can run as normal?
Do I need to restart the cluster master?

Any advice is appreciated.

 

Thank you

 


scelikok
Champion

You must make sure that those buckets are not the only copies inside the cluster. Before deleting, check that there is another copy of the bucket with the same name, as "rb" or "db". Yes, the Cluster Master will replicate them again after you restart that particular peer.

If this reply helps you an upvote is appreciated.


scelikok
Champion

Since the db copy has the wider time range, it seems safe to delete the rb bucket. It will be re-created on peer restart.

If this reply helps you an upvote is appreciated.

richardgosnay
Explorer

That worked perfectly, thank you...


richardgosnay
Explorer

Thank you greatly, I will be performing the peer restarts in around 7 hours' time.

 

I'll let you know if it works and upvote accordingly.


richardgosnay
Explorer

The bucket ID is the same, but the range at the beginning is not.

Example

Bucket ID: _audit~110~25359C10-2544-436D-893A-657C950D7863
Peer 1 Folder Name: rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863
Peer 2 Folder Name: db_1614134587_1612300619_110_25359C10-2544-436D-893A-657C950D7863

All neighbouring folder names match perfectly; it's only the buckets in question that don't match.  If I remove the RB folder, will it get re-created with the correct DB equivalent?
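The mismatch can be made explicit by splitting the two folder names into their fields. The sketch below is in Python and assumes the usual clustered bucket directory layout of `<db|rb>_<newest time>_<oldest time>_<local id>_<GUID>`; the `BucketDir` type and `parse_bucket_dir` helper are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class BucketDir(NamedTuple):
    kind: str      # "db" (originating copy) or "rb" (replicated copy)
    latest: int    # newest event time, epoch seconds
    earliest: int  # oldest event time, epoch seconds
    local_id: int
    guid: str


def parse_bucket_dir(name: str) -> BucketDir:
    """Split a bucket folder name into its fields (assumed layout, see above)."""
    kind, latest, earliest, local_id, guid = name.split("_", 4)
    if kind not in ("db", "rb"):
        raise ValueError(f"unexpected bucket prefix: {kind!r}")
    return BucketDir(kind, int(latest), int(earliest), int(local_id), guid)


# The two folder names from the example above.
peer1 = parse_bucket_dir("rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863")
peer2 = parse_bucket_dir("db_1614134587_1612300619_110_25359C10-2544-436D-893A-657C950D7863")

# Same bucket if local id and GUID agree; only the time range may differ.
same_bucket = (peer1.local_id, peer1.guid) == (peer2.local_id, peer2.guid)
latest_gap = peer2.latest - peer1.latest

print("same bucket:", same_bucket)
print("earliest times equal:", peer1.earliest == peer2.earliest)
print("db copy is newer by (seconds):", latest_gap)
```

For this pair, the local ID, GUID and earliest time agree, and the db copy's newest time is 184 seconds later than the rb copy's.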


richardgosnay
Explorer

It seems the only buckets affected are the replicated ones (rb instead of db).

 

If I manually remove these before restarting the Cluster Master (and peers), will they just be re-created?


scelikok
Champion

Hi @richardgosnay,

You should rename those buckets by removing the inflight- prefix from the bucket name.

1- Put the CM in maintenance mode.
2- Issue ./splunk offline on the IDX where the inflight bucket is located.
3- Rename the inflight bucket to a normal bucket name.
4- Bring the IDX back up with ./splunk start.
5- Remove the maintenance mode on the CM.
6- After this, Splunk will replicate the bucket and move it to the coldPath.
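The steps above can be sketched as a dry-run plan. This Python snippet only prints the commands, it does not run anything. The `enable maintenance-mode` and `disable maintenance-mode` commands are my reading of what steps 1 and 5 refer to, and the inflight- folder name in the example is hypothetical (a name from this thread with the prefix added for illustration).

```python
from typing import List, Tuple

INFLIGHT_PREFIX = "inflight-"


def strip_inflight_prefix(dir_name: str) -> str:
    """Return the bucket folder name without a leading 'inflight-'."""
    if dir_name.startswith(INFLIGHT_PREFIX):
        return dir_name[len(INFLIGHT_PREFIX):]
    return dir_name


def build_plan(bucket_dir: str) -> List[Tuple[str, str]]:
    """Return (where to run, command) pairs for steps 1 to 5; nothing is executed."""
    target = strip_inflight_prefix(bucket_dir)
    if target == bucket_dir:
        # Without the prefix there is nothing to rename, so the plan does not apply.
        raise ValueError("bucket folder has no inflight- prefix")
    return [
        ("CM", "./splunk enable maintenance-mode"),   # assumed command for step 1
        ("IDX", "./splunk offline"),
        ("IDX", f"mv {bucket_dir} {target}"),
        ("IDX", "./splunk start"),
        ("CM", "./splunk disable maintenance-mode"),  # assumed command for step 5
    ]


# Hypothetical folder name, for illustration only.
plan = build_plan("inflight-rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863")
for where, command in plan:
    print(f"[{where}] {command}")
```

Raising an error when the folder has no inflight- prefix is a deliberate choice: in that case the rename in step 3 does nothing and a different approach is needed.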
If this reply helps you an upvote is appreciated.

richardgosnay
Explorer

If I manually locate the buckets in question, they don't have inflight- in the filename; they appear as normal buckets.  But every time I try to run a fix up task like Roll, Resync or Delete, the log file states the bucket is in flight (see previous log snippet).

Should I try running the fix up tasks in maintenance mode?


aasabatini
Builder

Hi @richardgosnay 

Do you receive any errors from the Splunk platform?

If yes, can you show me the error?

Last question: are you sure the bucket IDs are not duplicated?

 


richardgosnay
Explorer

The only errors in Splunk are the same as the ones in the splunkd.log file; you can see the snippet in the original post.
