Splunk Enterprise

Cluster Index Bucket Stuck as "In Flight" - Roll, Resync and Delete Fail (Status=PendingDiscard)

richardgosnay
Explorer

Hi,

 

I'm currently running Splunk 7.3.0 and have 32 indexes running in a single cluster with 2 peers.

Indexes are being replicated across both peers.

 

Everything was working fine until we experienced a network blip 12 days ago. Since then I've noticed that the Replication Factor is not being met because some buckets from that time period don't match - roughly 3 buckets on average.

 

I've tried to Roll, Resync and Delete these buckets via the GUI, but each step fails.  When I check splunkd.log, it appears that Splunk keeps retrying these Fix Up tasks automatically, but each attempt fails because it reports that the bucket is still in flight.

04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.619 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.619 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucketExists=1

Because of this, the Generation ID is also increasing quite rapidly.  The status for all the buckets in question is stuck on 'PendingDiscard'.

The same messages are appearing on the second node but with different bucket IDs.  The same IDs keep repeating every few seconds on both peers.
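
One way to see how the cluster master currently views one of these buckets is its buckets REST endpoint. A rough example using a bucket ID from the log above (host and credentials are placeholders, and the exact output fields may vary by version):

# run against the cluster master's management port
curl -k -u admin:changeme "https://localhost:8089/services/cluster/master/buckets/bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863?output_mode=json"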

Should I restart each peer one at a time in the hope that the bucket status is released and the fix-up jobs can run as normal?
Do I need to restart the cluster master?

Any advice is appreciated.

 

Thank you

 


scelikok
SplunkTrust

Since the db copy has the wider time range, it seems safe to delete the rb bucket. It will be re-created on peer restart.
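
A rough sketch of removing the stale rb copy, assuming the peer is taken down first; the path below uses the default _audit homePath and the folder name from the example further down the thread, so adjust both for your environment:

# on the peer that holds the stale rb copy
./splunk stop
# move the rb folder aside rather than deleting it outright, as a cautious variant
mkdir -p /tmp/bucket_backup
mv /opt/splunk/var/lib/splunk/audit/db/rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863 /tmp/bucket_backup/
./splunk start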

If this reply helps you, an upvote and "Accept as Solution" is appreciated.

richardgosnay
Explorer

That worked perfectly, thank you...


richardgosnay
Explorer

Thank you greatly, I will be performing the peer restarts in around 7 hours' time.

 

I'll let you know if it works and upvote accordingly.


scelikok
SplunkTrust

You must be sure that those buckets are not the only copies inside the cluster. Make sure another bucket with the same ID exists (as "rb" or "db") before deleting. Yes, the Cluster Master will replicate them again after you restart the particular peer.
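
Before deleting anything, a quick way to confirm that both copies of a bucket exist is a dbinspect search run from a search head that can see both peers. A rough example using the _audit bucket discussed below (credentials are placeholders, and the bucketId format and field names may vary slightly by version):

# run on a search head (or a peer with distributed search) that can see both indexers
./splunk search '| dbinspect index=_audit | search bucketId="_audit~110~25359C10-2544-436D-893A-657C950D7863" | table splunk_server bucketId state path' -auth admin:changeme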

If this reply helps you, an upvote and "Accept as Solution" is appreciated.

richardgosnay
Explorer

The bucket ID is the same, but the time range at the start of the folder name is not.

Example

Bucket ID: _audit~110~25359C10-2544-436D-893A-657C950D7863
Peer 1 Folder Name: rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863
Peer 2 Folder Name: db_1614134587_1612300619_110_25359C10-2544-436D-893A-657C950D7863

All neighbouring folder names match perfectly; it's only the buckets in question that don't.  If I remove the rb folder, will it get re-created to match the correct db equivalent?
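
For reference, clustered bucket folders are named db|rb_<newestTimeEpoch>_<oldestTimeEpoch>_<localId>_<originGuid>, so in this example both copies start at 1612300619 but end at different times (1614134403 vs 1614134587). A quick side-by-side check of the two copies might look like this; peer hostnames and the path are placeholders:

# run from an admin host with SSH access to both peers
for peer in idx-peer1 idx-peer2; do
  ssh "$peer" 'ls -d /opt/splunk/var/lib/splunk/audit/db/*_110_25359C10-2544-436D-893A-657C950D7863'
done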


richardgosnay
Explorer

It seems the only buckets affected are the replicated ones (rb instead of db).

 

If I manually remove these before restarting the Cluster Master (and peers), will they just be re-created?


scelikok
SplunkTrust

Hi @richardgosnay,

You should rename those buckets by removing the inflight- prefix from the front of the bucket folder name.

1- Put the CM in maintenance mode.
2- Issue ./splunk offline on the IDX where the inflight bucket is located.
3- Rename the inflight bucket to a normal bucket.
4- Bring the IDX back up with ./splunk start.
5- Remove the maintenance mode on the CM.
6- After this, Splunk will replicate the bucket and move it to the coldPath.

If this reply helps you, an upvote and "Accept as Solution" is appreciated.
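
The steps above might translate roughly to the commands below; the index path, bucket folder name, and hosts are placeholders, so treat this as a sketch rather than an exact procedure.

# on the cluster master
./splunk enable maintenance-mode

# on the indexer peer that holds the inflight bucket
./splunk offline
# rename the bucket folder under the index's homePath, dropping the inflight- prefix
cd /opt/splunk/var/lib/splunk/<index_name>/db
mv inflight-db_<newest>_<oldest>_<localId>_<guid> db_<newest>_<oldest>_<localId>_<guid>
./splunk start

# back on the cluster master
./splunk disable maintenance-mode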

richardgosnay
Explorer

If I manually locate the buckets in question, they don't have inflight- in the folder name; they appear as normal buckets.  But every time I try to run a fix-up task like Roll, Resync or Delete, the log file states the bucket is in flight (see the log snippet above).

Should I try running the fix up tasks in maintenance mode?


aasabatini
Motivator

Hi @richardgosnay 

Are you receiving any errors from the Splunk platform?

If yes, can you show me the error?

Last question: are you sure the bucket IDs are not duplicated?

 

“The answer is out there, Neo, and it’s looking for you, and it will find you if you want it to.”

richardgosnay
Explorer

The only errors in Splunk are the same as the ones in splunkd.log; you can see the snippet in the original post.
