Hi, I'm currently running Splunk 7.3.0 with 32 indexes in a single indexer cluster with 2 peers, and the indexes are replicated across both peers. Everything was working fine until we experienced a network blip 12 days ago. Since then I've noticed that the replication factor is not being met, because a handful of buckets from that time period (about 3 on average) don't match. I've tried to Roll, Resync and Delete these buckets via the GUI, but each step fails.

When I check splunkd.log, it looks as if Splunk is automatically retrying these fix-up tasks, but it keeps reporting that the bucket is already in flight, so it can't:

04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket"
04-06-2021 08:07:39.619 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight
04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket"
04-06-2021 08:07:39.619 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight'
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucketExists=1
04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucketExists=1

Because of this, the generation ID is also increasing quite rapidly. The status of all the buckets in question is stuck on 'PendingDiscard'. The same messages appear on the second peer, but with different bucket IDs, and the same IDs keep repeating every few seconds on both peers.

Should I restart each peer one at a time in the hope that the bucket status is released and the fix-up jobs can run as normal? Do I need to restart the cluster master?
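In case it helps clarify what I'm asking, this is roughly the restart sequence I had in mind, based on my understanding of maintenance mode (please tell me if this is the wrong approach, or if the cluster master itself needs restarting instead):

# On the cluster master: pause bucket fix-up activity before touching the peers
splunk enable maintenance-mode
splunk show maintenance-mode

# On the first peer: take it offline gracefully, then bring it back up
splunk offline
splunk start

# Back on the cluster master: confirm the peer has rejoined before repeating on the second peer
splunk show cluster-status

# Once both peers are back up and searchable: resume normal fix-up activity
splunk disable maintenance-mode

Is that the right order of operations here, or will maintenance mode just leave these buckets stuck in the same 'in flight' state?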
Any advice is appreciated. Thank you