After the Splunk cluster master entered maintenance mode, one of the indexers went offline and then came back online, and maintenance mode was then disabled. The fixup tasks have been stuck for about a week. The number of pending fixup tasks dropped from around 5xx to 102 (after deleting an rb bucket). I assume this is a bucket-syncing issue in the indexer cluster, because the client's server is a bit laggy (network delay, low CPU).
There are 40 fixup tasks in progress and 102 fixup tasks pending on the indexer cluster master.
The internal log shows that all 40 in-progress tasks report the following errors:
Getting size on disk: Unable to get size on disk for bucket id=xxxxxxxxxxxxx path="/splunkdata/windows/db/rb_xxxxxx" (This is usually harmless as we may be racing with a rename in BucketMover or the S2SFileReceiver thread, or merge-buckets command which should be obvious in log file; the previous WARN message about this path can safely be ignored.) caller=serialize_SizeOnDisk
Delete dir exists, or failed to sync search files for bid=xxxxxxxxxxxxxxxxxxx; will build bucket locally. err= Failed to sync search files for bid=xxxxxxxxxxxxxxxxxxx from srcs=xxxxxxxxxxxxxxxxxxxxxxx
CMSlave [6205 CallbackRunnerThread] - searchState transition bid=xxxxxxxxxxxxxxxxxxxxx from=PendingSearchable to=Unsearchable reason='fsck failed: exitCode=24 (procId=1717942)'
Getting size on disk: Unable to get size on disk for bucket id=xxxxxxxxxxxxx path="/splunkdata/windows/db/rb_xxxxxx" (This is usually harmless as we may be racing with a rename in BucketMover or the S2SFileReceiver thread, or merge-buckets command which should be obvious in log file; the previous WARN message about this path can safely be ignored.) caller=serialize_SizeOnDisk
The internal log shows that all 102 pending tasks report the following error:
ERROR TcpInputProc [6291 ReplicationDataReceiverThread] - event=replicationData status=failed err="Could not open file for bid=windows~xxxxxx err="bucket is already registered with this peer" (Success)"
Does anyone know what "fsck failed exit code 24" and "bucket is already registered with this peer" mean? How can these issues be resolved to reduce the number of fixup tasks? Thanks.
An update to this old thread, since it took time to get approval to perform a restart for my client. I fixed the issue by performing a bundle restart on the Splunk cluster master. I also increased the "max_peer_build_load" and "max_peer_rep_load" values in server.conf so that the existing bucket fixup tasks would clear more quickly. Still not sure what "fsck failed exit code 24" means, though; probably just network delay or low CPU.
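For reference, this is roughly what I changed in $SPLUNK_HOME/etc/system/local/server.conf on the cluster master. The values 8 and 10 are just what I picked for this environment, not a recommendation, so tune them to your own hardware:

[clustering]
# example values only; higher numbers let peers run more fixup/replication jobs in parallel
max_peer_build_load = 8
max_peer_rep_load = 10

Remember these settings take effect on the cluster master, and raising them puts more load on the peers while the fixup backlog drains.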
"Bucket is already registered with the peer" means during bucket replication, that indexer peer attempted to replicate a bucket to another peer, but the target peer already has that bucket registered possibly as a primary or searchable copy. Therefore, it refuses to overwrite or duplicate it.
Run the REST search below and check the health of the cluster:
| rest /services/cluster/master/buckets | table title, bucket_flags, replication_count, search_count, status
Also check for any standalone bucket issues; that may be another cause.
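If I remember right, the same endpoint exposes a standalone flag you can filter on; I'm not certain the field name is identical on every version, so treat this as a sketch:

| rest /services/cluster/master/buckets
| search standalone=1
| table title, bucket_flags, replication_count, search_count, status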