Indexer Cluster Fixup Tasks Stuck (fsck failed: exitCode=24)(bucket is already registered)

azer271
Path Finder

We put the Splunk cluster master into maintenance mode, took one of the indexers offline and brought it back online, and then disabled maintenance mode. Since then, the fixup tasks have been stuck for about a week. The number of pending fixup tasks dropped from around 5xx to 102 after I deleted the rb bucket. I assume this is a bucket-syncing issue in the indexer cluster, because the client's servers are a bit laggy (network delay, low CPU).

There are 40 fixup tasks in progress and 102 fixup tasks pending in the indexer cluster master.

The internal log shows the following errors for all 40 in-progress tasks:

Getting size on disk: Unable to get size on disk for bucket id=xxxxxxxxxxxxx path="/splunkdata/windows/db/rb_xxxxxx" (This is usually harmless as we may be racing with a rename in BucketMover or the S2SFileReceiver thread, or merge-buckets command which should be obvious in log file; the previous WARN message about this path can safely be ignored.) caller=serialize_SizeOnDisk

Delete dir exists, or failed to sync search files for bid=xxxxxxxxxxxxxxxxxxx; will build bucket locally. err= Failed to sync search files for bid=xxxxxxxxxxxxxxxxxxx from srcs=xxxxxxxxxxxxxxxxxxxxxxx

CMSlave [6205 CallbackRunnerThread] - searchState transition bid=xxxxxxxxxxxxxxxxxxxxx from=PendingSearchable to=Unsearchable reason='fsck failed: exitCode=24 (procId=1717942)'

Getting size on disk: Unable to get size on disk for bucket id=xxxxxxxxxxxxx path="/splunkdata/windows/db/rb_xxxxxx" (This is usually harmless as we may be racing with a rename in BucketMover or the S2SFileReceiver thread, or merge-buckets command which should be obvious in log file; the previous WARN message about this path can safely be ignored.) caller=serialize_SizeOnDisk

The internal log shows the following error for all 102 pending tasks:

ERROR TcpInputProc [6291 ReplicationDataReceiverThread] - event=replicationData status=failed err="Could not open file for bid=windows~xxxxxx err="bucket is already registered with this peer" (Success)" 

Does anyone know what "fsck failed exit code 24" and "bucket is already registered with this peer" mean? How can these issues be resolved to reduce the number of fixup tasks? Thanks.

 

thahir
Communicator

@azer271 

"Bucket is already registered with the peer" means during bucket replication, that indexer peer attempted to replicate a bucket to another peer, but the target peer already has that bucket registered possibly as a primary or searchable copy. Therefore, it refuses to overwrite or duplicate it.

Run the REST search below to check the health of the cluster:

| rest /services/cluster/master/buckets | table title, bucket_flags, replication_count, search_count, status

Also check for standalone buckets, since they may be another cause.
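To see which buckets are behind each error, it can help to group the log lines by bucket ID before comparing them with the cluster master's bucket list. Below is a minimal Python sketch, assuming the log lines have the same shape as the ones quoted in the question. The function name, the regular expressions, and the sample bucket IDs (`windows~12~AAAA`, `windows~13~BBBB`) are made up for illustration and are not part of Splunk.

```python
import re
from collections import defaultdict

# Patterns modelled on the two log lines quoted in this thread.
# Assumption: real lines keep the same wording around "bid=".
FSCK_RE = re.compile(
    r"searchState transition bid=(?P<bid>\S+) .*"
    r"reason='fsck failed: exitCode=(?P<code>\d+)"
)
REGISTERED_RE = re.compile(
    r'Could not open file for bid=(?P<bid>\S+) '
    r'err="bucket is already registered with this peer"'
)

# Hypothetical sample lines; the bucket IDs are invented placeholders.
SAMPLE_LINES = [
    "CMSlave [6205 CallbackRunnerThread] - searchState transition "
    "bid=windows~12~AAAA from=PendingSearchable to=Unsearchable "
    "reason='fsck failed: exitCode=24 (procId=1717942)'",
    "ERROR TcpInputProc [6291 ReplicationDataReceiverThread] - "
    "event=replicationData status=failed err=\"Could not open file for "
    "bid=windows~13~BBBB err=\"bucket is already registered with this peer\" "
    "(Success)\"",
    "INFO some unrelated line",
]


def group_bucket_errors(lines):
    """Return {error_kind: set of bucket IDs} for the two errors above."""
    found = defaultdict(set)
    for line in lines:
        match = FSCK_RE.search(line)
        if match:
            found["fsck_exit_" + match.group("code")].add(match.group("bid"))
            continue
        match = REGISTERED_RE.search(line)
        if match:
            found["already_registered"].add(match.group("bid"))
    return dict(found)


if __name__ == "__main__":
    for kind, bids in sorted(group_bucket_errors(SAMPLE_LINES).items()):
        print(kind, sorted(bids))
```

To use it on a real peer, pass the lines of that indexer's splunkd.log to `group_bucket_errors` instead of `SAMPLE_LINES`, then look up the resulting bucket IDs in the output of the REST search above.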
