
Why does a cluster halt with "ERROR BucketReplicator - Could not find size of file..."?

ddrillic
Ultra Champion

Our indexer cluster got all its queues filled up yesterday and no data got indexed for around 10 hours.

Support determined that the root cause was the problem reported by messages of this type:

12-19-2017 05:59:43.776 -0600 ERROR BucketReplicator - Could not find size of file=/SplunkIndexData/splunk-indexes/<db name>/colddb/rb_1508261105_1508249969_824_EE604247-781F-45EF-B2C4-C966B93CE78C/rawdata/journal.gz for bid=<db name>~824~EE604247-781F-45EF-B2C4-C966B93CE78C. stat() failed. No such file or directory.

Using the search index=_internal "No such file or directory" | dedup file | table host, we identified the broken buckets and then removed them.
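For anyone working from raw splunkd.log lines rather than the search, here is a minimal Python sketch that pulls the bucket folder and bid out of messages like the one quoted above. The regular expression is based only on that single sample message, and the function names parse_replicator_error and broken_buckets are hypothetical, so treat it as a starting point rather than a tested tool.

```python
import posixpath
import re

# Pattern derived from the single sample message in this post (an assumption:
# other Splunk versions may word the message differently). Non-greedy groups
# are used because the path and the bid may contain spaces.
PATTERN = re.compile(
    r"ERROR BucketReplicator - Could not find size of file=(?P<file>.+?) "
    r"for bid=(?P<bid>.+?)\. stat\(\) failed"
)


def parse_replicator_error(line):
    """Return (bucket_dir, bid) for a matching log line, otherwise None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    journal = match.group("file")
    # journal.gz sits in <bucket>/rawdata/, so the bucket folder is two levels up.
    bucket_dir = posixpath.dirname(posixpath.dirname(journal))
    return bucket_dir, match.group("bid")


def broken_buckets(lines):
    """Map each distinct bid to its bucket folder for an iterable of log lines."""
    found = {}
    for line in lines:
        parsed = parse_replicator_error(line)
        if parsed is not None:
            bucket_dir, bid = parsed
            found.setdefault(bid, bucket_dir)
    return found
```

Feeding the lines of splunkd.log from each peer to broken_buckets gives one entry per affected bucket, which is the same list the dedup in the search produces.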

I wonder what could have caused this, and why the whole cluster would stop functioning because of it.

1 Solution

mchang_splunk
Splunk Employee

This is most often caused by a system outage, or by splunkd being killed while buckets were being updated.
If you check the bucket folder /SplunkIndexData/splunk-indexes/<db name>/colddb/rb_1508261105_1508249969_824_EE604247-781F-45EF-B2C4-C966B93CE78C, you will see that its rawdata subfolder has disappeared.

Since the raw data is missing, the only solution is to remove the affected buckets entirely so that the replication process can work again.
If the replication factor (RF) is 2 or higher, these buckets should be replicated again from the other cluster peers without data loss.
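The check described above (a bucket folder whose rawdata/journal.gz is gone) can be sketched in Python as a read-only scan. This is only an illustration: the function name buckets_missing_journal is hypothetical, the db_ prefix is an assumption alongside the rb_ prefix seen in the log message, and the journal file name is taken from that same message. The scan only reports folders and removes nothing.

```python
import os


def buckets_missing_journal(index_root, journal_name="journal.gz"):
    """Return bucket folders under index_root that lack rawdata/<journal_name>.

    Assumption: bucket folders are recognised by a db_ or rb_ name prefix.
    """
    missing = []
    for dirpath, dirnames, _filenames in os.walk(index_root):
        # Iterate over a copy, because matched buckets are pruned from dirnames.
        for name in list(dirnames):
            if name.startswith(("db_", "rb_")):
                bucket = os.path.join(dirpath, name)
                journal = os.path.join(bucket, "rawdata", journal_name)
                if not os.path.isfile(journal):
                    missing.append(bucket)
                # No need to walk inside the bucket folder itself.
                dirnames.remove(name)
    return sorted(missing)
```

Running it against the index root on each peer (for example /SplunkIndexData/splunk-indexes from the message above) lists candidate buckets to review before anything is removed.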

ddrillic
Ultra Champion

Much appreciated @mchang!

My point, though, is that the entire cluster stalled because of a few corrupt buckets.
