Splunk Enterprise

Corrupted buckets in Splunk Indexer cluster (Splunk 9.4.3)

PRusconi91
Engager

Hello,

we have a cluster of 6 indexers distributed across 2 sites. After a patching activity on these servers, which required restarting the indexers one by one (we put the cluster in maintenance mode beforehand and removed it afterwards), we started to see an issue with some buckets.

This is causing the cluster to fail to meet both the Search Factor (SF=2) and the Replication Factor (RF=2).

From the Bucket Status dashboard of the Monitoring Console we see that only one specific peer has the buckets with:

  • Status: Complete / Search State: Searchable

while on the others:

  • Status: NonStreamingTarget / Search State: Unsearchable

We checked that the configuration for these indexes is the same on all the indexers, in order to rule out problems due to misconfiguration.

Looking in the splunkd.log of the affected indexer we found the following error for many buckets:

"Corrupt bucket report: bid=XXX error='Error while trying to search bucket=XXX (error='Failed to read compression bits from bucket=XXX - exception thrown: JournalSliceDirectory: Cannot seek to rawdata offset 0, path="XXX/rawdata" Please check/repair bucket path='XXX' with 'fsck' as it could be corrupted.') Results may be incomplete!"

We also saw that selecting "Roll" and "Resync" from the Bucket Status dashboard had no effect.

We did not try "Delete Copy" because we were not sure whether it would delete the bucket only on the problematic indexer or on all indexers. Could you please confirm what exactly would be deleted with "Delete Copy"?

Otherwise, in order to fix this issue, we could:

  • Put the cluster in maintenance mode
  • Stop Splunk on the affected peer
  • Remove the corrupted bucket
  • Start Splunk again on the indexer
  • Force the resync from the CM

Is this the correct approach to fix the problem?

We would like to be sure before proceeding with deleting the buckets, because we do not want to lose data.
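For reference, a minimal sketch of the steps above, assuming the default SPLUNK_HOME layout; the index name, bucket ID, and backup path are placeholders, not values from our environment:

```shell
# On the cluster manager: enable maintenance mode so the peer
# restart does not trigger bucket-fixup activity.
splunk enable maintenance-mode

# On the affected peer: take it offline gracefully
# (preferred over a plain 'splunk stop' for cluster peers).
splunk offline

# Move the corrupted bucket copy aside rather than deleting it,
# so it can be restored if something goes wrong.
# <index> and <bucket_id> are placeholders.
mv "$SPLUNK_HOME/var/lib/splunk/<index>/db/<bucket_id>" /backup/corrupt-buckets/

# Restart the peer; it rejoins the cluster automatically.
splunk start

# Back on the cluster manager: leave maintenance mode so the
# cluster can run fixup and restore SF/RF from surviving copies.
splunk disable maintenance-mode
```

Whether fixup can actually restore the missing copies depends on at least one intact copy of each bucket existing elsewhere in the cluster.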

Thanks!


PRusconi91
Engager

Thank you all for your answers so far!

We checked the status of the corrupted buckets with the dbinspect command (time range: All time) and this is what we discovered (see screenshot in attachment). Within the cluster the buckets are not syncing, and on only one indexer they have rolled to warm.
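For anyone following along, a dbinspect check like the one described can be run from the CLI of any search head; the index name and credentials below are placeholders:

```shell
# List bucket state per peer for the index in question.
# '<your_index>' and the -auth credentials are placeholders.
splunk search \
  '| dbinspect index=<your_index>
   | table bucketId, state, splunk_server, modTime' \
  -auth admin:changeme
```

Comparing the `state` and `splunk_server` columns across peers shows which copies rolled to warm and which did not.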

We tried again "Roll", "Resync" and "Delete Copy" among the available actions from the Bucket Status dashboard, however nothing changed.


kknairr
Contributor

@PRusconi91 Thanks for the detailed note. This looks like a Splunk indexer cluster bucket corruption scenario that we have faced previously. The section below should help with your "Delete Copy" question.

Marking the answer and giving Karma helps others find solutions faster!

From Splunk: 

"You can delete either a single copy of a bucket on a specific peer, or all copies of a bucket across the entire cluster.

If deleting a single copy causes the cluster to lose its complete state, the cluster will engage in fixup activities so that the bucket again meets both the search factor and the replication factor. This situation might result in another copy of the bucket appearing on the same peer.
If, however, the specified bucket is frozen, the cluster does not attempt any fixup activities." 

Ref: Anomalous bucket issues | Splunk Enterprise  


PickleRick
SplunkTrust
SplunkTrust

Close, but a bit inaccurate.

A frozen bucket should not reside on hot/warm or cold storage. So if a bucket should already have been frozen but is still there, something must have gone wrong during freezing. Anyway, a frozen bucket is not searchable, so in a typical scenario (no frozen storage configured, meaning buckets are deleted when rolled to frozen), deleting a bucket that should have been frozen but wasn't doesn't hurt you.


MichaelScott
New Member

Have you gone down the path the error itself suggests?

path="XXX/rawdata" Please check/repair bucket path='XXX' with 'fsck' as it could be corrupted.

Here's what I'd do:

Put the cluster in maintenance mode.
Stop Splunk on the indexer having this issue.
See what mount/filesystem Splunk is on:

lsblk -f
df -Th
mount | grep splunk

Once you've found where Splunk is installed, see what this reports:

sudo fsck -n /dev/blockDeviceSplunkIsInstalledOn

That only checks, without writing anything. To actually attempt a repair (riskier):

sudo fsck /dev/blockDeviceSplunkIsInstalledOn

Regardless, at least take a snapshot of the server first, or cp -r /opt/splunk and move the copy off the server.
Let us know what you find.
let us know what you find.


PickleRick
SplunkTrust
SplunkTrust

Close, but the Splunk message is referring to the "splunk fsck" command, not the general filesystem fsck (which might also be in order, but that's another story).
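For reference, the Splunk-level fsck looks roughly like this, run as the splunk user on the affected peer; the index name and bucket ID are placeholders:

```shell
# Dry run: report problems with one bucket without changing anything.
splunk fsck scan --one-bucket --bucket-path="$SPLUNK_DB/<index>/db/<bucket_id>"

# Or scan every bucket in every index (can take a long time).
splunk fsck scan --all-buckets-all-indexes

# Attempt a repair of the single bucket (back up the bucket first).
splunk fsck repair --one-bucket --bucket-path="$SPLUNK_DB/<index>/db/<bucket_id>"
```

On a cluster peer, it's safer to do this with the cluster in maintenance mode so the repair doesn't race with fixup activity.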
