Hello,
we have a cluster of 6 indexers distributed across 2 sites. After a patching activity on these servers that required restarting the indexers one by one (we put the cluster in maintenance mode before the restarts and removed it afterwards), we started to face an issue with some buckets.
This is causing a failure in reaching both the Search Factor (SF=2) and the Replication Factor (RF=2).
From the Bucket Status dashboard of the Monitoring Console we see that only one specific peer has buckets with:
while on the others:
We checked that the configuration for these indexes is the same on all the indexers, in order to rule out problems due to misconfiguration.
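For reference, one way to compare the effective settings is to run btool on each indexer and diff the output (the index name below is just a placeholder):
$SPLUNK_HOME/bin/splunk btool indexes list <index_name> --debug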
Looking in the splunkd.log of the affected indexer we found the following errors for many buckets:
"Corrupt bucket report: bid=XXX error='Error while trying to search bucket=XXX (error='Failed to read compression bits from bucket=XXX - exception thrown: JournalSliceDirectory: Cannot seek to rawadata offet 0, path="XXX/rawdata" Please check/repair bucket path='XXX' wih 'fsck' as it could be corrupted.') Results may be incomplete!"
We saw from the Bucket Status dashboard that selecting "Roll" and "Resync" had no effect.
We did not try "Delete Copy" because we were not sure whether it would delete the bucket only on the problematic indexer or on all indexers. Could you please confirm what would be deleted with "Delete Copy"?
Otherwise, in order to fix this issue, we could:
Is this actually the correct approach to fix the problem?
We would like to be sure before proceeding with deleting the buckets because we do not want to lose the data.
Thanks!
Thank you so far for your answers!
We checked the status of the corrupted buckets using the dbinspect command (time range: All time) and this is what we discovered (see screenshot in attachment). The buckets are not syncing within the cluster, and they rolled to "warm" on only one indexer.
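A search along these lines lists only the buckets that dbinspect flags as corrupt (the index name is just a placeholder):
| dbinspect index=<index_name> corruptonly=true
| table bucketId, index, state, splunk_server, path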
We tried "Roll", "Resync" and "Delete Copy" again among the available actions in the Bucket Status dashboard, but nothing changed.
@PRusconi91 Thanks for the detailed note. This looks like a Splunk indexer cluster bucket corruption scenario we have faced previously. The section below should help with your "Delete Copy" question.
From Splunk:
"You can delete either a single copy of a bucket on a specific peer, or all copies of a bucket across the entire cluster.
If deleting a single copy causes the cluster to lose its complete state, the cluster will engage in fixup activities so that the bucket again meets both the search factor and the replication factor. This situation might result in another copy of the bucket appearing on the same peer.
If, however, the specified bucket is frozen, the cluster does not attempt any fixup activities."
Ref: Anomalous bucket issues | Splunk Enterprise
Close but a bit inaccurate.
A frozen bucket should not reside on hot/warm or cold storage. So if a bucket should already have been frozen but is still there, something must have gone wrong during freezing. In any case, a frozen bucket is not searchable, so in a typical scenario (no frozen storage configured, so buckets are deleted when rolled to frozen), deleting a bucket that should have been frozen but wasn't does not hurt you.
Have you gone down this path of checking?
path="XXX/rawdata" Please check/repair bucket path='XXX' wih 'fsck' as it could be corrupted.
Here's what I'd do.
Put the cluster in maintenance mode.
Stop Splunk on the indexer having this issue.
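For reference, a sketch of the corresponding CLI (paths and install locations may differ on your hosts): on the cluster manager run
splunk enable maintenance-mode
then on the affected indexer
$SPLUNK_HOME/bin/splunk stop
and remember to run splunk disable maintenance-mode on the manager once you're done.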
See what mount/filesystem Splunk is on:
lsblk -f
df -Th
mount | grep splunk
Once you've found where Splunk is installed, see what this reports:
sudo fsck -n /dev/blockDeviceSplunkIsInstalledOn
That just checks, without making any changes.
For YOLO
sudo fsck /dev/blockDeviceSplunkIsInstalledOn
Regardless, at least take a snapshot of the server or cp -r /opt/splunk and move it off the server.
let us know what you find.
Close, but the Splunk message is referring to the "splunk fsck" command, not the general filesystem fsck (which might also be in order, but that's another story).
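For completeness, a sketch of that command on the affected indexer (exact flags vary by version, check splunk fsck --help; the bucket path is a placeholder for the one shown in the error message):
$SPLUNK_HOME/bin/splunk fsck scan --all-buckets-all-indexes
$SPLUNK_HOME/bin/splunk fsck repair --one-bucket --bucket-path="<path_to_corrupt_bucket>"
The scan form is read-only; the repair form rebuilds the bucket's index files from rawdata, so copy the bucket directory somewhere safe first.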