I have a clustered Splunk environment (also known as bucket replication) with
--One Cluster Master
--Five cluster Peers
--Search Head.
One of our cluster peers ran out of disk space on the partition holding the hot and warm buckets, and as a result some bad buckets were created.
We have resolved the disk issue, but the cluster master is now reporting bad buckets, so the search factor and replication factor are not met.
Messages such as this one appear as warnings on the Cluster Master:
Search peer indexer01.example.com has the following message: Failed to make bucket = improbable_logs~1368~D823EFB4-14AA-4C97-9500-E21A12608EC4 searchable, retry count = 13.
This is Splunk version 6.1.4.
Since you already know the root cause of these bad buckets, and if you have analyzed them and concluded that they cannot be recovered, you can delete them using the commands listed below.
For this discussion, let's say the bad bucket to be deleted belongs to index=_audit and its bucket ID is "_audit~1~350142A5-6AFF-4852-A45C-2A7CDF8FE540".
To delete this bucket, run the following Splunk commands on the cluster master.
First, put the cluster master into maintenance mode:
$SPLUNK_HOME/bin/splunk enable maintenance-mode
Use the command below to delete the bucket. Note that this command, run from the cluster master, physically deletes the bucket from all the peers.
$SPLUNK_HOME/bin/splunk _internal call /services/cluster/master/buckets/_audit~1~350142A5-6AFF-4852-A45C-2A7CDF8FE540/remove_all -method POST
Take the cluster master out of maintenance mode:
$SPLUNK_HOME/bin/splunk disable maintenance-mode
Navigate to the index and check that the bucket has been deleted.
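The three steps above could be wrapped in a small script so that maintenance mode is always switched off again, even if the removal call fails. This is only a sketch: SPLUNK_BIN, DRY_RUN, run and remove_bucket are names made up for this example, the /opt/splunk fallback is an assumed install path, and the Splunk CLI may still prompt for credentials or confirmation when run for real.

```shell
#!/bin/sh
# Sketch only: run on the cluster master.
# SPLUNK_BIN, DRY_RUN, run and remove_bucket are hypothetical names.
SPLUNK_BIN="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"   # assumed install path
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove one bucket from all peers; $1 is the full bucket ID.
remove_bucket() {
    bucket_id="$1"
    run "$SPLUNK_BIN" enable maintenance-mode || return 1
    run "$SPLUNK_BIN" _internal call \
        "/services/cluster/master/buckets/${bucket_id}/remove_all" -method POST
    status=$?
    # Leave maintenance mode whether or not the removal call succeeded.
    run "$SPLUNK_BIN" disable maintenance-mode
    return "$status"
}
```

Calling remove_bucket "_audit~1~350142A5-6AFF-4852-A45C-2A7CDF8FE540" with the default DRY_RUN=1 only prints the three commands, which gives you a chance to review them before setting DRY_RUN=0.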
You are correct: deleting the bucket will cause its data to be lost. Log a Splunk Support case.
One thing to watch out for in splunkd.log on the CM (cluster master) when performing the removal is:
02-11-2015 09:26:16.386 -0600 WARN CMMaster - did not schedule removal for peer=...
It appears that an fsck or other activity on the peer prevented the removal, even though the REST call returned a 200. In my case, when the peers were restarted, the damaged buckets began replicating again.
Making the same call a few times while watching for the absence of that warning in splunkd.log did the trick for me.
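The "repeat the call and watch splunkd.log" approach could be sketched roughly as follows. The names LOG, WAIT_SECS, removal_warnings_since and retry_removal are hypothetical, the /opt/splunk fallback is an assumed install path, and the 5-second wait is an arbitrary guess at how long the master needs to log the warning. A clean log only means the warning did not show up; you should still check the peers.

```shell
#!/bin/sh
# Sketch only: LOG, WAIT_SECS, removal_warnings_since and retry_removal
# are hypothetical names; the log location assumes a default install.
LOG="${LOG:-${SPLUNK_HOME:-/opt/splunk}/var/log/splunk/splunkd.log}"

# Count "did not schedule removal" warnings written after line $1 of the log.
removal_warnings_since() {
    start_line="$1"
    tail -n "+$((start_line + 1))" "$LOG" | grep -c 'did not schedule removal'
}

# retry_removal MAX_TRIES COMMAND [ARGS...]
# Runs COMMAND up to MAX_TRIES times and stops as soon as one attempt
# produces no new warning in the log.
retry_removal() {
    max_tries="$1"; shift
    try=1
    while [ "$try" -le "$max_tries" ]; do
        before=$(wc -l < "$LOG")     # remember where the log ended
        "$@"                         # the removal command from the caller
        sleep "${WAIT_SECS:-5}"      # arbitrary wait for the master to log
        if [ "$(removal_warnings_since "$before")" -eq 0 ]; then
            return 0                 # no new warning after this attempt
        fi
        try=$((try + 1))
    done
    return 1                         # warning appeared on every attempt
}
```

The command passed to retry_removal would be the same remove_all call shown earlier in this thread; a non-zero return means the warning appeared on every attempt.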
By deleting the bucket, the data will be lost, correct? Is there no alternative that avoids losing the raw data?