In the 5.x code (up to 5.0.2), there is a painful bug in the cluster master (CM): it does not persist knowledge of frozen buckets across restarts (Bug ID: SPL-65100). The result is that the CM kicks off dozens (hundreds? thousands?) of bucket fixup jobs on the indexers, and until enough buckets are fixed up that the index becomes searchable again, there is a 100% search outage. Bad.
Speeding up the recovery requires digging into the configs and altering two settings, sized to the indexers available. In our case, the indexers have 24 cores and can handle up to 24 concurrent fixup jobs and splunk-optimize processes. Since we are on a very fast SAN, there is almost no iowait for storage.
Note that you will want to adjust these settings based on your indexers' CPU core count and available IOPS; these processes can be very I/O intensive.
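For a rough starting point, here is a minimal sizing sketch in shell. It simply takes the core count minus one (our reading of the 23-on-24-core choice below, not a Splunk-documented formula) and assumes nproc is available on the indexer:
# Hypothetical sizing helper: leave one core of headroom for ingest/search
CORES=$(nproc)
FIXUP_SLOTS=$((CORES - 1))
echo "Suggested value for max_peer_build_load / maxRunningProcessGroups: ${FIXUP_SLOTS}"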
The settings required to make this go faster are:
On the CM, in /opt/splunk_clustermaster/etc/system/local/server.conf:
[clustering]
max_peer_build_load = 23
On each of the indexers, managed via the CM's configuration bundle (i.e., edited on the CM in /opt/splunk_clustermaster/etc/master-apps/_cluster/local/indexes.conf):
[default]
maxRunningProcessGroups = 23
Make sure the indexers are restarted to pick up this change; the CM will pick up its change when it restarts. After a CM restart, bucket fixup will run massively parallel across all indexers and bury them. That's okay: when there is a 100% search outage, USE ALL THE RESOURCES to get search back online!
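For reference, a rough sketch of rolling the change out with the standard splunk CLI, assuming the CM instance lives under /opt/splunk_clustermaster as in the paths above (these are the documented clustering CLI commands, but verify the exact flags against your version's docs):
# Restart the CM so [clustering] picks up max_peer_build_load
/opt/splunk_clustermaster/bin/splunk restart
# Push the edited master-apps bundle to the peers (may trigger a rolling restart)
/opt/splunk_clustermaster/bin/splunk apply cluster-bundle --answer-yes
/opt/splunk_clustermaster/bin/splunk show cluster-bundle-status
# If the peers don't restart on their own, force a rolling restart from the CM
/opt/splunk_clustermaster/bin/splunk rolling-restart cluster-peers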
Disclaimer: Splunk support may not approve of these changes. Splunk support also doesn't have to take the lashing the local Splunk admin does when Splunk is unavailable for searching. Be careful. 🙂