Deployment Architecture

Splunk 5.x Cluster Manager restart results in 100% search outage

davidpaper
Contributor

Do you have Splunk 5.0, 5.0.1, or 5.0.2 and have chosen to run clustered peers? I bet you've noticed that when you have a lot of buckets (1000s) and the Cluster Master process restarts, there is a 100% search outage for a while. Then it magically fixes itself. It's not magic. It's Splunk! Why is this happening?

1 Solution

davidpaper
Contributor

In the 5.x code (up to 5.0.2), there is a painful bug in the CM (Cluster Master): it does not persist knowledge of frozen buckets across restarts (Bug ID: SPL-65100). As a result, after a restart the CM kicks off dozens (hundreds? thousands?) of bucket fixup jobs on the indexers. Until enough buckets are fixed up that the indexes become searchable again, there is a 100% search outage. Bad.

Speeding up the recovery requires digging into the configs and altering two settings, based on the "size" of the indexers available. In our case, the indexers have 24 cores and can handle up to 24 concurrent fixup jobs and splunk-optimize processes. Since we are on a very fast SAN, there is almost no iowait on storage.

Note that you will want to adjust these settings based on your indexers' CPU core count and available IOPS; these processes can be very I/O intensive. A quick sanity check is shown below.
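As a rough sizing check (just a sketch, assuming Linux indexers with the sysstat package installed), count the cores and eyeball iowait before picking a number:

# Count CPU cores on an indexer (pick a value at or just below this)
nproc

# Watch the %iowait column and per-device utilization for a minute
iostat -x 5 12

If iowait is already noticeable under normal load, be conservative; the fixup and splunk-optimize processes will add plenty of I/O on their own.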

The settings required to make this go faster are as follows. On the CM, in /opt/splunk_clustermaster/etc/system/local/server.conf:

[clustering]
max_peer_build_load = 23
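To confirm the CM actually picked this up, btool will show the on-disk value (a sketch: the path reflects this post's non-default install location, and btool reports on-disk config, not what the running process has loaded):

/opt/splunk_clustermaster/bin/splunk btool server list clustering --debug | grep max_peer_build_load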

On each of the indexers, managed via the CM's configuration bundle in /opt/splunk_clustermaster/etc/master-apps/_cluster/local/indexes.conf:

[default]
maxRunningProcessGroups = 23
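One way to push the edited indexes.conf out to the peers is the CM's cluster-bundle mechanism (a sketch; prompts and behavior can differ between 5.x releases, so treat this as illustrative):

# Run on the CM: distribute the updated master-apps bundle to all peers
/opt/splunk_clustermaster/bin/splunk apply cluster-bundle --answer-yes

Peers may be restarted as part of the bundle push if the change requires it, but as the next paragraph says, verify they really did restart before assuming the new limit is in effect.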

Make sure the indexers are restarted to pick up the indexes.conf change; the CM picks up its server.conf change the next time it restarts. After a CM restart, bucket fixup will run massively in parallel across all indexers and bury them. That's okay: when there is a 100% search outage, USE ALL THE RESOURCES to get search back online.
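If you want a rough view of how buried an indexer is while fixup runs (a sketch, assuming standard Linux tooling), watching the count of splunk-optimize processes alongside load average is usually enough:

# Count running splunk-optimize processes, refreshed every 5 seconds
watch -n 5 'pgrep -fc splunk-optimize'

# One-off check of load average on the indexer
uptime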

Disclaimer: Splunk support may not approve of these changes. Splunk support also doesn't have to take the lashing the local Splunk admin does when Splunk is unavailable for searching. Be careful. 🙂

davidpaper
Contributor

An interesting caveat to this: if you alter the settings above to shorten the search outage when the CM restarts, they also affect how aggressive the CM is when it takes an indexing peer out of rotation. When it rebalances the cluster (making buckets searchable, replicating buckets to a new peer), it takes advantage of the same settings, and CPU usage goes way up during the rebalance.
