
Splunk 5.x Cluster Manager restart results in 100% search outage

davidpaper
Contributor

Do you have Splunk 5.0, 5.0.1, or 5.0.2 and have chosen to run clustered peers? I bet you've noticed that when you have a lot of buckets (1000s) and the Cluster Master process restarts, there is a 100% search outage for a while. Then it magically fixes itself. It's not magic. It's Splunk! Why is this happening?

1 Solution

davidpaper
Contributor

In the 5.x code (up to 5.0.2), there is a painful bug in the Cluster Master (CM): it does not persist knowledge of frozen buckets across restarts (Bug ID: SPL-65100). This results in the CM kicking off dozens (hundreds? thousands?) of bucket fixup processes on the indexers. Until enough buckets are fixed up that the index becomes searchable again, there is a 100% search outage. Bad.
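
If you want to watch the fixup backlog drain while this is happening, the CM's REST API is the easiest window into it. A minimal sketch, assuming the default management port 8089 and placeholder admin credentials; the cluster/master endpoints below come from the clustering REST reference, so double-check they exist on your exact 5.x build:

# Peer status as the CM sees it
curl -k -u admin:changeme https://cluster-master:8089/services/cluster/master/peers

# Per-bucket state (searchable vs. pending fixup); output can be huge with 1000s of buckets
curl -k -u admin:changeme https://cluster-master:8089/services/cluster/master/buckets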

Speeding up the recovery requires digging into the configs and altering two settings, based on the "size" of the indexers available. In our case, the indexers are 24-core and can handle up to 24 concurrent fixup jobs and splunk-optimize processes. Since we are on a very fast SAN, there is almost no iowait for storage.

Note that you will want to adjust these settings based on your indexers' CPU core count and available IOPS. These processes can be very IO-intensive.

The settings required to make this go faster are:

On the CM, in /opt/splunk_clustermaster/etc/system/local/server.conf:

[clustering]
max_peer_build_load = 23

On each of the indexers, managed through the CM bundle in /opt/splunk_clustermaster/etc/master-apps/_cluster/local/indexes.conf:

[default]
maxRunningProcessGroups = 23

Make sure the indexers are restarted to pick up the indexes.conf change; the CM will pick up the server.conf change when it restarts. After a CM restart, bucket fixup will run massively parallel across all indexers and bury them. That's okay, because when there is a 100% search outage, USE ALL THE RESOURCES to get search back online!
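
Since the indexes.conf change lives in master-apps on the CM, you don't have to touch each peer by hand; pushing the bundle typically triggers the peer restart for you. A minimal sketch, assuming $SPLUNK_HOME for the CM instance is /opt/splunk_clustermaster as above (verify the show command is available on your 5.x build):

# Push the edited master-apps bundle to all peers; this restarts them so the
# new maxRunningProcessGroups value takes effect
/opt/splunk_clustermaster/bin/splunk apply cluster-bundle

# Confirm the peers are running the new bundle
/opt/splunk_clustermaster/bin/splunk show cluster-bundle-status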

Disclaimer: Splunk support may not approve of these changes. Splunk support also doesn't have to take the lashing the local Splunk admin does when Splunk is unavailable for searching. Be careful. 🙂


davidpaper
Contributor

An interesting caveat to this -- if you alter the settings above to shorten the search outage when the CM restarts, they also affect how aggressive the CM is when an indexing peer is taken out of rotation. When the CM rebalances the cluster (making buckets searchable, replicating buckets to a new peer), it takes advantage of these same settings, so CPU usage goes way up during the rebalance.
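
One way to see how hard that rebalance is hitting the peers is with plain OS tooling on an indexer while fixup is running. A minimal sketch, assuming Linux hosts with the sysstat package installed:

# Per-core CPU every 5 seconds -- fixup and splunk-optimize show up as user time
mpstat -P ALL 5

# Extended disk stats and iowait -- confirm the storage really has IOPS headroom
# before raising max_peer_build_load / maxRunningProcessGroups
iostat -x 5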
