Hello,
I'd like to know why, when we initiate a rolling restart of the indexer cluster, we see a lot of bucket fix-up tasks (for both search factor and replication factor). They put a lot of CPU and I/O pressure on the cluster.
I understand that fix-up is required when a node goes down or comes up (leaves or joins the cluster), but I don't understand why this should happen for a simple splunkd restart managed by the master.
Regards,
http://docs.splunk.com/Documentation/Splunk/6.1/Indexer/Restartthecluster says -
"When you restart a master or peer node, the master rebalances the primary bucket copies across the set of peers,..."
For some reason, it doesn't describe the toll that a rolling restart takes on the cluster.
The rolling restart essentially runs a "splunk offline" on each indexer, one by one. An "offline" of an indexer cluster slave is a controlled shutdown: all buckets for which it holds the primary copy are transitioned to another slave, and buckets are replicated as needed to maintain the replication factor, or made searchable as needed to maintain the search factor. These activities account for most of the fix-up tasks.
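To make the three kinds of fix-up concrete, here is a toy Python model of what happens to the bucket copies when one peer is taken offline. It is only a sketch under stated assumptions: the `BucketCopy` class, the `fixup_after_offline` function, and the peer and bucket names are all hypothetical, and this is not how the master actually schedules its work.

```python
from dataclasses import dataclass


@dataclass
class BucketCopy:
    peer: str            # hypothetical peer name holding this copy
    searchable: bool     # True if the copy has index files and can be searched
    primary: bool = False  # True if this copy currently answers searches


def fixup_after_offline(buckets, offline_peer, rep_factor, search_factor):
    """Count fix-up tasks by type once offline_peer's copies are unavailable.

    Toy model only: it assumes the cluster fully met both factors before
    the peer went offline.
    """
    tasks = {"primary": 0, "replication_factor": 0, "search_factor": 0}
    for copies in buckets.values():
        remaining = [c for c in copies if c.peer != offline_peer]
        if len(remaining) == len(copies):
            continue  # this bucket had no copy on the offline peer
        if not any(c.primary for c in remaining):
            tasks["primary"] += 1             # primary must move elsewhere
        if len(remaining) < rep_factor:
            tasks["replication_factor"] += 1  # a new copy must be streamed
        if sum(c.searchable for c in remaining) < search_factor:
            tasks["search_factor"] += 1       # a copy must be made searchable
    return tasks


# Three buckets, replication factor 3, search factor 2.
buckets = {
    "b1": [BucketCopy("idx1", True, True), BucketCopy("idx2", True), BucketCopy("idx3", False)],
    "b2": [BucketCopy("idx2", True, True), BucketCopy("idx3", True), BucketCopy("idx1", False)],
    "b3": [BucketCopy("idx2", True, True), BucketCopy("idx3", True), BucketCopy("idx4", False)],
}
result = fixup_after_offline(buckets, "idx1", rep_factor=3, search_factor=2)
print(result)
```

Even in this tiny example, every bucket with a copy on the offline peer generates at least one task, which is why the task count grows with the number of buckets each indexer holds.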
With that being said, and in particular for large clusters, a rolling restart can be quite traumatic, especially if you are glued to the Cluster Master console. The cluster will eventually recover most of the time, so it's probably best to kick off the restart, make sure it actually took, and then give it 5 minutes before checking in.
Please let me know if this answers your question! 😄
As per the documentation:
Warning: While the cluster is in maintenance mode, the master will not enforce replication factor or search factor policies. The only bucket fix-up that occurs during maintenance mode is that the master will attempt, when necessary, to reassign primaries to available searchable bucket copies. So, the cluster can be operating under a valid but incomplete status. See Indexer cluster states to understand the implications of this.
Note: The CLI commands apply cluster-bundle and rolling-restart incorporate maintenance mode functionality into their behavior by default, so there's no reason to invoke it explicitly when running those commands. A message stating that maintenance mode is on will appear on the master dashboard when you invoke these actions.
So, as I understand it, there should be no fix-up tasks (except reassigning primaries to searchable copies) during a cluster rolling restart. But there are... a lot.
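For reference, the behaviour I expect from the quoted documentation can be written as a small Python sketch. The `allowed_fixups` function and the task dictionaries are hypothetical names made up for illustration, not anything from Splunk itself; the model also assumes that tasks skipped in maintenance mode are only postponed until it is switched off.

```python
def allowed_fixups(pending, maintenance_mode):
    """Return the fix-up tasks the master would act on in this toy model.

    In maintenance mode only primary reassignment is attempted; the
    replication factor and search factor are not enforced.
    """
    if not maintenance_mode:
        return list(pending)
    return [task for task in pending if task["type"] == "primary"]


# Hypothetical pending tasks after one peer has been taken offline.
pending = [
    {"bucket": "b1", "type": "primary"},
    {"bucket": "b1", "type": "replication_factor"},
    {"bucket": "b2", "type": "search_factor"},
]

during_restart = allowed_fixups(pending, maintenance_mode=True)
after_restart = allowed_fixups(pending, maintenance_mode=False)
print(len(during_restart), len(after_restart))
```

If the rolling restart really behaved like the `maintenance_mode=True` branch, I would expect to see only the primary reassignments while peers are restarting.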