A related question: Is it possible during an upgrade to postpone the replication by a day or so?
Another related question: Can we use the two sites of a multisite indexer cluster to improve the Splunk upgrade?
Seems like you forgot to go into maintenance mode?
Run splunk enable maintenance-mode on the master. To confirm that the master is in maintenance mode, run splunk show maintenance-mode. This step prevents unnecessary bucket fix-ups.
The SE said -
-- When the Splunk indexing cluster is put into maintenance mode (a prerequisite for upgrades), replication of buckets between indexers stops.
Once maintenance mode is lifted, the buckets need to be "fixed up". From a technical perspective this means:
The indexer contacts the cluster master and registers the bucket.
The cluster master checks the bucket into management and verifies it (MD5 checksum, looks for collisions, etc.).
The cluster master acknowledges the bucket and begins replication tasks to distribute the data among the other indexers.
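The register-verify-acknowledge handshake above can be sketched as a toy model. All names here are illustrative, not actual Splunk internals; the real cluster master does far more than this:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Bucket:
    bucket_id: str
    data: bytes

@dataclass
class ClusterMaster:
    """Toy model of the master's bucket registry (illustrative only)."""
    registry: dict = field(default_factory=dict)

    def register(self, bucket: Bucket) -> bool:
        # Step 2: verify the bucket (checksum) and look for ID collisions
        checksum = hashlib.md5(bucket.data).hexdigest()
        if bucket.bucket_id in self.registry:
            return False  # collision: this bucket ID is already registered
        self.registry[bucket.bucket_id] = checksum
        return True  # Step 3: master would now schedule replication tasks

cm = ClusterMaster()
b = Bucket("db_1700000000_1699990000_0", b"raw events")
assert cm.register(b)      # first registration succeeds
assert not cm.register(b)  # duplicate is rejected as a collision
```

The point of the sketch is that every bucket goes through this handshake individually, which is why a large backlog takes time to drain.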
When you do a large upgrade, maintenance mode can last a long time. During that time a backlog of buckets builds up waiting to be "fixed up". Once maintenance mode is lifted, the entire indexing cluster begins these tasks. However, your indexers will probably be busy with other tasks as well, like indexing new data, replicating new data, fulfilling searches, etc. This means you have resource contention, and that takes time to work through.
The good news is that if the fix-up queue is getting smaller with time, your cluster is working perfectly fine. It’s just busy.
If on the other hand the fix-up queue is stalling or getting larger, you should engage Splunk Support.
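The shrinking-vs-stalling test above is easy to automate: sample the fix-up queue size over time (e.g. from the master's monitoring console) and check the trend. The sample numbers below are made up for illustration:

```python
def backlog_trend(samples):
    """Classify a series of fix-up queue sizes sampled over time."""
    if samples[-1] < samples[0]:
        return "draining"  # cluster is busy but working through the backlog
    if samples[-1] > samples[0]:
        return "growing"   # stalling or growing: engage Splunk Support
    return "stalled"

# Hypothetical queue sizes sampled every few minutes after lifting maintenance mode:
print(backlog_trend([8700, 6100, 3400, 900]))  # draining
```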
I asked -
-- Once the maintenance mode is lifted, the buckets need to “fix up”.
During this maintenance mode period of around 40 minutes in our case, all 145 indexers were shut down, upgraded, and restarted. Which buckets, at that point, need to be fixed up?
And the answer by the SE is -
All data that was in a hot bucket when the maintenance mode started and/or any data indexed after maintenance mode started. When you restart you also automatically create new hot buckets, so that has to happen too.
So minimum fix-up count =
(number of indexers) × (2 hot-bucket rolls) × (number of indexes)
That's just the base number. The longer maintenance mode lasts, the more buckets there are, but that's harder to predict.
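As a quick sanity check, that lower bound is easy to compute. The 30-index figure below is an assumed example, not a number from this thread:

```python
def min_fixup_count(num_indexers: int, num_indexes: int,
                    hot_rolls_per_indexer: int = 2) -> int:
    """Lower bound on buckets needing fix-up after maintenance mode:
    every indexer rolls its hot buckets at shutdown and creates fresh
    ones at restart (2 rolls), for each index it holds."""
    return num_indexers * hot_rolls_per_indexer * num_indexes

# 145 indexers (as in this upgrade) with a hypothetical 30 indexes:
print(min_fixup_count(145, 30))  # 8700
```

Even this floor runs into the thousands at that scale, before counting any data indexed while maintenance mode was on.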
I would say -
-- Frankly, exposing underlying implementation details of the software and asking us to understand them (like TTL) is an odd approach - this is a software deficiency. As for replication, the software should offer a more granular replication process, but right now it's all or nothing. If replication were more gradual and manageable, we would have been fine.
There are a couple of things here, but what bothers me is -
1) the excessive number of buckets to be "fixed up".
2) the lack of control over the replication process, especially after such an upgrade.