Large amount of buckets that need to be fixed after the upgrade to 7.1.7?

ddrillic · ‎06-19-2019

After the upgrade to 7.1.7 last night, we had 44K buckets under Fixup Tasks – Pending,
and that was seven hours after the upgrade.

What caused so many buckets to end up needing fix-up?

ddrillic · ‎06-20-2019

A related question: "Is it possible during an upgrade to postpone the replication by a day or so?"

Another related question: "Can we use the two sites of a multisite indexer cluster to improve the Splunk upgrade?"

DavidHourani · ‎06-19-2019

Hi @ddrillic,

Seems like you forgot to go into maintenance mode?
https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Upgradeacluster

Run splunk enable maintenance-mode on the master. To confirm that the master is in maintenance mode, run splunk show maintenance-mode. This step prevents unnecessary bucket fix-ups.

ddrillic · ‎06-19-2019

@DavidHourani - we have been around for a bit ;-) and we upgraded around 150 physical indexers - maintenance mode was set.

DavidHourani · ‎06-19-2019

Yeah, I did want to include a note: (I know you wouldn't do that, just trying my luck here) lol
How much time did you spend in maintenance mode?

ddrillic · ‎06-19-2019

That's interesting - I would say around 40 minutes...

ddrillic · ‎06-19-2019

The SE said -

-- When the Splunk indexing cluster is put into maintenance mode (a prerequisite for upgrades), replication of buckets between indexers stops.

Once the maintenance mode is lifted, the buckets need to “fix up”. From a technical perspective this means:
1) The indexer contacts the cluster master and registers the bucket.
2) The cluster master checks the bucket into management and verifies it (MD5 checksum, looks for collisions, etc.).
3) The cluster master acknowledges the bucket and begins replication tasks to distribute the data between other indexers.

When you do a large upgrade, maintenance mode can last a long time. During that time a backlog of buckets builds up, waiting to be “fixed up”. Once the maintenance mode is lifted, the entire indexing cluster begins these tasks. However, your indexers will probably be busy with other tasks as well, like indexing new data, replicating new data, fulfilling searches, etc. This means that you have resource contention, and that takes time to work through.

The good news is that if the fix-up queue is getting smaller with time, your cluster is working perfectly fine. It’s just busy.
If on the other hand the fix-up queue is stalling or getting larger, you should engage Splunk Support.
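
The shrinking-versus-stalling check above can be sketched in a few lines of Python. This is only an illustration, not a Splunk API: the function name, the 2% tolerance, and the sample numbers are all assumptions, and the counts are whatever you read off Fixup Tasks – Pending at regular intervals.

```python
from typing import Sequence


def classify_fixup_trend(pending_counts: Sequence[int], tolerance: float = 0.02) -> str:
    """Classify a series of pending fix-up counts, oldest first.

    Hypothetical helper: compares the first and last sample and treats a
    change smaller than `tolerance` (as a fraction of the first sample)
    as no real movement.
    """
    if len(pending_counts) < 2:
        raise ValueError("need at least two samples")

    first, last = pending_counts[0], pending_counts[-1]
    if last == 0:
        return "empty"  # nothing left to fix up

    change = last - first
    threshold = tolerance * max(first, 1)

    if change < -threshold:
        return "shrinking"  # the cluster is just busy
    if change > threshold:
        return "growing"  # time to engage Splunk Support
    return "stalled"  # also a reason to engage Splunk Support


if __name__ == "__main__":
    # Made-up samples starting from the 44K mentioned in this thread.
    print(classify_fixup_trend([44000, 41000, 37500]))
```

Comparing only the first and last sample keeps the sketch simple; a noisy queue might deserve a longer window or a moving average.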

ddrillic · ‎06-20-2019

I asked -

-- Once the maintenance mode is lifted, the buckets need to “fix up”.

During this maintenance mode period of around 40 minutes in our case, all 145 indexers were shut down, upgraded and restarted. Which buckets, at that point, need to be “fixed up”?

And the answer by the SE is -

All data that was in a hot bucket when the maintenance mode started and/or any data indexed after maintenance mode started. When you restart you also automatically create new hot buckets, so that has to happen too.

So minimum fixup count =
(number of indexers) × (2 hot bucket rolls) × (number of indexes)

That's just the base number. The longer maintenance mode lasts, the more buckets there are, but that's harder to predict.
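
To put a number on the SE's rule of thumb, here is a minimal Python sketch. The function name is made up for illustration, the 145 indexers come from this thread, and the 100 indexes is purely an assumed figure since the thread never states the real count.

```python
def min_fixup_count(num_indexers: int, num_indexes: int, hot_rolls: int = 2) -> int:
    """Lower bound on buckets needing fix-up after a full cluster restart.

    Implements the SE's rule of thumb:
    indexers x hot bucket rolls x indexes. It ignores anything indexed
    while maintenance mode drags on, so the real count can be higher.
    """
    if num_indexers < 0 or num_indexes < 0 or hot_rolls < 0:
        raise ValueError("counts must be non-negative")
    return num_indexers * hot_rolls * num_indexes


if __name__ == "__main__":
    # 145 indexers is from the thread; 100 indexes is an assumption.
    print(min_fixup_count(145, 100))  # prints 29000
```

Under those assumptions the baseline alone is 29,000 buckets, so a backlog in the tens of thousands is the expected order of magnitude for a cluster this size.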

ddrillic · ‎06-19-2019

I would say -

-- Truly, exposing underlying implementation details of the software and asking us to understand them (like TTL) is a funny approach - this is a software deficiency. As for replication, the software should provide a more granular replication process, but it's black and white. If replication were more gradual and manageable, we would have been fine.

DavidHourani · ‎06-19-2019

Yep, totally agree with you. Have a read here: https://docs.splunk.com/Documentation/Splunk/7.3.0/Indexer/Usemaintenancemode#The_effect_of_maintena...

Too many cases where fix-ups are inevitable.

Gregski11 · ‎06-22-2023

I don't understand the point of this link; it literally just takes us to the "Use maintenance mode" section of the online Managing Indexers and Clusters of Indexers manual.

ddrillic · ‎06-19-2019

Right @DavidHourani.

There are a couple of things here, but what bothers me is -

1) the excessive number of buckets that need to be “fixed up”.
2) the lack of control over the replication process, especially after such an upgrade.

DavidHourani · ‎06-19-2019

Yeah, I had a lot of clients complain about point 2 especially. That can kill network links in some cases, and there is no way to put it on hold.

ddrillic · ‎06-19-2019

No doubt - scary thing this replication process is sometimes.
