
Large number of buckets that need to be fixed after the upgrade to 7.1.7?

ddrillic
Ultra Champion

After the upgrade to 7.1.7 last night, we had 44K buckets under Fixup Tasks – Pending,
and that was seven hours after the upgrade.

What caused so many buckets to be in the fixup category?



DavidHourani
Super Champion

Hi @ddrillic,

Seems like you forgot to go into maintenance mode?
https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Upgradeacluster

Run splunk enable maintenance-mode on the master. To confirm that the master is in maintenance mode, run splunk show maintenance-mode. This step prevents unnecessary bucket fix-ups. 
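The sequence from the docs can be sketched as follows; the upgrade step itself is a placeholder, and this is only an outline of the documented commands, not a complete upgrade runbook:

```shell
# On the cluster master, before taking any peers down for the upgrade
splunk enable maintenance-mode

# Confirm the master is actually in maintenance mode before proceeding
splunk show maintenance-mode

# ... stop, upgrade, and restart each indexer peer here ...

# Once every peer is back up, lift maintenance mode so fix-up tasks can run
splunk disable maintenance-mode
```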

ddrillic
Ultra Champion

@DavidHourani - we have been around for a bit ;-) and we upgraded around 150 physical indexers - maintenance mode was set.

DavidHourani
Super Champion

Yeah, I did want to include a note (I know you wouldn't do that, just trying my luck here) lol.
How much time did you spend in maintenance mode?


ddrillic
Ultra Champion

That's interesting - I would say around 40 minutes...


ddrillic
Ultra Champion

The SE said -

-- When the Splunk indexing cluster is put into maintenance mode (a prerequisite for upgrades), replication of buckets between indexers stops.

Once maintenance mode is lifted, the buckets need to be “fixed up”. From a technical perspective this means:
1) The indexer contacts the cluster master and registers the bucket.
2) The cluster master checks the bucket into management and verifies it (MD5 checksum, looks for collisions, etc.).
3) The cluster master acknowledges the bucket and begins replication tasks to distribute the data among the other indexers.

When you do a large upgrade, maintenance mode can last a long time. During that time a backlog of buckets builds up waiting to be “fixed up”. Once maintenance mode is lifted, the entire indexing cluster begins these tasks. However, your indexers will probably be busy with other tasks as well, like indexing new data, replicating new data, fulfilling searches, etc. This means that you have resource contention, and that takes time to work through.

The good news is that if the fix-up queue is getting smaller with time, your cluster is working perfectly fine. It’s just busy.
If on the other hand the fix-up queue is stalling or getting larger, you should engage Splunk Support.
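The SE's rule of thumb (shrinking queue means the cluster is fine, stalled or growing means call Support) can be sketched as a small check over periodic samples of the pending-fixup count. The sample values below are hypothetical, not taken from the thread:

```python
def fixup_queue_health(samples):
    """Classify a series of pending-fixup counts sampled over time.

    Per the rule of thumb above: a queue that shrinks over time means the
    cluster is just busy working through its backlog; a queue that stalls
    or grows means it is time to engage Splunk Support.
    """
    if len(samples) < 2:
        return "need more samples"
    if samples[-1] < samples[0]:
        return "draining - cluster is busy but healthy"
    return "stalled or growing - engage Splunk Support"

# Hypothetical hourly samples of Fixup Tasks - Pending after the upgrade
print(fixup_queue_health([44000, 39500, 31000, 22000]))  # draining
print(fixup_queue_health([44000, 44100, 44300]))         # stalled or growing
```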


ddrillic
Ultra Champion

I asked -

-- Once maintenance mode is lifted, the buckets need to be “fixed up”.

During this maintenance mode period of around 40 minutes in our case, all 145 indexers were shut down, upgraded, and restarted. Which buckets, at that point, need to be “fixed up”?

And the answer by the SE is -

All data that was in a hot bucket when the maintenance mode started and/or any data indexed after maintenance mode started. When you restart you also automatically create new hot buckets, so that has to happen too.

So minimum fixup count =
(number of indexers) × (2 hot bucket rolls) × (number of indexes)

That's just the base number. The longer maintenance mode lasts, the more buckets pile up, but that's harder to predict.
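The SE's base formula can be written as a one-liner. The index count below is a hypothetical illustration (the thread never states how many indexes the cluster has), chosen to show how the observed 44K figure is plausible:

```python
def min_fixup_count(num_indexers: int, num_indexes: int, hot_rolls: int = 2) -> int:
    """Lower bound on pending fixup tasks after maintenance mode is lifted:
    each indexer rolls hot buckets twice (once when maintenance mode starts
    or the peer shuts down, once again on restart) for every index it holds.
    """
    return num_indexers * hot_rolls * num_indexes

# The poster's 145 indexers with, say, 150 indexes (hypothetical count)
print(min_fixup_count(145, 150))  # 43500 - in the ballpark of the 44K observed
```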


ddrillic
Ultra Champion

I would say -

-- Truly, exposing underlying implementation details of the software and asking us to understand them (like TTL) is a funny approach - this is a software deficiency. As for replication, the software should provide a more granular replication process, but it's black and white. If replication were more gradual and manageable, we would have been fine.


DavidHourani
Super Champion

Yep, totally agree with you. Have a read here: https://docs.splunk.com/Documentation/Splunk/7.3.0/Indexer/Usemaintenancemode#The_effect_of_maintena...

Too many cases where fix-ups are inevitable.


ddrillic
Ultra Champion

Right @DavidHourani.

There are a couple of things here, but what bothers me is -

1) the excessive number of bucket fix-ups.
2) the lack of control over the replication process, especially after such an upgrade.

DavidHourani
Super Champion

Yeah, I had a lot of clients complain about point 2 especially. It can kill network links in some cases, and there's no way to put it on hold.

ddrillic
Ultra Champion

No doubt - scary thing this replication process is sometimes.
