OS Patching Indexer Cluster and bucket fixups

jarush · ‎04-14-2020

We're struggling to do OS patching of our indexer cluster in a reasonable timeframe. It currently takes about 24 hours with the vast majority of that time just waiting on bucket fixups tasks to complete between reboots. Wondering how others are doing it without impacting any searching or filling up index process queues. Our current process:

Blast out an apt update && apt upgrade -y && apt autoremove -y to all indexes. Takes about 10 - 15 minutes to complete
Blast out a puppet no-noop to all indexers - takes about 5 minutes to complete
The for each indexer:
splunk offline - takes 5 - 10 minutes
reboot - takes < 30 seconds
Wait for bucket fix ups to complete - around 30 minutes

We've had issues using the rolling restart - it sometimes gets stuck in the middle and you have to bounce the cluster master and it doesn't resume where it left off. It also by default defers saved searches, which effectively disables alerting in our environment for a few hours (we are enabling running saved searches during rolling restarts to address this). Does this just work out of the box for others or are there secret "gold" settings that you've had to tweak?

Some information about our environment:

2 sites
24 indexers per site
Splunk 7.3.3
buckets only replicate between sites, no intra-site replication factor. Wondering if this is contributing to our problems... we've been looking into increasing storage to account for this.
5ms between sites, multi-10gbit links

dxu_splunk · ‎04-22-2020

this shouldve been something we fixed/alleviated in 7.3.2. most likely these were unclean bucket replications - where a StreamingError happened on a hot bucket target, which made the indexers throw one copy away, and triggered a fixup to re-replicate the missing bucket...

one thing to try to pinpoint this is to pick any bucket that needed fixups, and see what happened to that bucket throughout the process.

jarush · ‎04-17-2020

We did a rolling reboot last night to change some indexes and splunk aggressively rolled through and bounced all 48 in about an hour, leaving about 50k fixups on the queue.
So that doesn't seem to be our answer....

OS Patching Indexer Cluster and bucket fixups

indexer

indexer clustering

Observe and Secure All Apps with Splunk

Splunk Decoded: Business Transactions vs Business IQ

Fastest way to demo Observability

Are you a member of the Splunk Community?

OS Patching Indexer Cluster and bucket fixups

indexer

indexer clustering

Observe and Secure All Apps with Splunk

Splunk Decoded: Business Transactions vs Business IQ

Fastest way to demo Observability