We're struggling to get OS patching of our indexer cluster done in a reasonable timeframe. It currently takes about 24 hours, with the vast majority of that time spent waiting on bucket fixup tasks to complete between reboots. Wondering how others are doing it without impacting searching or filling up indexing process queues. Our current process:
Blast out an apt update && apt upgrade -y && apt autoremove -y to all indexers - takes about 10-15 minutes to complete
Blast out a no-noop (i.e. enforcing) Puppet run to all indexers - takes about 5 minutes to complete
Then, for each indexer:
splunk offline - takes 5 - 10 minutes
reboot - takes < 30 seconds
Wait for bucket fixups to complete - around 30 minutes
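For what it's worth, our per-indexer loop is roughly the sketch below. The host names, the AUTH credential, and the REST path for pending fixups are assumptions for illustration - verify them against your own environment before running anything:

```shell
#!/usr/bin/env bash
# Sketch of the per-indexer patch loop described above.
# All hosts/credentials below are hypothetical placeholders.
CM="cm.example.com:8089"          # cluster manager (assumed host:port)
AUTH="${AUTH:-admin:changeme}"    # REST credentials (assumed)
INDEXERS=(idx01.example.com idx02.example.com)

# Count pending fixup tasks on the cluster manager. The REST path and
# JSON shape are assumptions from memory; check your Splunk version.
fixup_count() {
  curl -sk -u "$AUTH" \
    "https://$CM/services/cluster/master/fixup?level=generation&output_mode=json" |
    python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("entry", [])))'
}

patch_all() {
  for idx in "${INDEXERS[@]}"; do
    ssh "$idx" 'sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get autoremove -y'
    ssh "$idx" 'sudo /opt/splunk/bin/splunk offline'   # graceful exit from the cluster
    ssh "$idx" 'sudo reboot'
    sleep 120                                          # give the box time to come back up
    until [ "$(fixup_count)" -eq 0 ]; do               # wait for fixups to drain
      sleep 60
    done
  done
}

# patch_all   # uncomment to run against real hosts
```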
We've had issues using the rolling restart - it sometimes gets stuck in the middle, and after you bounce the cluster master it doesn't resume where it left off. It also defers saved searches by default, which effectively disables alerting in our environment for a few hours (we are enabling saved searches to run during rolling restarts to address this). Does this just work out of the box for others, or are there secret "gold" settings that you've had to tweak?
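In case it helps others, these are the settings we're looking at for that - names taken from the server.conf and savedsearches.conf spec files, so double-check them against your version:

```ini
# server.conf on the cluster manager
[clustering]
rolling_restart = searchable        # searchable rolling restart (7.1+)

# savedsearches.conf (scoped however makes sense for your apps)
[default]
defer_scheduled_searchable_idxc = false   # don't defer scheduled searches
                                          # during a searchable rolling restart
```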
Some information about our environment:
24 indexers per site
buckets only replicate between sites - there's no intra-site replication factor. We're wondering if this is contributing to our problems; we've been looking into increasing storage so we can add intra-site copies.
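To clarify the layout, our cluster manager config is along these lines (values illustrative, not our real config) - adding an intra-site copy would mean bumping the origin count, which is where the extra storage comes in:

```ini
# server.conf on the cluster manager (illustrative values)
[clustering]
mode = master
multisite = true
available_sites = site1,site2
# today: one copy at the originating site, second copy at the other site
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
# with an intra-site copy (more storage, but local failover):
# site_replication_factor = origin:2,total:3
```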
This should've been something we fixed/alleviated in 7.3.2. Most likely these were unclean bucket replications: a StreamingError happened on a hot bucket target, which made the indexer throw one copy away and triggered a fixup to re-replicate the missing bucket...
One thing to try, to pinpoint this, is to pick any bucket that needed fixups and trace what happened to that bucket throughout the process.
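A quick way to do that is to grep the bucket id out of splunkd.log on the peers and the manager and line the events up by timestamp. A minimal sketch - the log path is the default install location and the example bucket id format is from memory, so verify both locally:

```shell
#!/usr/bin/env bash
# Collect every log line mentioning one bucket id, sorted by timestamp,
# so you can follow the bucket's replication/fixup history end to end.
trace_bucket() {
  local bucket="$1"; shift
  # remaining args are log files; default to the local splunkd.log
  [ "$#" -eq 0 ] && set -- /opt/splunk/var/log/splunk/splunkd.log
  grep -h -- "$bucket" "$@" 2>/dev/null | sort
}

# Example (hypothetical bucket id, logs pulled from each peer):
# trace_bucket 'main~42~GUID' /tmp/idx*/splunkd.log
```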
We did a rolling restart last night to change some index settings, and Splunk aggressively rolled through and bounced all 48 indexers in about an hour, leaving about 50k fixups in the queue.
So that doesn't seem to be our answer....