We're struggling to get OS patching of our indexer cluster done in a reasonable timeframe. It currently takes about 24 hours, with the vast majority of that time spent just waiting on bucket fixup tasks to complete between reboots. Wondering how others are doing this without impacting searching or filling up indexing queues. Our current process:
We've had issues with the rolling restart: it sometimes gets stuck partway through, and after you bounce the cluster master it doesn't resume where it left off. By default it also defers saved searches, which effectively disables alerting in our environment for a few hours (we're enabling saved searches to run during rolling restarts to address this; see the settings sketch below). Does this just work out of the box for others, or are there secret "gold" settings you've had to tweak?
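For reference, the knobs we're looking at are roughly the ones below. This is a sketch, not our confirmed config: the searchable rolling restart mode lives in server.conf on the cluster master, and the saved-search deferral behavior is controlled from savedsearches.conf, but the exact setting names and defaults are worth verifying against the .conf specs for your Splunk version.

```
# server.conf on the cluster master -- use searchable rolling restarts
# (CLI equivalent if you trigger it by hand:
#   splunk rolling-restart cluster-peers -searchable true)
[clustering]
rolling_restart = searchable

# savedsearches.conf -- keep scheduled searches running instead of deferring
# them during a searchable rolling restart; verify this setting name in
# savedsearches.conf.spec for your version before relying on it
[default]
defer_scheduled_searchable_idxc = false
```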
Some information about our environment:
This should've been something we fixed/alleviated in 7.3.2. Most likely these were unclean bucket replications, where a StreamingError happened on a hot bucket's replication target, which made the indexers throw one copy away and triggered a fixup to re-replicate the missing bucket...
One way to pinpoint this is to pick any bucket that needed fixups and trace what happened to that bucket throughout the process; something like the search sketched below would be a starting point.
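A rough sketch of that trace, assuming you can search _internal from the peers and the cluster master. The bucket ID is a placeholder in the usual <index>~<bucketid>~<origin_guid> form, and exact component names vary by version, so this starts broad:

```
index=_internal sourcetype=splunkd ("<index>~<bucketid>~<origin_guid>" OR "StreamingError")
| sort 0 _time
| table _time, host, component, log_level, _raw
```

That should show the bucket's replication and fixup lifecycle interleaved with any StreamingError events, in time order.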
We did a rolling reboot last night to change some indexes, and Splunk aggressively rolled through and bounced all 48 indexers in about an hour, leaving about 50k fixups in the queue.
So that doesn't seem to be our answer....
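For anyone comparing approaches: the indexer clustering docs recommend maintenance mode for planned peer restarts, since it halts most bucket fixup activity while peers bounce. A minimal sketch of that loop follows; the hostnames, patch command, and the wait between peers are placeholders, not a tested script or our actual process.

```bash
#!/usr/bin/env bash
# Rough sketch of a maintenance-mode patch loop. Peer names, paths, and the
# "wait for things to settle" logic are placeholders, not a tested script.
set -euo pipefail

MASTER=cm01.example.com          # cluster master (placeholder hostname)
PEERS=(idx01 idx02 idx03)        # indexer peers (placeholder hostnames)

# Halt most bucket fixup activity for the duration of the maintenance window.
ssh "$MASTER" '/opt/splunk/bin/splunk enable maintenance-mode --answer-yes'

for peer in "${PEERS[@]}"; do
    # Take the peer down gracefully so in-flight work can finish.
    ssh "$peer" '/opt/splunk/bin/splunk offline'

    # OS patching and reboot go here (placeholder command).
    ssh "$peer" 'sudo yum -y update && sudo reboot' || true

    # Wait for the peer to rejoin before moving on; in practice you would poll
    # "splunk show cluster-status" on the master instead of sleeping blindly.
    sleep 600
done

# Re-enable fixups once every peer is back and the cluster looks healthy.
ssh "$MASTER" '/opt/splunk/bin/splunk disable maintenance-mode'
```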