Deployment Architecture

OS Patching Indexer Cluster and bucket fixups

jarush
Explorer

We're struggling to do OS patching of our indexer cluster in a reasonable timeframe. It currently takes about 24 hours, with the vast majority of that time spent just waiting on bucket fixup tasks to complete between reboots. Wondering how others are doing it without impacting searches or filling up the index processing queues. Our current process:

  1. Blast out an apt update && apt upgrade -y && apt autoremove -y to all indexers. Takes about 10 - 15 minutes to complete
  2. Blast out a puppet no-noop to all indexers - takes about 5 minutes to complete
  3. Then, for each indexer:
       a. splunk offline - takes 5 - 10 minutes
       b. reboot - takes < 30 seconds
       c. Wait for bucket fixups to complete - around 30 minutes
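
The per-indexer part of this process can be sketched as a small loop. This is only an illustration, not the poster's actual tooling: `patch_indexers`, `run_on_host`, and `pending_fixups` are hypothetical names, and how commands reach a host or how the number of outstanding fixups is read from the cluster master is left to the caller.

```python
import time
from typing import Callable, Iterable, List


def patch_indexers(
    indexers: Iterable[str],
    run_on_host: Callable[[str, str], None],   # hypothetical: runs a command on a host
    pending_fixups: Callable[[], int],         # hypothetical: outstanding fixup count
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Offline, reboot, and wait for fixups, strictly one indexer at a time."""
    done: List[str] = []
    for host in indexers:
        run_on_host(host, "splunk offline")    # take the peer out of the cluster
        run_on_host(host, "reboot")            # pick up the patched packages
        # Do not touch the next indexer until the fixup queue has drained.
        while pending_fixups() > 0:
            sleep(poll_seconds)
        done.append(host)
    return done
```

Because the wait sits inside the loop, the total run time grows with the number of indexers multiplied by the fixup wait, which is where most of the time goes in the process described above.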

We've had issues using the rolling restart - it sometimes gets stuck in the middle, and when you bounce the cluster master to recover it doesn't resume where it left off. By default it also defers saved searches, which effectively disables alerting in our environment for a few hours (we are enabling saved searches to run during rolling restarts to address this). Does this just work out of the box for others, or are there secret "gold" settings that you've had to tweak?

Some information about our environment:

  • 2 sites
  • 24 indexers per site
  • Splunk 7.3.3
  • Buckets only replicate between sites; there is no intra-site replication factor. Wondering if this is contributing to our problems... we've been looking into increasing storage to account for this.
  • 5 ms latency between sites, multiple 10 Gbit links
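
As a rough sanity check on the 24-hour figure, the per-step times quoted above can be multiplied out across all indexers. The numbers below are the approximate values from this post; the function name `serial_hours` is just for illustration, and the calculation assumes indexers are handled strictly one at a time.

```python
# Figures quoted in the post; the per-step times are rough estimates.
SITES = 2
INDEXERS_PER_SITE = 24
FIXUP_WAIT_MIN = 30          # "around 30 minutes"
OFFLINE_MIN = (5, 10)        # "5 - 10 minutes"


def serial_hours(minutes_per_indexer: float, indexers: int) -> float:
    """Hours spent on one step when indexers are handled one at a time."""
    return minutes_per_indexer * indexers / 60


indexers = SITES * INDEXERS_PER_SITE
fixup_hours = serial_hours(FIXUP_WAIT_MIN, indexers)
offline_hours = tuple(serial_hours(m, indexers) for m in OFFLINE_MIN)

print(f"{indexers} indexers, fixup wait: {fixup_hours:.0f} h")
print(f"splunk offline: {offline_hours[0]:.0f} - {offline_hours[1]:.0f} h")
```

With these inputs the fixup wait alone comes to 24 hours across 48 indexers, so shortening or overlapping that wait is where any real saving would have to come from.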

dxu_splunk
Splunk Employee

This should've been something we fixed/alleviated in 7.3.2. Most likely these were unclean bucket replications, where a StreamingError happened on a hot bucket target, which made the indexers throw one copy away and triggered a fixup to re-replicate the missing bucket...

one thing to try to pinpoint this is to pick any bucket that needed fixups, and see what happened to that bucket throughout the process.
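
One way to follow a single bucket is to pull every log line that mentions its bucket ID and read them in time order. The sketch below is a generic text filter, not a Splunk API: `bucket_timeline` is a hypothetical helper, and it assumes each log line begins with a timestamp that sorts correctly as plain text.

```python
from typing import Iterable, List


def bucket_timeline(lines: Iterable[str], bucket_id: str) -> List[str]:
    """Return the lines that mention bucket_id, ordered by their leading timestamp."""
    # Assumption: lines start with a timestamp, so a plain string sort
    # puts them in chronological order.
    hits = [line.rstrip("\n") for line in lines if bucket_id in line]
    return sorted(hits)
```

Fed with log lines collected from the cluster master and the affected peers, this gives a single chronological view of what happened to that one bucket.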

jarush
Explorer

We did a rolling restart last night to change some indexes, and Splunk aggressively rolled through and bounced all 48 indexers in about an hour, leaving about 50k fixups on the queue.
So that doesn't seem to be our answer....
