We have 8 Splunk indexers in our environment (2 sites).
One indexer server needs to be serviced: update the BIOS, RAID controller firmware and iLO firmware.
What's the best business practice in this kind of case?
Should we run "splunk disable boot-start" in case this service task requires numerous reboots during the maintenance?
Thank you in advance
8 indexers (2 sites): I am assuming this is a multisite indexer cluster?
If you are not modifying or altering the application (for example, a Splunk upgrade), you could simply use the offline command, and yes, it's safe to temporarily disable boot start if the patching requires multiple reboots.
Do you have somewhere to back up the data from that indexer? Not that a backup is required, but it is always safer to have a copy while working on these machines. Since it's a multisite cluster, the data is hopefully mirrored on the second site; check the data size.
But yeah, offline command and temporarily disabling boot start should be sufficient.
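To make that sequence concrete, here is a rough sketch of what I would run on the indexer itself. The /opt/splunk default for SPLUNK_HOME, the DRY_RUN switch and the function names are my own assumptions for illustration, so adjust them to your install. By default it only prints the commands. Note that "splunk offline" may prompt for admin credentials, and boot-start changes normally need root.

```shell
#!/bin/sh
# Sketch only. SPLUNK_HOME default and DRY_RUN are assumptions, not Splunk settings.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Before the hardware work: leave the cluster gracefully, then keep
# Splunk from starting on every firmware reboot.
peer_down() {
    run "$SPLUNK_HOME/bin/splunk" offline
    run "$SPLUNK_HOME/bin/splunk" disable boot-start
}

# After the last reboot: restore boot-start and rejoin the cluster.
# If Splunk runs as a non-root user, re-enable boot-start with the same
# options that were used originally.
peer_up() {
    run "$SPLUNK_HOME/bin/splunk" enable boot-start
    run "$SPLUNK_HOME/bin/splunk" start
}

# Usage: script.sh down   (before maintenance)
#        script.sh up     (after maintenance)
case "${1:-}" in
    down) peer_down ;;
    up)   peer_up ;;
esac
```

Running it with DRY_RUN=1 first lets you review the exact commands before touching the peer.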
@Raghav2384 , thank you!
One verification: do we need to extend the restart period by running "splunk edit cluster-config -restart_timeout" on the cluster master? Let's say we set it to 7200 seconds for 2 hours.
Also, we don't really know how long the maintenance will take, so we don't know what value to give "splunk edit cluster-config -restart_timeout". How should this situation be handled?
I do not think that is required. Since you are working on only one out of 8 indexers, I would just put that indexer in offline mode and disable boot start. We applied OS patches on our 33-indexer cluster by taking one indexer offline at a time.
It did take a lot of time, but I never once had to touch my master or the configurations on the master during the process. As long as the work you are about to do doesn't alter the application or configs, you should be alright.
I apologize for the delay.
Thank you again for your reply!
I'm a little concerned, though. Our system has the "restart_timeout" value set to 60 seconds. Last time our Unix SAs applied the update, it took them about 3 hrs because some issues occurred. We have never run splunk offline before. If it starts to bring the peer online after the "restart_timeout" value, which is 60 sec, and we need 3 hrs, what would be the impact on our system?
After the peer shuts down, you have 60 seconds (by default) to complete any maintenance work and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state.
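For reference, if you did want to widen that window rather than let fix-up start, the value is given in seconds and set on the cluster master. Here is a small sketch that converts a maintenance window in hours into the command from earlier in the thread. The function name and the hours-based input are just my illustration; the script only prints the command and does not run it.

```shell
#!/bin/sh
# Sketch: print the cluster master command for a maintenance window.
# restart_timeout_cmd is a hypothetical helper, not part of Splunk.
restart_timeout_cmd() {
    hours="$1"
    seconds=$((hours * 3600))   # restart_timeout is expressed in seconds
    echo "splunk edit cluster-config -restart_timeout $seconds"
}

restart_timeout_cmd 2   # the 2-hour window from the question
restart_timeout_cmd 3   # the 3 hours the previous update took
```

Since nobody knows in advance how long the work will take, any number here is a guess, which is part of why I would not rely on it alone.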
It does not start bringing peers online ...
Also, when maintenance mode is enabled, restart_timeout doesn't matter, since maintenance mode avoids any bucket fixup activity.
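A minimal sketch of the maintenance mode side, run on the cluster master and not on the indexer. As before, the /opt/splunk path, the DRY_RUN switch and the function names are assumptions of mine, and by default the script only prints what it would do. The enable command may ask for confirmation when run for real.

```shell
#!/bin/sh
# Sketch only, for the cluster master. SPLUNK_HOME default and DRY_RUN
# are assumptions for illustration.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Before taking the peer offline: enable maintenance mode and confirm it.
maintenance_begin() {
    run "$SPLUNK_HOME/bin/splunk" enable maintenance-mode
    run "$SPLUNK_HOME/bin/splunk" show maintenance-mode
}

# After the peer has rejoined: disable maintenance mode and check the cluster.
maintenance_end() {
    run "$SPLUNK_HOME/bin/splunk" disable maintenance-mode
    run "$SPLUNK_HOME/bin/splunk" show cluster-status
}

# Usage: script.sh begin   (before the peer goes offline)
#        script.sh end     (after the peer is back)
case "${1:-}" in
    begin) maintenance_begin ;;
    end)   maintenance_end ;;
esac
```

Remember to disable maintenance mode once the indexer is back, otherwise the cluster will keep skipping fix-up for real failures too.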