We have 8 Splunk indexers in our environment (2 sites).
One indexer server needs to be serviced: update the BIOS, RAID controller firmware, and iLO firmware.
What's the best practice in this kind of case? Should we run
splunk disable boot-start
in case the service task requires numerous reboots during the maintenance? Thank you in advance.
Hey @mlevsh
8 indexers (2 sites): I am assuming a multisite indexer cluster?
If you are not modifying or altering the application (for example, a Splunk upgrade), you can simply use the offline command, and yes, it's safe to temporarily disable boot-start if the patching requires multiple reboots.
Do you have a way to back up data from that indexer? Not that you're required to back it up, but it's always safe to have a copy while working on these machines. Since it's a multisite cluster, the data should be mirrored on the second site; check the data sizes to confirm.
But yeah, the offline command and temporarily disabling boot-start should be sufficient.
Thanks,
Raghav
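To make the sequence concrete, here is a rough sketch of the steps on the indexer being serviced. The `/opt/splunk` path is an assumption (adjust to your install location), and this is only an outline of the approach described above, not an official procedure:

```shell
# On the indexer to be serviced (path assumes a default /opt/splunk install)
/opt/splunk/bin/splunk offline              # gracefully remove this peer from the cluster
/opt/splunk/bin/splunk disable boot-start   # keep Splunk from starting on each reboot

# ... perform BIOS / RAID controller / iLO updates, rebooting as needed ...

# When maintenance is finished:
/opt/splunk/bin/splunk enable boot-start    # restore start-on-boot
/opt/splunk/bin/splunk start                # start Splunk; the peer rejoins the cluster
```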
@Raghav2384 , thank you!
One verification: do we need to extend the restart period by running "splunk edit cluster-config -restart_timeout " on the cluster master? Let's say we put in 7200 seconds for 2 hours.
Also, we don't really know how long the maintenance will actually take. How should that situation be handled?
I do not think that is required. Since you are working on only one of the 8 indexers, I would just put that indexer in offline mode and disable boot-start. We applied OS patches on our 33-indexer cluster by taking one indexer offline at a time.
It did take a lot of time, but I never had to touch my master or the configurations on the master once during the process. As long as the work you are about to do doesn't alter the application or configs, you should be alright.
I apologize for the delay.
Thanks,
Raghav
Raghav,
thank you again for your reply!
I'm a little concerned though. Our system has the "restart_timeout" value set to 60 seconds. The last time our Unix SAs applied an update, it took them about 3 hours because some issues occurred. We've never run splunk offline before. If the master expects the peer back after the "restart_timeout" of 60 seconds and we need 3 hours, what would the impact on our system be?
@mlevsh - What did you finally end-up doing?
After the peer shuts down, you have 60 seconds (by default) to complete any maintenance work and bring the peer back online. If the peer does not return to the cluster within this time, the master initiates bucket-fixing activities to return the cluster to a complete state.
The master does not start bringing the peer online itself; it only begins the bucket-fixing work.
Also, when maintenance mode is enabled, restart_timeout doesn't matter, since maintenance mode prevents any bucket-fixup activity.
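For completeness, maintenance mode is toggled on the cluster master. A minimal sketch (the /opt/splunk path is an assumption for your install):

```shell
# On the cluster master:
/opt/splunk/bin/splunk enable maintenance-mode    # suspend bucket-fixup activity
# ... take the peer offline and perform the hardware maintenance ...
/opt/splunk/bin/splunk disable maintenance-mode   # resume normal bucket fixup
/opt/splunk/bin/splunk show maintenance-mode      # verify the current state
```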