I’m looking for best-practice guidance on the order of operations for bringing down a distributed Splunk environment on Linux, and then the order in which to bring the servers back up. I am okay with a period of downtime to allow for operating system patching, server reboots, and so on, but I want to avoid any corruption or orphaned data caused by bringing Splunk nodes down in the wrong order or without notifying the master servers. I don’t want my servers struggling to maintain replication and search factors in a way that leads to orphaned data or problems starting services.
In short, I am looking to:
• Understand all the dependencies and orders of operation
• Script a graceful shutdown of the Splunk environment
• Do whatever maintenance is called for which could include rebooting the servers
• Script a graceful startup of the environment (or in the case of reboots, determine the correct order to start servers with boot-start enabled)
Here is my distributed environment for reference:
• Deployment server / license server
• Search head cluster deployer
• Multiple search heads
• Index cluster master / Distributed management console
• Multiple indexers
• Heavy forwarders
For your indexer cluster, I would follow the upgrade steps outlined in the documentation to take the whole cluster down.
The deployment server, license server, forwarders, and search head cluster deployer can go down whenever; they have no implications for your cluster health.
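Since the question asks about scripting this, here is a minimal dry-run sketch in Python that only builds the list of commands in a given order and does not execute anything. The install path `/opt/splunk/bin/splunk`, the host names, the role names, and above all the order in `stop_order` are illustrative assumptions, not a documented sequence; take the real order for the indexer cluster from the upgrade steps in the documentation.

```python
from typing import Dict, List

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumed default install path


def build_plan(hosts_by_role: Dict[str, List[str]],
               role_order: List[str],
               action: str) -> List[str]:
    """Return the ssh commands that would run `splunk <action>`, role by role."""
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    # Refuse to build a plan that silently skips a role.
    missing = set(hosts_by_role) - set(role_order)
    if missing:
        raise ValueError(f"roles missing from role_order: {sorted(missing)}")
    plan = []
    for role in role_order:
        for host in hosts_by_role.get(role, []):
            plan.append(f"ssh {host} {SPLUNK_BIN} {action}")
    return plan


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own hosts.
    hosts = {
        "heavy_forwarder": ["hf01"],
        "search_head": ["sh01", "sh02"],
        "indexer": ["idx01", "idx02"],
        "cluster_master": ["cm01"],
    }
    # Illustrative order only; confirm it against the documentation.
    stop_order = ["heavy_forwarder", "search_head", "indexer", "cluster_master"]
    for cmd in build_plan(hosts, stop_order, "stop"):
        print(cmd)
    # Assumption: startup is simply the reverse of shutdown.
    for cmd in build_plan(hosts, list(reversed(stop_order)), "start"):
        print(cmd)
```

Printing the plan first makes it easy to review the order before wiring the commands into anything that actually runs them.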
If you need to do maintenance on indexer peers without affecting your search availability, follow the procedures documented here.
You can also use maintenance mode for any kind of maintenance work. It will suspend all bucket fixing activities by the cluster master, but you need to remember to disable it after you are done.
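To make the "remember to disable it" part harder to forget, the enable/disable pair can be wrapped so that the disable always runs, even if the maintenance step fails. This is only a sketch to run on the cluster master: the install path is an assumption, `--answer-yes` is assumed to suppress the confirmation prompt on enable, and the CLI session is assumed to be authenticated already (otherwise credentials would have to be supplied).

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, List

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumed default install path


def run_cli(args: List[str]) -> None:
    """Run one Splunk CLI command and raise if it exits non-zero."""
    subprocess.run([SPLUNK_BIN, *args], check=True)


@contextmanager
def maintenance_mode(run: Callable[[List[str]], None] = run_cli):
    """Enable maintenance mode, then always disable it on exit."""
    # Assumption: --answer-yes skips the interactive confirmation.
    run(["enable", "maintenance-mode", "--answer-yes"])
    try:
        yield
    finally:
        # Runs on success and on failure alike.
        run(["disable", "maintenance-mode"])
```

Usage would look like `with maintenance_mode(): ...`, with the patching or restart steps inside the block. The `run` parameter is there so the wrapper can be exercised with a stub before pointing it at a real cluster master.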
Besides a Splunk upgrade of your indexer cluster, I cannot think of any requirement to take all Splunk components down simultaneously (unless I misunderstood you). In fact, I would try to avoid that, as your forwarders will fall behind while your indexing tier is unavailable to process inbound events. If at all possible, maintain your ability to process inbound events, even while doing certain maintenance activities in your environment. A distributed deployment allows you to achieve that.
I hope I didn't completely misunderstand your question... 😉
To add to this, your DS / LM / HF can be taken down without any effect on your search or indexing tier (aside from the associated data ingestion). So typically, the maintenance windows for those roles can be whenever.
You do need to pay attention to version compatibility. Compatibility between forwarders and indexers typically isn't an issue, but the search tier to indexing tier does have some constraints.
Forwarder Requirements and Compatibility -
Distributed Search Requirements -
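As a rough pre-check before patching, the search-tier constraint can be tested in a few lines. The rule encoded here, that every search head should run the same or a later version than every indexer it searches, is my reading of the requirement and should be verified against the Distributed Search Requirements page for your release. The function names and version strings are hypothetical.

```python
from typing import Iterable, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '8.2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def search_tier_ok(search_head_versions: Iterable[str],
                   indexer_versions: Iterable[str]) -> bool:
    """True if the oldest search head is at least as new as the newest indexer."""
    oldest_search_head = min(parse_version(v) for v in search_head_versions)
    newest_indexer = max(parse_version(v) for v in indexer_versions)
    return oldest_search_head >= newest_indexer
```

Comparing tuples of integers rather than raw strings avoids the trap where "8.10" sorts before "8.9".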