I have a fairly sizeable environment on which I need to perform updates of the underlying Linux OS.
There will most probably be kernel updates, so I will need to reboot the machines to apply them.
So I'm wondering what precautions I have to take while restarting Splunk components.
To make the situation more interesting, my cluster has a replication factor of just one (don't ask).
I have an indexer cluster as well as a 3-member search-head cluster.
I will not be doing any Splunk software updates at the moment. It's at 8.1.something and for now it stays at that version.
Any advice on the order of upgrades/restarts?
If I understand correctly, I should be able to freely restart the master node, the deployer and the deployment server without affecting my environment.
Should I do anything to the search heads? I mean, will the captaincy migrate on its own if I take down the server that is currently the active captain, or should I force the captaincy transfer manually? Of course, after rebooting each machine I would wait for it to fully start and rejoin the SHC.
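If I end up wanting to move the captain off a member before rebooting it, my understanding is that it can be forced from the CLI; a rough sketch (hostname, port and credentials below are placeholders):

```shell
# Check the current captain and member health (run on any SHC member)
/opt/splunk/bin/splunk show shcluster-status -auth admin:changeme

# Force captaincy onto a specific member before rebooting the current captain.
# Run this on the member that should BECOME captain; the URI is a placeholder.
/opt/splunk/bin/splunk transfer shcluster-captain \
    -mgmt_uri https://sh2.example.com:8089 -auth admin:changeme
```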
In the case of the indexers, I'm fully aware that restarting a single indexer with a replication factor of one means my results will be incomplete during the downtime, but are there any possible issues beyond that? (I have 4 HFs load-balancing across all my indexers, so I understand I should have no problem with event loss.) I understand that I should run splunk offline before rebooting the machine, right? Anything else?
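The per-indexer cycle I have in mind, roughly sketched (assuming a default /opt/splunk install; the package-manager command depends on the distro):

```shell
# 1. Gracefully take the peer out of the cluster; this lets in-flight
#    searches finish and reassigns primaries where possible
/opt/splunk/bin/splunk offline

# 2. Apply the OS updates and reboot (distro-dependent; yum is an example)
sudo yum update -y && sudo reboot

# 3. After the box is back up, start Splunk if it is not set to autostart
/opt/splunk/bin/splunk start

# 4. On the cluster master, confirm the peer rejoined before the next one
/opt/splunk/bin/splunk show cluster-status
```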
Of course I will restart one machine at a time and wait for it to fully restart, rejoin and so on, wherever applicable.
Any other caveats?
I have done these as live updates, including the Splunk version. Unfortunately your RF/SF poses some challenges here and makes it impossible to do the indexer OS updates without a service break.
You must analyse how long a service break is acceptable and when it can take place. Or whether getting incomplete results is acceptable.
Personally I just follow https://community.splunk.com/t5/Installation/What-s-the-order-of-operations-for-upgrading-Splunk-Ent... for the order of updates. But as you will probably need that service break anyway, I would do all the indexers at the same time, as quickly as possible. You can stage the OS update so that splunkd stays up until it's time to restart; that way you can probably squeeze the service break down. All the other nodes you could/should update one by one, without a service break, after the indexer layer is back up.
Of course, normal communication to all users etc. is needed, and so on...
I'm of course fully aware of the lack of data availability during an indexer restart.
I would, however, prefer to keep most of the indexers running and restart them one by one, to keep data ingestion from the HFs working. I assume the downtime of one indexer will leave a slightly skewed bucket distribution, but I think I can live with that.
You can do it that way too. Just remember that your alerts etc. won't work correctly if some data is missing. Reports and summary-index updates can also fail, or at least lead to wrong results.
I think you should put the CM into maintenance mode and then use splunk stop (not offline) on the individual indexers; otherwise the cluster could start replicating those buckets to another node. I have never had a multi-node cluster with RF=1, so please test it and report back. 😉
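A rough sketch of that flow, assuming a default /opt/splunk install (untested with RF=1, as said):

```shell
# On the cluster master: suspend bucket fixup while the peers bounce
/opt/splunk/bin/splunk enable maintenance-mode

# On each indexer in turn: plain stop instead of offline
/opt/splunk/bin/splunk stop
#   ...apply the OS update, reboot, then:
/opt/splunk/bin/splunk start

# Back on the cluster master, once every peer has rejoined
/opt/splunk/bin/splunk disable maintenance-mode
/opt/splunk/bin/splunk show maintenance-mode   # should report it is off
```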
OK. I finally had the opportunity to do the updates and reboots. I did a "rolling restart" of the nodes. Nothing bad happened. Apart from searches returning incomplete data for a few minutes, no one would even notice.
OK. It seems that someone (probably the infrastructure management guys) decided to test it for me without my knowledge 🤣
One of the indexers was rebooted without my knowledge, and one of the search heads as well. It seems there are no problems at all. Since the Splunk service was, for some strange reason, not enabled to autostart on boot (I just "inherited" this installation, so I don't know why it was set up this way), the indexer was down for several hours. But apart from an obvious slight asymmetry in bucket distribution, there's nothing wrong.
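For reference, if I ever decide autostart should be on, enabling it looks like this (the splunk service account name is an assumption for this install; run as root):

```shell
# Register Splunk to start on boot, running as the 'splunk' user (assumed)
/opt/splunk/bin/splunk enable boot-start -user splunk
# On systemd distros, 8.1 can register a systemd unit instead:
# /opt/splunk/bin/splunk enable boot-start -systemd-managed 1 -user splunk
```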
It seems, then, that you can freely restart your Splunk components (indexers, search heads) one by one, and nothing really bad should happen (apart, of course, from partial results during the downtime).
You should be happy that automatic boot-start wasn't in place, and that they didn't do that to all your indexers at once 😉 In that case the result could have been "a little bit" different. Of course, this depends on how big your environment is and how much you ingest on a daily basis.