Firstly, my indexer cluster consists of 2 indexers (each with a 6TB volume) and a Cluster Master to manage them. For the most part, CPU and memory are where you'd expect them to be (CPU anywhere between 20-40% and memory around the same). This is with 83 sources averaging around 150-200GB a day.
We have automated RHEL OS patching that occurs on a regular schedule, and this obviously means the environment is not in maintenance mode when it happens. During patching, the indexers are patched at separate times (for example, one indexer patches an hour after the other one restarts). Consequently, after the second one has been patched, I see my indexers run hot (90%+ CPU) and hit the RHEL cut-out limit (28GB of 32GB), where RHEL protects the OS and kills the splunkd service, which then restarts.
This goes on for something like 6-8 hours before things settle back down to the normal 20-40% utilization until the next patching cycle. Thankfully this doesn't impact us too greatly, as everything eventually rebalances and Splunk just keeps working along (searches a little slower, obviously).
We are currently running on the 7.2 stream, and I'd like to know the best way to reduce this high load and shorten how long it takes for CPU/memory to settle. Would enabling maintenance mode before patching and then disabling it afterwards (or the next day) reduce this? I also noticed that when an app is deployed to the indexers (via the cluster configuration bundle), it tends to cause a similar CPU/memory spike for almost as long (closer to 3-4 hours). Currently I don't have scope to increase the resources I have.
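For what it's worth, wrapping the patch window in maintenance mode is done with the standard CLI on the Cluster Master. A rough sketch (paths assume a default `/opt/splunk` install; adjust for your environment):

```shell
# On the Cluster Master only, before the first indexer is patched:
# maintenance mode suspends bucket fix-up activity (primary reassignment,
# replication repairs) while peers go up and down.
/opt/splunk/bin/splunk enable maintenance-mode

# Verify the cluster is actually in maintenance mode:
/opt/splunk/bin/splunk show maintenance-mode

# ... patch and restart each indexer in turn ...

# After the last indexer is back up and has rejoined the cluster:
/opt/splunk/bin/splunk disable maintenance-mode
```

Note that maintenance mode only defers the fix-up work; some catch-up activity is still expected once you disable it, but it avoids the master repeatedly reassigning primaries as each peer bounces.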
As @richgalloway said, you should use maintenance mode when patching those servers. If you don't, then when one peer goes down the master starts reassigning primaries to the buckets on the nodes that are still up, and when that peer comes back up it does it again. Then, when the second one goes down, it starts that all over again. Depending on your number of nodes, your storage, and how long you keep data searchable, this fix-up activity can take a long time.
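You can watch this fix-up activity from the Cluster Master to confirm that's what is eating your CPU after patching. A sketch using standard CLI commands (run on the master; `/opt/splunk` path is an assumption):

```shell
# Overall cluster health: shows whether the replication factor and
# search factor are met, and which peers are Up/Down/Pending.
/opt/splunk/bin/splunk show cluster-status

# List the peer nodes and their status as the master sees them:
/opt/splunk/bin/splunk list cluster-peers
```

While the search factor or replication factor shows as not met, the indexers are still busy replicating and rebuilding buckets, which lines up with the 6-8 hours of high utilization you're seeing.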