We have recently been having an issue where a rolling restart of our indexer cluster takes 12-24 hours for 18 indexers. We are on Splunk 7.0.7. We pushed some changes a couple of weekends ago and it took about 22 hours to restart all the indexers. The weekend before, it took about 10 hours. Is anyone else seeing rolling restarts take this long? How long should we expect an 18-indexer cluster to take to complete its restarts? Any advice on where to look to find out why it is taking so long?
Indexers usually take a while to restart because their hot buckets have to roll to warm. When each indexer is receiving very high volumes of data, the restart process slows down further.
A couple of things to consider to help make the process smoother:
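1- Are you using a normal rolling restart or a searchable rolling restart? If your restarts take this long, a searchable rolling restart is a good way to keep search interruption to a minimum. Note that, as far as I know, it requires Splunk 7.1 or later, so you would have to upgrade from 7.0.7 first.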
2- Check your replication factor (RF) and search factor (SF). If you're using multisite and some sites hold only a single copy of the data, running a rolling restart without specifying the percentage of peers to restart will slow the process down. By default, a rolling restart takes down 10% of the peers at a time. With 18 indexers, that means about 2 restart at the same time, which can leave holes in your scheduled searches, depending on your RF and SF.
3- Check the release notes and known issues for your current Splunk version for anything related to slow restarts. You may find a relevant bug, and upgrading might just fix it ^^
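To put point 2 into numbers, here is a rough Python sketch of the batch arithmetic. It assumes the batch size is the configured percentage of peers rounded to the nearest whole peer (with a minimum of one); I have not verified how Splunk actually rounds, and the function name `estimate_rolling_restart` is purely illustrative.

```python
import math


def estimate_rolling_restart(peer_count, percent_per_batch=10, hours_observed=None):
    """Estimate batch size, number of batches, and observed hours per batch.

    Assumption (not verified against Splunk): the batch size is the
    percentage of peers rounded to the nearest whole peer, minimum 1.
    """
    batch_size = max(1, round(peer_count * percent_per_batch / 100))
    batches = math.ceil(peer_count / batch_size)
    hours_per_batch = None
    if hours_observed is not None:
        hours_per_batch = hours_observed / batches
    return batch_size, batches, hours_per_batch


# Figures from the question: 18 indexers, default 10%, 22 hours in total.
batch_size, batches, hours_per_batch = estimate_rolling_restart(18, 10, 22)
print(f"{batch_size} peers per batch, {batches} batches, "
      f"about {hours_per_batch:.1f} hours per batch")
```

Under that assumption, 18 indexers restart in 9 batches of 2, so a 22-hour rolling restart works out to well over two hours per batch. That points at the individual indexer restarts, not the number of batches, as the thing to investigate.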
I can't tell you how long it should take, but maybe a second data point will help. Our biggest environment has 9 indexers.
For me, after a bundle push, it's sometimes not clear if a rolling restart is necessary or not.
When a restart is necessary, I do see it take up to a couple of hours. During that time, check the splunkd.log of your indexer(s) on the CLI. You will probably see a lot of bucket moving, normally for every index you have. After the rolling restart there will usually be a lot of fixup tasks too. You can check them in the cluster master dashboard under indexes->buckets. Until these fixup tasks are done, the cluster will not meet its replication and search factors.
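If splunkd.log is too big to read by eye, a small script can show which components are doing most of the logging during the restart window. This is only a sketch: it assumes each line starts with a timestamp like `MM-DD-YYYY HH:MM:SS.mmm +ZZZZ` followed by the log level and the component name, and the helper names `tally_components` and `top_components` are made up for this example.

```python
import re
from collections import Counter

# Assumed line layout: "MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL  Component ..."
LINE_RE = re.compile(
    r"^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} (\w+)\s+(\S+)"
)


def tally_components(lines):
    """Count log lines per (level, component); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            level, component = match.groups()
            counts[(level, component)] += 1
    return counts


def top_components(path, n=10):
    """Return the n busiest (level, component) pairs in the given log file."""
    with open(path, errors="replace") as log_file:
        return tally_components(log_file).most_common(n)
```

You would call it as `top_components("/opt/splunk/var/log/splunk/splunkd.log")`, where the path assumes a default Linux install under `/opt/splunk`. If one or two bucket-related components dominate the counts, that is consistent with the bucket moving described above.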
The docs do not say much about slow restarts, but you might want to check this page. It describes some settings you can change that might lead to a faster restart: http://docs.splunk.com/Documentation/Splunk/7.2.0/Indexer/Userollingrestart#Handle_slow_restarts