Can someone describe the conditions the cluster master waits for when scheduling restarts of cluster peers after I have run
splunk apply cluster-bundle?
We have 8 peers in total.
3 in site1, 2 in site2, 3 in site3.
We have not varied the percent_peers_to_restart value from its default of 10 percent.
When we run
splunk apply cluster-bundle
and the CM calls a restart on the 8 cluster peers, we regularly see more than one indexer down at once, and often more than one down in the same site.
As I understand it, this should not happen, which is why I want to understand what the CM waits for before starting the next restart.
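For what it's worth, the arithmetic behind "this should not happen" can be sketched as a toy calculation (not Splunk's actual scheduler code; the floor-with-a-minimum-of-one behavior is an inference from how this thread describes percent_peers_to_restart):

```shell
# Toy arithmetic only: how many restart "slots" 10% of 8 peers yields.
# Assumes the CM takes the floor of the percentage, with a minimum of one
# peer restarted per round (an assumption, not verified Splunk internals).
peers=8
percent=10
slots=$(( peers * percent / 100 ))   # 8 * 10 / 100 = 0 in integer math
[ "$slots" -lt 1 ] && slots=1        # so the minimum of one kicks in
echo "restart slots: $slots"
```

With 8 peers at the default 10 percent, that works out to a single restart slot, which is why seeing several indexers down at once is surprising.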
I have extended the following from the defaults:
[clustering]
restart_timeout = 300
Of our 8 peers, 6 of them start themselves in 5-7 minutes, but 2 take up to 20 minutes.
By "start", I mean they start, and check what buckets they have in place and report them to the cluster master.
It does not look like the CM waits for the peers to complete that activity before kicking off the restart of the next peer, so we regularly have people running searches and getting incomplete-results warnings.
Currently, if an IDX takes longer than restart_timeout to come back online, the CM marks that IDX as "down". Counter-intuitively, this frees up a slot in the CM's IDX restart queue, and it moves on to the next IDX. Of course, this reduces the total number of IDXes available.
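That interaction between slow starters and restart_timeout can be illustrated with a toy count. The startup times below are made up to match the numbers in this thread (six peers at 5-7 minutes, two at around 20 minutes) against a restart_timeout of 600 seconds; this is a model of the behavior described above, not Splunk code:

```shell
# Toy model: any peer whose startup exceeds restart_timeout gets marked
# "down" while the CM moves on to the next peer, so slow peers pile up
# as "down" concurrently rather than serializing the restart.
restart_timeout=600
marked_down=0
for startup in 300 360 420 300 360 420 1200 1200; do  # seconds, hypothetical
  if [ "$startup" -gt "$restart_timeout" ]; then
    marked_down=$(( marked_down + 1 ))
  fi
done
echo "peers marked down during the rolling restart: $marked_down"
```

Here both 20-minute peers blow past the timeout, so two indexers end up down at once even though the restart percentage only allows one slot.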
In addition, the CM does not take into account SF or RF when doing a rolling restart. E.g., it doesn't check to see if the current IDX it's restarting has the last searchable copy if other IDXes are marked as down.
The only way to avoid this issue and keep the data 100% searchable throughout maintenance like this is to make a multisite cluster (which could be in the same DC) and use the
-site-by-site flag, as described in the link mbrown posted.
If you have large numbers of buckets per IDX, this can increase the amount of time it takes an IDX to restart. Generally, you will see fewer issues if you keep each IDX under 100K buckets. One of our primary index-clustering developers gave a talk at our users conference last year with recommendations for cluster tuning based on bucket counts. Slide 15 has a table with recommendations for tuning 'service_interval' (on the CM), 'heartbeat_period' (on the IDXes), and 'heartbeat_timeout' (on the CM), as well as a few other settings.
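For reference, those settings live in server.conf. The skeleton below only shows where each one goes; the actual values should come from the table on slide 15 of the linked deck, so they are left as placeholders here rather than guessed at:

```
# On the cluster master -- server.conf
[clustering]
service_interval = <per the slide 15 table>    # how often the CM runs its service loop
heartbeat_timeout = <per the slide 15 table>   # how long before the CM marks a silent peer down

# On each indexer (peer) -- server.conf
[clustering]
heartbeat_period = <per the slide 15 table>    # how often the peer heartbeats to the CM
```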
You can read the slides here: https://conf.splunk.com/session/2015/conf2015_Dxu_Splunk__Deploying_IndexerClusteringTips.pdf
A recording of the talk is available here:
As I stated in my comment above, if you have large numbers of buckets with timestamp issues, this can cause problems with cluster rolling restarts. If a timestamp is far in the past compared to the current time, and you're using time-based retention, this will cause buckets to roll prematurely, creating many small buckets.
At a minimum, consider running
splunk remove excess-buckets [index-name] periodically, particularly if you have had any IDX outages, as the CM does not remove excess buckets automatically.
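A minimal way to script that periodic cleanup is sketched below. The index names are placeholders (substitute your real list), and the loop only prints the commands by default so you can review them before running anything on the cluster master:

```shell
# Sketch only: these three index names are made up -- replace them with
# your own list of clustered indexes.
INDEXES="main web_logs firewall"

for idx in $INDEXES; do
  # Printing rather than executing; drop the echo to run for real on the CM.
  echo splunk remove excess-buckets "$idx"
done
```

Wrapping this in cron (or a scheduled task) after any IDX outage keeps the excess-bucket count from creeping up between maintenance windows.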
Thanks for the links to the .conf slides, I'll take a look.
Your comments on the possible reasons why you might see many indexers down at once all seem alarmingly familiar.
1. An indexer taking longer than "restart_timeout" to restart permits the next indexer to be restarted. We have "restart_timeout" set to 600, which is fine for all but two of our indexers. I will increase it. This will help.
2. Indexers with more than 100,000 buckets. Check 😞
3. Our oldest indexers have crufty NTFS filesystems which appear to exhibit substantially slower IO than our newer indexers. (Stat-ing 100,000 buckets takes a lot longer on two of them than on the others.)
4. Large numbers of buckets with timestamping issues - Check 😞 Now fixed, but they're still in there as they haven't expired out yet.
5. We still use the default restart percentage of 10 percent; there is no sense in changing that. With 8 indexers spread across three sites, we should only ever see a single indexer down at a time. We must be exceeding the restart_timeout on several indexers.
As for "why aren't you using the -site-by-site flag": I can't, because I'm not running rolling-restart, I'm running apply-cluster-bundle, which doesn't let me use that flag. It invokes the rolling restart itself, using its built-in heuristics, without giving me the chance to specify -site-by-site.
I keep on top of the surplus buckets, it doesn't seem to have got too bad in that respect.
I suggest opening a support case requesting that the
-site-by-site flag be added as an option to
apply-cluster-bundle. In the case, ask for it to be assigned to me, and I'll submit it.
To expand on @lohitkidu's answer: by default the rolling restart is not site-aware, and it needs to be invoked with site awareness for multisite clusters.
Details of this can be located within the "Managing Indexers and Clusters of Indexers" documentation: http://docs.splunk.com/Documentation/Splunk/6.4.0/Indexer/Userollingrestart
Another point: if you have large numbers of buckets with timestamp issues, this can cause problems with cluster rolling restarts. Reducing the overall number of buckets in the cluster can help reduce restart times. At a minimum, consider running:
splunk remove excess-buckets [index-name]