What are people's experiences and expectations of the impact caused by a multisite indexer cluster rolling restart (on version 6.4.3)? By "impact", I mean that a complete set of events is not returned in searches.
I can identify two classes of data: "historical" data that was sent to the cluster before it was put into maintenance mode, and "recent" data sent to the cluster after it was put into maintenance mode.
For the purposes of example let's assume:
- A two-site cluster with a search factor of 2, with one searchable copy forced in each site
- The rolling restart never restarts more than one indexer at a time
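For concreteness, a setup like the one assumed above could be expressed on the cluster master roughly as follows. This is a sketch, not my exact config; the attribute names are from the `[clustering]` stanza of `server.conf`:

```ini
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
# One searchable copy forced at the originating site, two searchable copies total
site_search_factor = origin:1,total:2
site_replication_factor = origin:1,total:2
```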
I had expected no search impact to "historical" data, since there should always be a second searchable copy of the data within the cluster. This would require search heads to break site affinity and search across sites, but according to the docs on cluster maintenance mode this should happen, because the cluster master will still attempt to reassign primaries. In testing, however, I have found that historical data availability is severely and unpredictably affected. Often a search head will not search across sites at all towards the beginning of a rolling restart; later in the restart it starts to search across sites, but results are still incomplete.
I can see how "recent" data could be affected while the rolling restart is in progress: an event could have been written to one indexer, not yet replicated because the cluster is in maintenance mode, and then that indexer goes down for its restart, making the event unavailable until the indexer returns.
Does this mean it is not possible to restart an indexer cluster without severely impacting data searchability, so that it becomes necessary to block user access for the duration, and to disable alerting and anything else that relies on search? The docs seem to say that indexer clustering provides high availability, with the data always available for searching, but this appears to be a false claim.
If this impact is real and I haven't stuffed it up somehow, how can it be mitigated?
A rolling restart puts the cluster master into maintenance mode. In this mode, "fixup" activity is not undertaken; fixup includes (but is not limited to) promoting an available searchable copy to "primary". A bucket copy within a cluster is effectively in one of three states: replica, searchable, or primary. A primary is the searchable copy that actually services searches, and even in a multisite cluster there is only one primary copy of a given bucket for a given site.
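The mechanism above can be illustrated with a toy model (this is not Splunk code, and the peer/bucket names are invented). It shows why a restart is harmless under normal fixup, but loses data in maintenance mode: when the peer holding a bucket's primary copy goes down and no other copy is promoted, that bucket simply drops out of search results.

```python
# Toy model of primary assignment during a peer restart.
# Each bucket has copies on several peers; one copy is the "primary"
# that services searches.
buckets = {
    "bucket1": {"copies": ["peer_a", "peer_b"], "primary": "peer_a"},
    "bucket2": {"copies": ["peer_a", "peer_c"], "primary": "peer_c"},
}

def searchable_buckets(down_peer, fixup_enabled):
    """Return the buckets whose primary copy can still service a search."""
    results = []
    for name, b in buckets.items():
        if b["primary"] == down_peer:
            if fixup_enabled and any(p != down_peer for p in b["copies"]):
                # Normal operation: the master promotes a surviving
                # searchable copy to primary, so the bucket stays searchable.
                results.append(name)
            # Maintenance mode: no fixup, so the bucket is missing
            # from results until the peer returns.
        else:
            results.append(name)
    return results

# Normal fixup: both buckets survive peer_a's restart.
print(searchable_buckets("peer_a", fixup_enabled=True))   # ['bucket1', 'bucket2']
# Maintenance mode: bucket1's primary is gone, so searches are incomplete.
print(searchable_buckets("peer_a", fixup_enabled=False))  # ['bucket2']
```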
Search availability during rolling restart is a popular request from Splunk customers, and I feel that it's something that engineering should take under advisement.
Yes, thanks for your answer, sowings! At least it confirms that we were not doing things incorrectly and our cluster was functioning as Splunk intended. In that case, I don't think it is legitimate to claim that indexer clustering provides "High Availability" (as in uptime) as it stands. It provides HA for data ingestion, and disaster recoverability for data buckets, but search functionality is not "up", and search makes up the bulk of the product's functionality.