In the 5.0 release, rolling-restart, apply, and "rolling offline" - i.e., offlining peers one at a time in sequence - are all not search-safe. Updating the configuration cluster-wide via apply really does behave like a "maintenance mode": data is safe, but it may not be searchable during the rolling restart. After the rolling restart completes, the cluster should be searchable again (I believe the master commits a new generation at that point). The docs don't seem to state this explicitly; I'll try to get them updated.
Also, we are working to fix the limitations detailed below.
To explain what is going on a bit more:
Every peer is potentially both the source and the target of ongoing hot bucket replications: it originates some hot buckets that are replicated to other peers, and it is the target (and potentially the searchable target; this is the problematic case) for hot buckets originating on other peers. Each peer is also the primary for the hot buckets it originates. When we offline a peer - say peer A - it cleanly rolls the hot buckets it originates and transfers primary responsibility for those hot buckets (along with any other warm buckets it is primary for) to other peers. It doesn't worry about any hot bucket - say bucket B1 - for which it is the searchable streaming target and which originates on some other peer, since that source is still up and is still responsible for searching that bucket. So offlining one peer works by fixing up the hot buckets it originates and not worrying about the hot buckets it is receiving. For a rolling restart, though, those received buckets do come into the picture.
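To make those roles a bit more concrete, here is a rough Python sketch of the bookkeeping involved. The names and structure are made up purely for illustration - this is not the actual peer/master implementation, just a model of the behavior described above:

```python
# Illustrative model only - class and field names are invented for this example,
# not Splunk's actual implementation.

class Bucket:
    def __init__(self, name, source, targets, searchable_on):
        self.name = name
        self.source = source                      # peer that originates (indexes) this hot bucket
        self.targets = set(targets)               # peers receiving streamed replicas
        self.searchable_on = set(searchable_on)   # peers holding searchable copies
        self.primary = source                     # the originator is primary for its own hot buckets
        self.hot = True

def offline_peer(peer, buckets, peers_up):
    """Model of cleanly taking one peer offline."""
    peers_up.discard(peer)
    for b in buckets:
        if b.source == peer and b.hot:
            b.hot = False                         # the peer rolls the hot buckets it originates
        if b.primary == peer:
            # hand primacy to some other peer that has a copy and is still up
            candidates = (b.targets | {b.source}) & peers_up
            b.primary = next(iter(candidates), None)
        # Buckets where `peer` is only a replication target (like B1) are left
        # alone - their source is still up and still searches them.

peers_up = {"A", "B", "C"}
# B1: originated on B, streamed to A (its searchable target) and C.
b1 = Bucket("B1", source="B", targets=["A", "C"], searchable_on=["B", "A"])
# B2: originated on A itself, streamed to B.
b2 = Bucket("B2", source="A", targets=["B"], searchable_on=["A", "B"])

offline_peer("A", [b1, b2], peers_up)
print(b1.name, b1.hot, b1.primary)   # B1 True B  -> untouched; its source B still handles it
print(b2.name, b2.hot, b2.primary)   # B2 False B -> rolled, primacy handed off to B
```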
Now when peer A comes back, its copy of bucket B1 might be invalid. In the Ace release, we don't fix up the bucket mid-stream - i.e., catch it up on the data that has already been indexed while also keeping track of the data still flowing into it - and we can't fix up the search metadata files mid-stream either. Instead, the source rolls the bucket at that point. The copy on the peer that just restarted is likely invalid, so it is discarded and the master fixes up the bucket. If the discarded copy was a searchable copy, another copy has to be made searchable, and that can take a while depending on the size of the bucket. During this time, with SF=2, the source of B1 is the only peer with a valid searchable copy of B1. If the source of B1 also goes offline, then there is no searchable copy of the bucket online while the source is restarting. (Another copy is being made searchable, but it may not have finished yet, and the source, which has the only complete searchable copy, has gone offline.) So: data is not lost, but there may be no searchable copy online at that point.
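Here is the same kind of illustrative sketch for that SF=2 timeline (again, invented names and a simplified model, not actual internals):

```python
# Illustrative timeline only - invented for this example, not actual Splunk internals.

# Copies of hot bucket B1 with SF=2: the source B and peer A hold searchable copies.
# Per-peer copy state: "searchable", "non_searchable", "building", or "discarded".
copies = {"B": "searchable", "A": "searchable", "C": "non_searchable"}
online = {"A", "B", "C"}

def searchable_online(copies, online):
    return [p for p, state in copies.items() if state == "searchable" and p in online]

# 1. Peer A restarts: its streamed copy of B1 is likely invalid and gets discarded.
online.discard("A"); copies["A"] = "discarded"
online.add("A")                     # A comes back, but without a valid copy of B1

# 2. The master reacts by having C make its copy searchable (this takes time).
copies["C"] = "building"
print(searchable_online(copies, online))   # ['B'] - the source holds the only searchable copy

# 3. The rolling restart now takes B (the source) offline before C has finished.
online.discard("B")
print(searchable_online(copies, online))   # [] - no searchable copy online; data still safe on disk
```

The gap between steps 2 and 3 is exactly the window where there is no complete searchable copy of B1 online, even though no data has been lost.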
Since in a cluster every peer is likely the searchable target for some bucket, and every peer is going to go offline at some point, the above situation is likely true for one or more buckets throughout the rolling restart. So the cluster itself won't be search-safe for the duration of the rolling restart process.
Hope that helps explain what is going on. If you have more questions, ask away. And hopefully updating the config cluster-wide is infrequent enough that you can treat it as downtime for searches. We are working to fix this going forward.