We recently added a new member to our search head cluster, and after changing the captain once the new member was added, we have been experiencing replication issues with one of the members in the cluster.
One member is not publishing its changes to the rest of the cluster: a dashboard created on that member does not appear on the others, yet reports do replicate. It seems the configuration push from the problem member to the captain is taking so long that by the time it arrives it is already out of date. This appears in the logs as:
05-19-2017 11:49:30.853 -0400 WARN ConfMetrics - single_action=PUSH_TO took wallclock_ms=118946! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:49:30.853 -0400 WARN ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=52b08cafbfb11ce9d453f78003f3449bb74d4829; current_baseline_op_id=36a8837153caf8be7e1ca7604851fa75dc9b4e06"
--
05-19-2017 11:51:50.296 -0400 WARN ConfMetrics - single_action=PUSH_TO took wallclock_ms=118399! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:51:50.296 -0400 WARN ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=36a8837153caf8be7e1ca7604851fa75dc9b4e06; current_baseline_op_id=f662e069cf5cafa23d57fda3281422c33fe03b46"
--
05-19-2017 11:54:03.011 -0400 WARN ConfMetrics - single_action=PUSH_TO took wallclock_ms=117277! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:54:03.011 -0400 WARN ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=1936b8e36a94adc7f8321bfa46889d05fd70476b; current_baseline_op_id=40f8f8a3c05895d2f295bc4b4d58c8be9d7dbe82"
--
05-19-2017 11:56:13.752 -0400 WARN ConfMetrics - single_action=PUSH_TO took wallclock_ms=115828! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:56:13.752 -0400 WARN ConfReplicationThread - Error pushing configurations to captain=https://<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=9071097ce08bbc4988be45b8f5bc9ae5d61569e3; current_baseline_op_id=038728b4743c774e8d3a3a89d3c47f4a3be5a59d"
The problem member is a previously working member of the cluster and is not the newly added one. It was previously the captain, and after we switched to another captain it began running into this issue. Trying to move the captain back to this member almost crashed the cluster. The number of unpublished changes on this member is not very high yet, and consecutiveErrors never exceeds 1, so based on the documentation a destructive resync does not appear to be needed yet. I would like to avoid that if possible and allow the changes that exist on the problem member to be replicated.
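For reference, the member and captain state mentioned above can be checked from the CLI on any member (the path assumes a default install location):

$SPLUNK_HOME/bin/splunk show shcluster-status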
I was able to get all the changes replicated and the problem member caught up without issuing a resync. As suggested in the log events, lowering conf_replication_max_push_count helped somewhat but did not completely resolve the issue. I used it in combination with conf_replication_summary.period to resolve the problem. It took some trial and error to find values that worked, but temporarily setting some low values allowed the member to catch up. After it caught up and the settings were reverted to defaults, it has remained in sync.
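For reference, a sketch of the temporary overrides: they go in the [shclustering] stanza of server.conf (on all members, as the warning suggests), and a restart is needed for them to take effect. The values below are illustrative examples only, not the exact values I ended up with:

[shclustering]
# Temporarily push fewer changes per replication cycle so each push
# completes before the captain's baseline moves on (example value)
conf_replication_max_push_count = 50
# Temporarily build the replication summary more often (example value)
conf_replication_summary.period = 30s

Once the member caught up, removing these overrides restored the default behavior.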
You will have to do a resync once to bring the cluster member up to par with the other members.
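If it does come to that, the destructive resync is run on the out-of-sync member and replaces its replicated configuration with the captain's copy (double-check the procedure in the docs for your Splunk version before running it):

$SPLUNK_HOME/bin/splunk resync shcluster-replicated-config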