
Why is a Search Head Cluster Member not replicating all changes?

mdsnmss (SplunkTrust)

We recently added a new member to our search head cluster, and after changing the captain once the new member was added, we have been experiencing replication issues with one of the members in the cluster.

One member is not publishing its changes to the rest of the cluster; this shows up as a dashboard created on that member not appearing on the others. The strange part is that reports will replicate. It seems like the configuration push from the problem member to the captain takes so long that by the time it arrives it is already out of date. This appears in the logs as:

05-19-2017 11:49:30.853 -0400 WARN  ConfMetrics - single_action=PUSH_TO took wallclock_ms=118946! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:49:30.853 -0400 WARN  ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=52b08cafbfb11ce9d453f78003f3449bb74d4829; current_baseline_op_id=36a8837153caf8be7e1ca7604851fa75dc9b4e06"
--
05-19-2017 11:51:50.296 -0400 WARN  ConfMetrics - single_action=PUSH_TO took wallclock_ms=118399! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:51:50.296 -0400 WARN  ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=36a8837153caf8be7e1ca7604851fa75dc9b4e06; current_baseline_op_id=f662e069cf5cafa23d57fda3281422c33fe03b46"
--
05-19-2017 11:54:03.011 -0400 WARN  ConfMetrics - single_action=PUSH_TO took wallclock_ms=117277! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:54:03.011 -0400 WARN  ConfReplicationThread - Error pushing configurations to captain=<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=1936b8e36a94adc7f8321bfa46889d05fd70476b; current_baseline_op_id=40f8f8a3c05895d2f295bc4b4d58c8be9d7dbe82"
--
05-19-2017 11:56:13.752 -0400 WARN  ConfMetrics - single_action=PUSH_TO took wallclock_ms=115828! Consider a lower value of conf_replication_max_push_count in server.conf on all members
05-19-2017 11:56:13.752 -0400 WARN  ConfReplicationThread - Error pushing configurations to captain=https://<sh_captain>, consecutiveErrors=1 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=9071097ce08bbc4988be45b8f5bc9ae5d61569e3; current_baseline_op_id=038728b4743c774e8d3a3a89d3c47f4a3be5a59d"

The problem member is a previously working member of the cluster and is not the newly added one. It was previously the captain and began running into the issue after we switched to another captain. Trying to move the captain back to this member almost crashed the cluster. The number of unpublished changes on this member is not very high yet, and consecutiveErrors does not exceed 1, so based on the documentation it would seem a destructive resync is not needed yet. I would like to avoid that if possible and allow the changes that exist on the problem member to be replicated.
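
For anyone tracking the same symptoms: the errors above can be watched with a search over the internal index (this is a suggestion based on the components in the log snippets, not something from the original post; host, component, and log_level are standard splunkd fields):

index=_internal sourcetype=splunkd (component=ConfReplicationThread OR component=ConfMetrics)
| stats count latest(_time) as last_seen by host component log_level

Running "splunk show shcluster-status" on any member also shows the current captain and each member's state while waiting for the problem member to catch up.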

1 Solution

mdsnmss (SplunkTrust)

I was able to get all the changes replicated and the problem member caught up without issuing a resync. As suggested in the log events, lowering conf_replication_max_push_count helped somewhat but did not completely resolve the issue. I used it in combination with conf_replication_summary.period to resolve the issue. It took some trial and error to find values that worked, but temporarily setting some low values allowed the member to catch up. After it caught up and I reverted to the defaults, it has remained in sync.
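
In case it helps someone else, here is a minimal sketch of the kind of temporary override described above. The settings live in the [shclustering] stanza of server.conf on the members, and the values below are illustrative only; the thread does not record the values that ended up working. Changing server.conf typically requires a restart of the member, and the settings were reverted to defaults once the member caught up.

[shclustering]
# Push fewer configuration operations per replication cycle so each push
# finishes before the captain's baseline moves on (the default is 100).
conf_replication_max_push_count = 20
# Recompute the replication summary more often while catching up
# (illustrative value, not the one used in this thread).
conf_replication_summary.period = 30s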

dilipbailwal (Path Finder)

You will have to do a resync once to bring the cluster member on par with the other members.
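
For reference, the documented destructive resync is run on the out-of-sync member itself. It discards that member's local, unreplicated changes (which is exactly what the original poster wanted to avoid) and pulls a fresh copy of the replicated configuration from the captain:

splunk resync shcluster-replicated-config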
