I have noticed something odd in an SHC deployment. I'm consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, which report the cause as reason="active replication count >= max_peer_rep_load".
While these errors don't appear to have any actual impact on scheduled searches or anything else, I would like to get to the bottom of what is causing them. They don't seem to be tied to any particular node, search, or user.
See the end of the post for the relevant error messages.
There are four nodes in the multi-site search head cluster (two in each site).
The [shclustering] stanza of server.conf has replication_factor=2 configured
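For reference, the relevant part of server.conf on each member looks roughly like this, trimmed down to the replication-related settings (a sketch rather than my exact file):

    [shclustering]
    replication_factor = 2
    # max_peer_rep_load is not set explicitly, so it uses the default of 5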
Where I suspect the problem is occurring is that the search heads are not honouring the replication_factor setting: when I run "splunk list shcluster-members", the replication_count values vary between 3 and 5. I would expect to see 2 here, wouldn't I?
If the default max_peer_rep_load = 5 and the replication_count on at least one of those search heads is showing 5 when I check, then I assume this is what is causing the excess replication?
When I run "splunk list shcluster-config" against all the nodes, they correctly show max_peer_rep_load = 5 and replication_factor = 2.
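For anyone wanting to reproduce the checks, these are essentially the commands I'm running on each member (the grep filters are just my shorthand; exact output field names may vary by version):

    # replication_count per member, as reported by the captain
    splunk list shcluster-members | grep -Ei "label|replication_count"

    # effective cluster settings on this node
    splunk list shcluster-config | grep -Ei "replication_factor|max_peer_rep_load"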
Has anyone seen this in a search head cluster before? I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment from SplunkIT, but I want to verify the root cause before I go tweaking configs.
Thanks in advance for anyone's insight into what may be causing it and how to further troubleshoot / resolve.
10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'
10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"
10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"
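For anyone wanting to see how widespread these errors are on their own cluster, a search over the internal logs along these lines should surface them (a rough sketch; adjust the time range and grouping as needed):

    index=_internal sourcetype=splunkd ERROR SHCMasterArtifactHandler "max_peer_rep_load"
    | stats count BY host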
I guess no one else is experiencing this? I suspect that a complete destructive resync of the entire cluster is in order.
Did you ever find a solution to this?
I am experiencing the same problem in my environment. It seems to be connected to searches that use a "nested setup", in the following way (a rough sketch follows the list):
1. A scheduled summary-indexing search stores events in a new index.
2. Statistics from that index are saved in reports.
3. The reports are used as the foundation for new searches that populate the panels in dashboards (through | loadjob).
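Roughly, the pattern looks like this; all index, search, and owner names below are made-up placeholders, not our real objects:

    1. Scheduled summary-indexing search, writing into a dedicated index:
       index=web sourcetype=access_combined | stats count BY status | collect index=summary_web

    2. Scheduled report over that summary index (saved as "web_status_report"):
       index=summary_web | stats sum(count) AS total BY status

    3. Dashboard panel reusing the report's latest results instead of re-running the search:
       | loadjob savedsearch="admin:search:web_status_report"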
I encountered the same problem in my environment. We use scheduled searches plus "| loadjob" in our dashboards for access control, to avoid granting users direct index access.
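As a concrete illustration of that pattern (names here are hypothetical), one of our dashboard panels boils down to Simple XML like this, so the viewing user only needs the dashboard and the saved search's results, not access to the underlying index:

    <dashboard>
      <label>Access-controlled summary</label>
      <row>
        <panel>
          <table>
            <search>
              <query>| loadjob savedsearch="svc_reporting:search:acl_summary_report"</query>
            </search>
          </table>
        </panel>
      </row>
    </dashboard>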
We asked Splunk support for help; they suggested increasing max_peer_rep_load, but they couldn't tell us what value to increase it to, so we have to work that out by trial and error ourselves.
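For anyone else going down the same route, the change itself is just the [shclustering] stanza in server.conf on the cluster members, something like the below; the value 10 is only the first number we are trialling, not a recommendation, and we restart splunkd on each member after changing it:

    [shclustering]
    # default is 5; raised because async replication requests were being
    # rejected with "active replication count >= max_peer_rep_load"
    max_peer_rep_load = 10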