SHC - failed on handle async replicate request

I have noticed something odd in a SHC deployment. I'm consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, which give the reason "active replication count >= max_peer_rep_load".

While these errors don't appear to have any actual impact on scheduled searches or anything else, I would like to get to the bottom of what is causing them. They don't seem to be tied to a particular node, search, or user.
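To check more rigorously whether the errors cluster on one node, a short script can tally them by source and target GUID. This is only a rough sketch: the function name `tally_rejections` is hypothetical, and the regular expressions assume the quoted `srcGuid="..."` and `targetGuid="..."` fields appear exactly as in the SHCMasterArtifactHandler message quoted at the end of this post.

```python
import re
from collections import Counter

# Assumption: field layout matches the SHCMasterArtifactHandler line quoted in this post.
SRC_RE = re.compile(r'srcGuid="([0-9A-Fa-f-]+)"')
TGT_RE = re.compile(r'targetGuid="([0-9A-Fa-f-]+)"')
REASON = "active replication count >= max_peer_rep_load"


def tally_rejections(lines):
    """Count rejected replication requests per source GUID and per target GUID."""
    by_source = Counter()
    by_target = Counter()
    for line in lines:
        # Only count the handler's own line; the SHCRepJob and
        # SHCMasterHTTPProxy lines repeat the same error text.
        if " SHCMasterArtifactHandler - " not in line or REASON not in line:
            continue
        src = SRC_RE.search(line)
        tgt = TGT_RE.search(line)
        if src and tgt:
            by_source[src.group(1)] += 1
            by_target[tgt.group(1)] += 1
    return by_source, by_target
```

Passing it an open file handle for splunkd.log from the captain and printing `by_source.most_common()` and `by_target.most_common()` should show whether one GUID dominates or the rejections are spread evenly across members.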

See the end of the post for the relevant error messages.

There are 4 nodes in the multi-site search head cluster (2 in each site).

The [shclustering] stanza of server.conf has replication_factor=2 configured

I suspect the problem is that the search heads are not implementing the replication_factor setting, because when I run the command "splunk list shcluster-members" I see the replication_count numbers vary between 3 and 5. Shouldn't I expect to see 2 here?

If the default max_peer_rep_load is 5 and the replication_count of at least one of those search heads shows 5 when I check them, then I assume this is what is causing the excess replication?

When I run the command "splunk list shcluster-config" against all the nodes, they correctly show [max_peer_rep_load = 5] and [replication_factor = 2].

Has anyone seen this in a search head cluster before? I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment from SplunkIT, but I want to find the root cause of the issue before I go tweaking configs.

Thanks in advance for anyone's insight into what may be causing it and how to further troubleshoot / resolve.

10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'

10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"

10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"

Re: SHC - failed on handle async replicate request

I guess no one else is experiencing this? I suspect that a complete destructive resync of the entire cluster is in order.
