SHC - failed on handle async replicate request

ahartge
Path Finder

I have noticed something odd in an SHC deployment. I'm consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, and the reported reason is "active replication count >= max_peer_rep_load".

While these errors don't appear to have any actual impact on scheduled searches or anything else, I would like to get to the bottom of what is causing them. They don't seem to be tied to a particular node, search, or user.

See the end of the post for the relevant error messages.

There are 4 nodes in the multi-site search head cluster (2 in each site).

The [shclustering] stanza of server.conf has replication_factor=2 configured.

I suspect the problem is that the search heads are not honoring the replication_factor setting: when I run "splunk list shcluster-members", the replication_count values vary between 3 and 5. Shouldn't I expect to see 2 here?

The default is max_peer_rep_load = 5, and at least one of those search heads shows a replication_count of 5 when I check, so I assume this is what is causing the excess replication?

When I run "splunk list shcluster-config" against each node, they all correctly show max_peer_rep_load = 5 and replication_factor = 2.

Has anyone seen this in a search head cluster before? I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment by SplunkIT, but I want to find the root cause before I start tweaking configs.

Thanks in advance for anyone's insight into what may be causing it and how to further troubleshoot / resolve.

10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'

10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"

10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"
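
One way to check whether these failures really are spread evenly across nodes is to tally the captain-side errors by source and target GUID. The following is only a rough sketch, assuming the log lines look exactly like the samples above; the function name tally_failures and the idea of passing the splunkd.log path on the command line are illustrative, not anything Splunk provides.

```python
import re
import sys
from collections import Counter

# Match only the SHCMasterArtifactHandler error. The SHCRepJob and
# SHCMasterHTTPProxy lines echo the same failure, so counting them
# as well would inflate the totals.
PATTERN = re.compile(
    r'ERROR SHCMasterArtifactHandler - failed on handle async replicate request'
    r'.*?srcGuid="(?P<src>[0-9A-F-]+)"'
    r'.*?targetGuid="(?P<tgt>[0-9A-F-]+)"'
    r'.*?reason="(?P<reason>[^"]+)"'
)


def tally_failures(lines):
    """Count failures per (srcGuid, targetGuid, reason) triple."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match.group("src"), match.group("tgt"), match.group("reason"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Assumed usage: pass the path to a splunkd.log file as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (src, tgt, reason), n in tally_failures(handle).most_common():
            print(f"{n:6d}  src={src}  tgt={tgt}  reason={reason}")
```

If one source or target GUID dominates the output, that member is worth a closer look; an even spread would suggest the limit is being reached cluster-wide.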

BainM
Communicator

I guess no one else is experiencing this? I suspect that a complete destructive resync of the entire cluster is in order.

0 Karma

andsov
Explorer

Did you ever find a solution to this? 

I am experiencing the same problem in my environment. It seems to be connected to some searches that use a "nested setup" like this:

1. A summary indexing search stores events in a new index.
2. Statistics from the new index are saved in reports.
3. The reports are used as the foundation for new searches that populate the panels in dashboards (through | loadjob).


0 Karma

liuce1
Explorer

I encountered the same problem in my environment. We use a scheduled search plus "| loadjob" in our dashboards for access control, so that we can avoid granting users index access.

We asked Splunk Support for help, and they suggested increasing max_peer_rep_load, but they could not say what value to use, so we have to find it by trial and error ourselves.

0 Karma