Deployment Architecture

SHC - failed on handle async replicate request

ahartge
Path Finder

I have noticed something odd in a SHC deployment. Im consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, these report to be caused by the reason "active replication count >= max_peer_rep_load"

While these errors dont appear to be causing any actual impact on executing scheduled searches or anything else, I would like to get to the bottom of what is causing them. It doesnt seem to be a particular node or search or user that these occur for.

See the end of the post for the relevant error messages.

There are 4x nodes in the multi-site search head cluster (2 in each site)

The [shclustering] stanza of server.conf has replication_factor=2 configured

Where I suspect the problem is occuring is the search heads not implimenting the replication_factor setting, because when I run the command "splunk list shcluster-members" I see the replication_count numbers vary between 3 - 5 - I should expect to see 2 here shouldnt I ?

If the default max_peer_rep_load = 5 and the replication_count of at least one of those search heads is showing 5 when I check them - then I assume this is what is causing the excess replication to occur ?

When I run the command "splunk list shcluster-config" against all the nodes - they correctly show [max_peer_rep_load = 5] and [replication_factor = 2]

Has anyone seen this in a search head cluster before ? - I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment from SplunkIT - but want to verify before I go tweaking configs before finding the root cause of the issue.

Thanks in advance for anyone's insight into what may be causing it and how to further troubleshoot / resolve.

10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'

10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"

10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"

BainM
Communicator

I guess no one else is experiencing this? I suspect that a complete destructive resynch of the entire cluster is in order.

0 Karma

andsov
Explorer

Did you ever find a solution to this? 

I am experiencing the same problem in my environment. It seems to be connected to some searches which is using a "nested setup" in the following way: 

1. Summary indexing search is storing events to a new index. 
2. Statistics from the new index is saved in reports. 
3. The reports are used as the foundation for new searches to populate the panels in dashboards ( through | loadjob).  


0 Karma

liuce1
Explorer

I encountered the same problem in my environment , we use "schedule search"+"| loadjob" in our dashboard for access control to avoid granting user index access.

We seek help from Splunk support, they suggested us to increase the number of max_peer_rep_load,  but they didn't know what the number should be increased to, we need to try it by self .

0 Karma
Get Updates on the Splunk Community!

What's new in Splunk Cloud Platform 9.1.2312?

Hi Splunky people! We are excited to share the newest updates in Splunk Cloud Platform 9.1.2312! Analysts can ...

What’s New in Splunk Security Essentials 3.8.0?

Splunk Security Essentials (SSE) is an app that can amplify the power of your existing Splunk Cloud Platform, ...

Let’s Get You Certified – Vegas-Style at .conf24

Are you ready to level up your Splunk game? Then, let’s get you certified live at .conf24 – our annual user ...