I have noticed something odd in an SHC deployment. I'm consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, which report the cause as reason="active replication count >= max_peer_rep_load".
While these errors don't appear to have any actual impact on scheduled searches or anything else, I would like to get to the bottom of what is causing them. They don't seem to be tied to any particular node, search, or user.
See the end of the post for the relevant error messages.
There are four nodes in the multi-site search head cluster (two in each site).
The [shclustering] stanza of server.conf has replication_factor=2 configured
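For reference, the relevant part of server.conf on each member looks roughly like this, trimmed down to the replication-related settings (a sketch rather than my exact file):

    [shclustering]
    replication_factor = 2
    # max_peer_rep_load is not set explicitly, so it uses the default of 5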
Where I suspect the problem is occurring is that the search heads are not honouring the replication_factor setting: when I run "splunk list shcluster-members", the replication_count values vary between 3 and 5. I would expect to see 2 here, wouldn't I?
If the default max_peer_rep_load = 5 and the replication_count on at least one of those search heads is showing 5 when I check, then I assume this is what is causing the excess replication?
When I run "splunk list shcluster-config" against all the nodes, they correctly show max_peer_rep_load = 5 and replication_factor = 2.
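For anyone wanting to reproduce the checks, these are essentially the commands I'm running on each member (the grep filters are just my shorthand; exact output field names may vary by version):

    # replication_count per member, as reported by the captain
    splunk list shcluster-members | grep -Ei "label|replication_count"

    # effective cluster settings on this node
    splunk list shcluster-config | grep -Ei "replication_factor|max_peer_rep_load"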
Has anyone seen this in a search head cluster before? I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment from SplunkIT, but I want to verify the root cause before I go tweaking configs.
Thanks in advance for anyone's insight into what may be causing it and how to further troubleshoot / resolve.
10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'
10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"
10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=<SHC_MASTER>:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="<response>\n <messages>\n <msg type="ERROR">failed on handle async replicate request sid=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="<SHC_SRC_PEER>", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler__<USERNAME>_aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="<SHC_TARGET_PEER>", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'</msg>\n </messages>\n</response>\n"
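For anyone wanting to see how widespread these errors are on their own cluster, a search over the internal logs along these lines should surface them (a rough sketch; adjust the time range and grouping as needed):

    index=_internal sourcetype=splunkd ERROR SHCMasterArtifactHandler "max_peer_rep_load"
    | stats count BY host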
I guess no one else is experiencing this? I suspect that a complete destructive resync of the entire cluster is in order.
Did you ever find a solution to this?
I am experiencing the same problem in my environment. It seems to be connected to searches that use a "nested setup", in the following way (a rough sketch follows the list):
1. A scheduled summary-indexing search stores events in a new index.
2. Statistics from that index are saved in reports.
3. The reports are used as the foundation for new searches that populate the panels in dashboards (through | loadjob).
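Roughly, the pattern looks like this; all index, search, and owner names below are made-up placeholders, not our real objects:

    1. Scheduled summary-indexing search, writing into a dedicated index:
       index=web sourcetype=access_combined | stats count BY status | collect index=summary_web

    2. Scheduled report over that summary index (saved as "web_status_report"):
       index=summary_web | stats sum(count) AS total BY status

    3. Dashboard panel reusing the report's latest results instead of re-running the search:
       | loadjob savedsearch="admin:search:web_status_report"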
I encountered the same problem in my environment. We use scheduled searches plus "| loadjob" in our dashboards for access control, to avoid granting users direct index access.
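As a concrete illustration of that pattern (names here are hypothetical), one of our dashboard panels boils down to Simple XML like this, so the viewing user only needs the dashboard and the saved search's results, not access to the underlying index:

    <dashboard>
      <label>Access-controlled summary</label>
      <row>
        <panel>
          <table>
            <search>
              <query>| loadjob savedsearch="svc_reporting:search:acl_summary_report"</query>
            </search>
          </table>
        </panel>
      </row>
    </dashboard>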
We asked Splunk support for help; they suggested increasing max_peer_rep_load, but they couldn't tell us what value to increase it to, so we have to work that out by trial and error ourselves.
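For anyone else going down the same route, the change itself is just the [shclustering] stanza in server.conf on the cluster members, something like the below; the value 10 is only the first number we are trialling, not a recommendation, and we restart splunkd on each member after changing it:

    [shclustering]
    # default is 5; raised because async replication requests were being
    # rejected with "active replication count >= max_peer_rep_load"
    max_peer_rep_load = 10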