Deployment Architecture

Replication errors: artifact replication issues between the SHs

sylim_splunk
Splunk Employee

In an SHC on version 8.2.10, from time to time we see this type of ERROR message from SHCRepJob, as below:

- splunkd.log from an SHC member

05-24-2023 17:39:31.941 +0000 ERROR SHCRepJob [54418 SHPPushExecutorWorker-0] - failed job=SHPRepJob peer="<PEER1 FQDN>", guid="PEER1C47-1E44-48A0-A0F2-35DE6E449C65" aid=1684949135.77748_B2392C47-1E44-48A0-A0F2-35DE6E449C65, tgtPeer="<PEER2 FQDN>", tgtGuid="PEER2D44-E56B-4ABA-822A-4C40ACF1E484", tgtRP=<ReplicationPort>, useSSL=false tgt_hp=10.9.129.18:8089 tgt_guid=PEER2D44-E56B-4ABA-822A-4C40ACF1E484 err=uri=https://PEER1:8089/services/shcluster/member/artifacts/1684949135.77748_PEER1C47-1E44-48A0-A0F2-35DE..., error=500 - Failed to trigger replication (artifact='1684949135.77748_PEER1C47-1E44-48A0-A0F2-35DE6E449C65') (err='event=SHPSlave::replicateArtifactTo invalid status=alive to be a source for replication')

We used to have bundle replication issues, but searches appear to be running and completing as expected. Is this something to worry about, and why does it happen?

 

1 Solution

sylim_splunk
Splunk Employee
Here are my findings.
 
i) This has been happening for a while, even before the 8.2.10 upgrade - I checked logs from old diags, and it goes back at least as far as 8.1.x.
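To confirm how far back these errors go and how often each member hits them, a search over the internal logs like the following can help (a rough sketch; it assumes your _internal index retention still covers the period of interest):

index=_internal sourcetype=splunkd component=SHCRepJob log_level=ERROR "invalid status=alive to be a source for replication"
| timechart span=1d count by host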
 
ii) In an SHC environment, we recommend source IP stickiness on any load balancers, so that if you log in to SH1 and launch a search on SH1, you also get updates through SH1.
   However, it appears the search preview updates were requested from another search head, for example SH2, which would then ask SH1 to replicate the artifact with your search results. Because the search was still running, SH1 logged errors like the one below:
 
----- From SH1, IP addr 10.9.160.139, error for the proxied request for a search that is still running -----
05-24-2023 13:33:49.894 -0700 ERROR SHCSlaveArtifactHandler [222579 TcpChannelThread] - Failed to trigger replication (artifact='1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B') (err='event=SHPSlave::replicateArtifactTo invalid status=alive to be a source for replication')
-----
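Both sides of this exchange are recorded in the internal logs, so the SHCRepJob errors on the requesting member and the SHCSlaveArtifactHandler errors on the member running the search can be correlated by artifact ID. A sketch, assuming the artifact='...' text shown in the messages above is present in both:

index=_internal sourcetype=splunkd log_level=ERROR (component=SHCRepJob OR component=SHCSlaveArtifactHandler) "invalid status=alive"
| rex field=_raw "artifact='(?<artifact>[^']+)'"
| stats values(component) as components, values(host) as hosts, count by artifact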
 
iii) The logs prove this scenario:
 
 1. The search is an ad hoc search with SID "1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B", run by "admin", and it is found in the search artifact/dispatch directory:
 
$ ls -l $SPLUNK_HOME/var/run/splunk/dispatch/1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B/
-rwxr-xr-x 1 support support    242 May 24 20:33 args.txt
... <SNIP>
-rwxr-xr-x 1 support support 828809 May 24 20:34 search.log
 
 
 
2. From splunkd_ui_access.log: the user "admin" started the search with SID 1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B:
 
-- splunkd_ui_access.log --
10.6.248.0 - admin  [24/May/2023:13:33:05.139 -0700] "GET /en-US/splunkd/__raw/servicesNS/admin/search/search/jobs/1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B?output_mode=json&_=1684959918754 HTTP/1.1" 200 1147 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36" - 0dd2b80d72da3ed97e640b9101f0a698 5ms
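To see which members actually served the job-status polls for this SID, and with what HTTP status, something like the following works (a sketch only; the http_status field is extracted here with rex from the access-log format shown above):

index=_internal sourcetype=splunkd_ui_access "1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B"
| rex field=_raw "\" (?<http_status>\d{3}) "
| stats count by host, http_status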
 
3. The search string for SID 1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B:
-- search.log --
05-24-2023 13:33:05.302 INFO  SearchParser [223948 RunDispatch] - PARSING: search index=apps sourcetype=api <SNIP>...ion  as ApplicationName | search  name!=default  pname!=zz-default | stats count by AppName pname | table AppName pname
 
 
4. Two seconds after the search started, a replicate request came in from 10.9.129.34, which is SH2, but it immediately returned a 500 error because the search had not finished.
This proxied request for the search results kept coming in every second, and it always got a 500 error while the search was still incomplete.
 
 -- splunkd_access.log --
10.9.129.34 - splunk-system-user [24/May/2023:13:33:07.016 -0700] "POST /services/shcluster/member/artifacts/1684960384.101_13E3A0F-27AE-49C5-9FB2-23862EDB224B/replicate?output_mode=json HTTP/1.1" 500 231 "-" "Splunk/8.2.10 (Linux 3.10.0-1160.49.1.el7.x86_64; arch=x86_64)" - 2ms
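The repeated replicate attempts and their 500 responses can be counted from the same access logs. A sketch, with the source IP, artifact ID, and HTTP status extracted by rex from the line format shown above:

index=_internal sourcetype=splunkd_access "/services/shcluster/member/artifacts/" "/replicate"
| rex field=_raw "^(?<src_ip>\S+)"
| rex field=_raw "artifacts/(?<artifact>[^/]+)/replicate"
| rex field=_raw "\" (?<http_status>\d{3}) "
| stats count by artifact, src_ip, http_status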
 
The search results preview requests should hit the same search head where the search was launched, but they appear to be routed to a different search head.
 
Recommendation: check with your LB or DNS team whether the LB has source IP stickiness configured, and whether you use DNS round-robin across multiple LBs in front of the SHC under the same FQDN or URL, as that will send users to different LBs and therefore to different search heads.
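One way to verify the symptom from the Splunk side is to look for SIDs whose job-status requests land on more than one member, which would mean stickiness is not effective. A sketch, assuming the job endpoints follow the ".../search/jobs/<sid>" pattern shown in the splunkd_ui_access.log excerpt above:

index=_internal sourcetype=splunkd_ui_access "search/jobs/"
| rex field=_raw "jobs/(?<sid>[^/?\s\"]+)"
| stats dc(host) as members_hit, values(host) as members by sid
| where members_hit > 1

If this returns rows, requests for the same search are reaching multiple search heads, and the LB or DNS layer in front of the SHC is worth reviewing.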

