Does anybody know what the following error means and how to resolve it? I traced it back to a saved search via the scheduler log and verified that the search's expiration is 15 minutes, so that should be enough time to replicate the data.
04-28-2015 10:11:37.454 -0500 ERROR Fixup - failed to kick off replication from src=FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 tgt=6458DFA1-C04B-43D2-BCE6-C2D3B5AB74C3 aid=scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15 err='src FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 cannot be valid source for scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15'
Thanks in advance.
There are two known bugs for errors like "ERROR Fixup - failed to kick off replication…".
The bug numbers are:
SPL-94508: Search Head Clustering: Captain's splunkd.log spam ERROR Fixup - failed to kick off replication from src= tgt= aid= err=...
SPL-98488: SHC - Peers incorrectly report 4 billion replications in high latency environments
i) The Fixup error occurs when the number of outstanding replications on a peer is higher than the configured "max_peer_rep_load = 5".
For example, if there are 100 scheduled searches running at the 30-minute mark, they could all finish and try to replicate at the same time, so the Fixup code may throw that error.
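If bursts of simultaneous replications are the trigger, the ceiling can be raised on the SHC members. A minimal server.conf sketch, assuming the setting referenced above is max_peer_rep_load under the [shclustering] stanza (default 5); verify the exact name against your version's server.conf spec before applying:

```ini
[shclustering]
# Allow this member to take part in more concurrent replications
# before Fixup starts rejecting new ones (default is 5).
max_peer_rep_load = 10
```

Changes to server.conf generally require a splunkd restart to take effect.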
Just to clarify the impact of this bug: if users hit any of the peers looking for the job, it will be proxied instead of read locally. That does not block access to artifacts, but it disables replication, which is not good, so Splunk is working to get bug SPL-98488 fixed at the earliest.
Also, note that the error messages reference both "aid" and "sid". An aid (short for artifact ID) is what a sid becomes once it is managed for replication by the captain. sids that are not replicated, such as ad-hoc searches, are therefore not artifacts.
ii) To confirm whether you are hitting bug SPL-98488, go to any node in the SHC, run "splunk list shcluster-members", and look at the value of replication_count for the problematic source. If the replication_count is very high, it is the same issue. (Command: ./splunk list shcluster-members | grep replication_count)
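To illustrate step ii), here is a small sketch that scans the output of "splunk list shcluster-members" for abnormal replication_count values. The peer labels and counts below are made-up sample output, and the 1000 threshold is an arbitrary illustration (the bug typically shows up as counts near 4 billion):

```shell
# Hypothetical captured output of: ./splunk list shcluster-members
sample_output='
 label:sh1
 replication_count:2
 label:sh2
 replication_count:4294967295
'

# Flag any member whose replication_count looks impossible.
echo "$sample_output" | awk -F: '
  /label:/             { peer = $2 }
  /replication_count:/ { if ($2 + 0 > 1000)
                           print peer " has suspicious replication_count=" $2 }'
```
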
iii) Can you hit https://<captain>:<management_port>/services/shcluster/captain/replications on the captain node with admin credentials and see what the output is?
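For step iii), the REST call would look roughly like the following. The hostname, port, and user are placeholders, and -k is only appropriate for instances with self-signed certificates; the command is echoed rather than executed here since it needs a live captain:

```shell
captain="sh-captain.example.com"   # hypothetical captain hostname
port=8089                          # default splunkd management port
url="https://${captain}:${port}/services/shcluster/captain/replications"

# Print the command to run against the captain (append output_mode=json
# if you prefer JSON over the default Atom XML response).
echo curl -k -u admin "\"${url}?output_mode=json\""
```
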
Other things to verify:
When you get a Fixup error for a sid, does that sid eventually replicate to the configured replica count in your environment? In that case the errors are just transient, annoying spam, which can be corrected in a maintenance release. You can validate this by capturing the latest error (tail splunkd.log) and then looking for the same artifact a few minutes later in https://<captain>:<management_port>/services/shcluster/captain/artifacts
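To make that check concrete, you can pull the aid out of the latest Fixup error and then search for it under the captain's artifacts endpoint. A sketch using the error line from the original post:

```shell
# One Fixup error line from splunkd.log (taken from the post above).
line="04-28-2015 10:11:37.454 -0500 ERROR Fixup - failed to kick off replication from src=FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 tgt=6458DFA1-C04B-43D2-BCE6-C2D3B5AB74C3 aid=scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15 err='...'"

# Extract the artifact id: everything after aid= up to the next space.
aid=$(echo "$line" | sed -n 's/.*aid=\([^ ]*\).*/\1/p')
echo "$aid"
```

If the same aid shows up under /services/shcluster/captain/artifacts a few minutes later, the error was transient.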
As these sids/artifacts are being generated, is there any interactive access happening to them?
The Splunk change log says this was fixed in Splunk 6.2.4, but I'm running Splunk 6.2.6 and can still see the same error messages. Does anyone know a workaround?
Is this issue resolved? We are on Splunk 6.2.3 (build 264711).
We are seeing continuous "Fixup - failed to kick off replication" errors on our search head captain.
08-26-2015 10:52:59.037 -0500 ERROR Fixup - failed to kick off replication from src=E865F266-125C-465F-BFF7-10773D2D3536 tgt=EC6D891C-FF0D-47E9-9D83-864D13A58B04 aid=scheduler__jmonette__coreapi__RMD51ae1d00f2e3ed31a_at_1440604320_21214_E865F266-125C-465F-BFF7-10773D2D3536 err='src E865F266-125C-465F-BFF7-10773D2D3536 cannot be valid source for scheduler__jmonette__coreapi__RMD51ae1d00f2e3ed31a_at_1440604320_21214_E865F266-125C-465F-BFF7-10773D2D3536'