Why is our search head cluster scheduler failing following deployment or rolling restart?

duncangoff
Engager

We have a problem where the scheduler fails after a search head cluster (SHC) deployment, and it recovers only if we manually transfer captaincy afterwards. This is not an ideal solution, and we want to find the root cause.

Following last night's deployment, we saw the following sequence of events (mostly from the debug logs):

The SHC rolling restart begins. All peers are told to close down their searches in turn. The restarts complete normally with no errors.

Then the captain tells the peers to remove artifacts ("DEBUG SHCMaster - remove artifact aid=scheduler~"). Most removals work fine, but two fail with the following errors:

"DEBUG SHCMaster - event=SHPMaster::asyncReplicationArtifact sid=154~ status=failed msg=sid is not an artifact but a remote search job "
"DEBUG SHCMaster - event=SHPMaster::asyncReplicationArtifact aid=154~ status=failed msg="Could not find artifact or sid"

From then on, the scheduler keeps repeating these errors, and no scheduled searches, accelerations, alerts, etc. run until the captain is transferred.
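
To confirm that it is the same two IDs failing over and over, the failed events can be tallied from the captain's debug log. This is a minimal sketch, assuming the log lines look like the two quoted above. The regex and the `count_failures` helper are hypothetical and not part of Splunk, and the field layout is inferred from those two samples only.

```python
import re
from collections import Counter

# Pattern inferred from the two sample lines in this post (an assumption):
# the ID appears as either sid=... or aid=..., followed by status=failed.
FAILED = re.compile(
    r"event=SHPMaster::asyncReplicationArtifact\s+"
    r"(?:sid|aid)=(?P<id>\S+)\s+status=failed\s+msg=(?P<msg>.*)"
)


def count_failures(lines):
    """Count failed asyncReplicationArtifact events per (id, message) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            # Drop surrounding whitespace and optional double quotes.
            msg = match.group("msg").strip().strip('"').strip()
            counts[(match.group("id"), msg)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < path/to/captain/debug.log
    for (artifact_id, msg), n in count_failures(sys.stdin).most_common():
        print(f"{n:6d}  {artifact_id}  {msg}")
```

If a small set of IDs accounts for nearly all of the failures, that points at those specific searches rather than at the scheduler as a whole.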

I can't tell whether this is a symptom or the cause. My guess is that something went wrong with those two searches, but what? And how do we stop it from happening?

0 Karma

lakshman239
Influencer

It looks to me as though the captain election is not happening after the deployment/restart. Have you tried clearing the RAFT state? You would also need to make sure the KV store is healthy across all members. Also, check the monitoring console for any issues reported by the SH members. https://docs.splunk.com/Documentation/Splunk/7.2.3/DistSearch/Handleraftissues

0 Karma

duncangoff
Engager

The captain election happens fine with no issues, and the same goes for the KV store.

0 Karma