
Why is the Splunk service restarting twice every minute on a search head cluster member?

sdubey_splunk
Splunk Employee

- Splunk is restarting roughly twice every 60 seconds.
- Prior to every restart I see the messages below in splunkd.log:
03-12-2019 15:54:57.504 +0800 INFO SHCMaster - event=SHPMaster::SHPMaster Long running job seconds = 600
03-12-2019 15:54:57.505 +0800 INFO ServerRoles - Declared role=shc_member.
03-12-2019 15:54:57.505 +0800 INFO SHClusterMgr - makeOrChangeSlave - master_shp = ?
03-12-2019 15:54:57.505 +0800 INFO SHClusterMgr - Initializing node as pool member
03-12-2019 15:54:58.714 +0800 INFO loader - Downloaded new baseline configuration; restarting ...

To fix this, we tried the following (sketched as commands below):
- Back up the KV store: splunk backup kvstore
- Stop the member: splunk stop
- Clean the local KV store: splunk clean kvstore --local
- Clean the member's raft folder: splunk clean raft
- Restart the member. This triggers the initial synchronization from the other KV store members.
- Run splunk show kvstore-status to verify synchronization.
- Run splunk show shcluster-status to verify cluster status.
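
For reference, here is the same reset-and-verify sequence as CLI commands. This is only a sketch of the steps above, run on the affected member, and it assumes Splunk is installed under /opt/splunk (adjust the path for your environment):

    /opt/splunk/bin/splunk backup kvstore          # back up the KV store first
    /opt/splunk/bin/splunk stop                    # stop the member
    /opt/splunk/bin/splunk clean kvstore --local   # wipe the local KV store data
    /opt/splunk/bin/splunk clean raft              # wipe the member's raft folder
    /opt/splunk/bin/splunk start                   # restart; triggers initial sync from other KV store members
    /opt/splunk/bin/splunk show kvstore-status     # verify KV store synchronization
    /opt/splunk/bin/splunk show shcluster-status   # verify search head cluster status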
Splunk started, but we began seeing new errors/issues:
1. The web interface cannot be started.
2. "Couldn't complete HTTP request: Connection reset by peer" appears whenever I run any splunk command.

Solution

sdubey_splunk
Splunk Employee

In our case, the Splunk instance that was restarting frequently was a search head cluster member.

We removed the member from the search head cluster and noticed that Splunk started with no issues and was stable.

As soon as we added the node back to the cluster, the member began restarting Splunk again.

After removing and re-adding the search head member, we noticed the error below in the logs:

splunkd.log:03-21-2019 12:52:21.488 +0800 ERROR DistributedBundleReplicationManager - HTTP response code 409 (HTTP/1.1 409 Conflict). Checksum mismatch: received copy of bundle="/opt/splunk/var/run/searchpeers/z.z.z.z.com-1553143774.bundle" has transferred_checksum=13796498114204145188 instead of checksum=12534329482337580946

Because the Splunk service was restarting so frequently, the log containing this error had rolled over earlier; only after removing the search head member and adding it back to the cluster did we catch the error above and find the root cause.
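
If you run into this, it may help to search the current and rolled splunkd.log files directly for the bundle-replication error. A minimal example, assuming the default log location under /opt/splunk:

    grep DistributedBundleReplicationManager /opt/splunk/var/log/splunk/splunkd.log*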

To fix the issue temporarily, we updated server.conf (the stanza is shown after this list):
1. In server.conf, under the [shclustering] stanza, we set conf_deploy_fetch_mode = none.
2. Restarted Splunk; the server came up fine.
3. We were able to add the server back to the cluster successfully.
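
For reference, this is the setting as we applied it. The file location is an assumption (typically $SPLUNK_HOME/etc/system/local/server.conf on the affected member):

    # server.conf on the affected search head cluster member
    # (file location is an assumption; typically $SPLUNK_HOME/etc/system/local/server.conf)
    [shclustering]
    conf_deploy_fetch_mode = none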

Long term, we fixed the bundle that was causing this issue.
