-Splunk is restarting almost twice every 60 seconds.
-Prior to every restart we see the messages below in splunkd.log:
03-12-2019 15:54:57.504 +0800 INFO SHCMaster - event=SHPMaster::SHPMaster Long running job seconds = 600
03-12-2019 15:54:57.505 +0800 INFO ServerRoles - Declared role=shc_member.
03-12-2019 15:54:57.505 +0800 INFO SHClusterMgr - makeOrChangeSlave - master_shp = ?
03-12-2019 15:54:57.505 +0800 INFO SHClusterMgr - Initializing node as pool member
03-12-2019 15:54:58.714 +0800 INFO loader - Downloaded new baseline configuration; restarting ...
The last message ("Downloaded new baseline configuration; restarting") suggests the member keeps pulling a fresh configuration baseline and restarting to apply it. To fix this, we tried the following (the full command sequence is sketched after the list):
-Back up the KV store: splunk backup kvstore
-Stop the member: splunk stop
-Clean the local KV store: splunk clean kvstore --local
-Clean the member's raft folder: splunk clean raft
-Restart the member. This triggers the initial synchronization from the other KV store members.
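For reference, the consolidated sequence we ran on the affected member looked roughly like this (a sketch assuming a default /opt/splunk install; adjust the path for your environment):

    # back up the KV store before touching anything
    /opt/splunk/bin/splunk backup kvstore
    # stop the member, then wipe the local KV store data and raft state
    /opt/splunk/bin/splunk stop
    /opt/splunk/bin/splunk clean kvstore --local
    /opt/splunk/bin/splunk clean raft
    # start the member again; it resynchronizes the KV store from the other members
    /opt/splunk/bin/splunk start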
Splunk started, but we then saw new errors:
1. The web interface could not be started.
2. "Couldn't complete HTTP request: Connection reset by peer" appeared whenever we ran any splunk CLI command.
Despite these errors, we ran the usual checks:
-Run splunk show kvstore-status to verify KV store synchronization
-Run splunk show shcluster-status to verify the cluster status
In our case, the Splunk instance that was restarting frequently was a search head cluster member.
We removed the member from the search head cluster and noticed that Splunk then started without issues and remained stable.
As soon as we added the node back to the cluster, that member started restarting its Splunk instance again.
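For completeness, removing and re-adding the member was done with the standard SHC member commands, roughly as sketched below (the management URI is a placeholder and this assumes the default management port 8089):

    # run on the member that is being removed from the cluster
    /opt/splunk/bin/splunk remove shcluster-member
    # run on the same member to rejoin, pointing at any existing cluster member
    /opt/splunk/bin/splunk add shcluster-member -current_member_uri https://existing-member.example.com:8089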
After removing and re-adding the search head member, we noticed the following error in the logs:
splunkd.log:03-21-2019 12:52:21.488 +0800 ERROR DistributedBundleReplicationManager - HTTP response code 409 (HTTP/1.1 409 Conflict). Checksum mismatch: received copy of bundle="/opt/splunk/var/run/searchpeers/z.z.z.z.com-1553143774.bundle" has transferred_checksum=13796498114204145188 instead of checksum=12534329482337580946
Because the Splunk service was restarting so frequently, this log entry had already rolled out of splunkd.log; only after removing and re-adding the search head member to the cluster did we catch the error above and identify the root cause.
To fix the issue temporarily, we updated server.conf on the affected member (see the stanza sketched after this list):
1. In server.conf, under the [shclustering] stanza, we set conf_deploy_fetch_mode = none
2. Restarted Splunk: the server came up fine.
3. We were then able to add the server back to the cluster successfully.
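The change itself is a single line; roughly, in $SPLUNK_HOME/etc/system/local/server.conf on the member (a sketch of our temporary workaround, not a recommended permanent setting):

    [shclustering]
    # temporarily stop this member from fetching the configuration bundle from the deployer
    conf_deploy_fetch_mode = none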
The long-term fix was to repair the bundle that was causing the checksum mismatch.
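Once the offending bundle is fixed on the deployer, it can be pushed out again in the usual way and the temporary setting reverted (a sketch; the target URI and credentials are placeholders, and auto is, to our knowledge, the default fetch mode for members):

    # run on the deployer, pointing at any cluster member
    /opt/splunk/bin/splunk apply shcluster-bundle -target https://any-member.example.com:8089 -auth admin:changeme
    # then set conf_deploy_fetch_mode back to its default (auto) on the member and restart it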