Getting Data In

Rolling restart of Cluster puts peer in restart loop

rturk
Builder

Hi All,

After fresh installs of Splunk (Windows v5.0.4) I had (note the past tense) a fully functioning cluster that was happily replicating, and life was good.

After updating an app on the cluster master (removing extraneous text files from a directory) I kicked off the bundle deployment:

.\splunk.exe apply cluster-bundle

I then checked the status with the following command:

.\splunk.exe show cluster-bundle-status

Output:

Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
  ServerName: IDX-A
  Status: Up
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc

Guid: BC734690-BACE-41CC-812D-254085234EE5
  ServerName: IDX-B
  Status: Restarting
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc

All well and good, but when I checked again not long after:

Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
  ServerName: IDX-A
  Status: Restarting
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle:

Guid: BC734690-BACE-41CC-812D-254085234EE5
  ServerName: IDX-B
  Status: Up
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 1d6134c6cab9fd5a720516d8881a01a8

The impact of this is:

  • The Active Bundle for IDX-A is now blank
  • The app directories in /slave-apps are now empty
  • IDX-A is in a restart loop; and
  • The splunkd.log on IDX-A indicates that the process is repeatedly being told to gracefully shut down.

This is not the first time this has happened: this fresh install is the result of it happening previously, when I took the default "reinstall and hope for the best" path... dammit.

Any and all suggestions greatly appreciated!

RT

EDIT #1: 10 minutes later and it's still happening.

EDIT #2: splunkd.log on the cluster master has this over & over again:

...
CMMaster - event=handleInputsQuiesced guid=71F63992-BD86-4935-932E-24258A6A3CDD
ClusterMasterPeerHandler - Add peer info replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=IDX-A activeBundleId= status=Up type=Initial-Add baseGen=0
CMMaster - event=removeOldPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD hostport=IDX-A:8089 status=success
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=SE02SPL01LP activeBundleId= status=Up type=Initial-Add baseGen=0 bucket_count=0 
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Down to=Up reason="addPeer successful."
CMMaster - event=addPeer msg='Bundle mismatch; restarting peer. '
CMMaster - committing gen=121 numpeers=2
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD status=success initialized=1 npeers=2 basegen=121
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Up to=Restarting reason="restart peer"
CMBundleServer - event=streamingbundle status=success file=C:\Program Files\Splunk\var\run\splunk\cluster\remote-bundle\4a483d66a10ab4976b2d984c9361d040-1382573311.bundle totalBytesWritten=3317760 checksum=1d6134c6cab9fd5a720516d8881a01a8 Content-Length=3317760
ClusterSlaveControlHandler - Bundle validation success reported by [71F63992-BD86-4935-932E-24258A6A3CDD] successful for bundleid=1d6134c6cab9fd5a720516d8881a01a8
CMMaster - event=handleShutdown guid=71F63992-BD86-4935-932E-24258A6A3CDD status=Restarting
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD has started master-initiated restart
...
1 Solution

rturk
Builder

Found the cause and solution here: http://answers.splunk.com/answers/82275/why-is-my-windows-cluster-peer-node-continually-restarting

Essentially, the directory permissions on /slave-apps/ on the search peer had been lost (why?) and the directory was set to read-only. As per the link above, resetting the permissions allowed the Cluster Master to once again populate the directory with the required apps.
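In case it helps anyone else, a permissions reset along these lines should do it (paths assume a default Windows install under C:\Program Files\Splunk; adjust for your environment):

icacls "C:\Program Files\Splunk\etc\slave-apps" /reset /T /C
attrib -R "C:\Program Files\Splunk\etc\slave-apps\*" /S /D

The icacls /reset re-applies the inherited ACLs recursively, and attrib -R clears any read-only flags so the Cluster Master can repopulate the directory.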

