8.0.9 indexers all stuck "batchadding" after upgrading cluster master to 8.2.2

gabriel_vasseur
Contributor

Hi,

I'm trying to upgrade Splunk from 8.0.9 to 8.2.2. According to the docs, the upgrade starts with the cluster master. After upgrading the cluster master and disabling maintenance mode, all the indexers are stuck in the "batchadding" status.

Looking at the logs from one indexer, it goes through a cycle of:

event=addPeer Batch=1/9
...success...
event=addPeer Batch=2/9
...success...
...
event=addPeer Batch=9/9
ERROR Read Timeout...
WARN Master is down! Make sure pass4SymmKey is matching if master is running...
WARN Failed to register with cluster master...
Master is back up!

Rinse and repeat. So basically each indexer talks OK to the cluster master for a while, then gets a timeout and starts over.
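
The loop described above can be spotted mechanically rather than by eye. Here is a minimal Python sketch, assuming the log lines contain `event=addPeer` together with a `Batch=N/M` (or `Batch N/M`) marker as in the excerpts in this thread. The function name `count_addpeer_restarts` is hypothetical and is not part of Splunk.

```python
import re

# Matches "Batch=3/9" as well as "Batch 10/14" (assumed formats, taken from
# the log excerpts quoted in this thread).
BATCH_RE = re.compile(r"\bBatch[= ](\d+)/(\d+)")


def count_addpeer_restarts(lines):
    """Count how often the addPeer batch sequence starts over.

    A restart is counted whenever the batch number of an addPeer event
    is lower than the batch number of the previous addPeer event.
    """
    restarts = 0
    previous = None
    for line in lines:
        if "event=addPeer" not in line:
            continue
        match = BATCH_RE.search(line)
        if match is None:
            continue
        current = int(match.group(1))
        if previous is not None and current < previous:
            restarts += 1
        previous = current
    return restarts
```

Feeding it the lines of an indexer's splunkd.log gives a rough count of how many times registration started over. A count that keeps growing suggests the peer never gets past its last batch.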

Any idea what's going on?

I did check the pass4SymmKey values: they are the same everywhere and haven't changed.

Cheers,
Gabriel.

1 Solution

gabriel_vasseur
Contributor

Turns out the solution is: do not put the cluster master in maintenance mode before upgrading it.

I'm pretty sure I saw in the docs on a previous upgrade that if the maintenance mode wasn't on whilst upgrading the cluster master it could cause problems...

Every upgrade is different I guess!


edoardo_vicendo
Contributor

I had the same issue in our test environment when upgrading from 8.0.5 to 8.2.2.1.

One of the 3 indexers in the cluster had this message in splunkd.log:

10-22-2021 12:42:49.859 +0200 WARN  CMSlave [2763 indexerPipe] - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json manager=xxxxxxx:yyyy rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=yyyy (reason: This peer re-added a clustered bucket as standalone and is rejected from joining the cluster). [ event=addPeer status=retrying AddPeerRequest: { _indexVec="iiiiii:event" batch_serialno=10 batch_size=14 guid=AAAAAAAA-1111-1111-AAAA-AAAAAAAAAAAA server_name=aaaaaaa status=Up } Batch 10/14 ].

The effect was that the indexer was stuck in "batchadding" in the Monitoring Console. The root cause was that, during the upgrade and while the cluster master was in maintenance mode, the affected indexer had an outage due to a storage-level issue. I am therefore not completely convinced by the proposed solution, which suggests not putting the cluster in maintenance mode during the upgrade...
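
The useful parts of a warning like the one above are buried in a long line. Below is a minimal Python sketch that pulls out the HTTP status code, the rejection reason and the batch counter. It assumes the field layout shown in that log line (`actual_response_code=...`, `(reason: ...)`, `Batch N/M`). The function name `summarize_register_failure` is hypothetical and is not a Splunk API.

```python
import re


def summarize_register_failure(line):
    """Extract status code, rejection reason and batch counter from a
    'Failed to register with cluster master' warning line.

    Fields that are not present in the line are returned as None.
    """
    code = re.search(r"actual_response_code=(\d+)", line)
    # The parenthesised reason is the one reported by the cluster master.
    reason = re.search(r"\(reason: ([^)]*)\)", line)
    batch = re.search(r"\bBatch (\d+)/(\d+)", line)
    return {
        "response_code": int(code.group(1)) if code else None,
        "reason": reason.group(1) if reason else None,
        "batch": (int(batch.group(1)), int(batch.group(2))) if batch else None,
    }
```

Applied to the line quoted above, this surfaces the "re-added a clustered bucket as standalone" reason directly. That reason points at a bucket problem on the peer, which is a different failure from the read timeout in the original post.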

To solve it I followed the steps proposed here:

https://community.splunk.com/t5/Getting-Data-In/Indexer-fails-to-join-back-cluster-due-to-standalone...

Be aware that not all the buckets have to be renamed, just the ones that are replicated (for instance, in our environment, metrics and other specific Splunk indexes are not).
