Hi,
I'm trying to upgrade Splunk from 8.0.9 to 8.2.2. According to the docs, the upgrade starts with the cluster master. After upgrading the cluster master and disabling maintenance mode, all the indexers are stuck in the "batchadding" status.
Looking at the logs from one indexer, it goes through a cycle of:
event=addPeer Batch=1/9
...success...
event=addPeer Batch=2/9
...success...
...
event=addPeer Batch=9/9
ERROR Read Timeout...
WARN Master is down! Make sure pass4SymmKey is matching if master is running...
WARN Failed to register with cluster master...
Master is back up!
Rinse and repeat. So basically it talks to the cluster master fine for a while, then gets a timeout and starts over.
Any idea what's going on?
I did check the pass4SymmKey; it is the same everywhere and hasn't changed.
Cheers,
Gabriel.
Turns out the solution is: do not put the cluster master in maintenance mode before upgrading it.
I'm pretty sure I saw in the docs during a previous upgrade that if maintenance mode wasn't on whilst upgrading the cluster master it could cause problems...
Every upgrade is different I guess!
I had the same issue in our test environment upgrading from 8.0.5 to 8.2.2.1.
One of the 3 indexers in the cluster had this message in splunkd.log:
10-22-2021 12:42:49.859 +0200 WARN CMSlave [2763 indexerPipe] - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json manager=xxxxxxx:yyyy rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=xxx.xxx.xxx.xxx mgmtport=yyyy (reason: This peer re-added a clustered bucket as standalone and is rejected from joining the cluster). [ event=addPeer status=retrying AddPeerRequest: { _indexVec="iiiiii:event" batch_serialno=10 batch_size=14 guid=AAAAAAAA-1111-1111-AAAA-AAAAAAAAAAAA server_name=aaaaaaa status=Up } Batch 10/14 ].
The effect was being stuck in "batchadding" in the Monitoring Console. In our case the root cause was that during the upgrade, with the Cluster Master in maintenance mode, the affected indexer had an outage due to a storage-level issue, so I am not completely sure about the proposed solution of not putting the cluster in maintenance mode during the upgrade...
To solve it I followed the steps proposed here:
Be aware that not all the buckets have to be renamed, just the ones that are replicated (for instance, in our environment metrics and some other Splunk internal indexes are not).
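To find the buckets that need renaming, here is a rough sketch of what I used. It assumes the usual naming convention: clustered bucket directories are named db_&lt;newestTime&gt;_&lt;oldestTime&gt;_&lt;localId&gt;_&lt;GUID&gt; while standalone ones stop at &lt;localId&gt;, so counting the underscore-separated fields tells them apart. The index path is just an example; adjust it for your deployment and double-check the results before renaming anything.

```shell
#!/bin/sh
# Sketch: list bucket directories whose names lack the cluster GUID suffix.
# Clustered:  db_<newestTime>_<oldestTime>_<localId>_<GUID>  (5 fields)
# Standalone: db_<newestTime>_<oldestTime>_<localId>         (4 fields)
find_standalone_buckets() {
  dir="$1"
  for b in "$dir"/db_* "$dir"/rb_*; do
    [ -d "$b" ] || continue
    name=$(basename "$b")
    # Count underscore-separated fields in the directory name.
    fields=$(printf '%s' "$name" | awk -F_ '{print NF}')
    if [ "$fields" -eq 4 ]; then
      echo "$name"
    fi
  done
}

# Example: check one index's hot/warm path (path is an assumption, adjust it)
find_standalone_buckets /opt/splunk/var/lib/splunk/defaultdb/db
```

The renamed directory should get the peer's GUID appended (the one shown by `splunk show cluster-status` or in etc/instance.cfg), but again, verify against the linked steps first.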