Using Splunk 7.3.3 on a multi-site indexer cluster, I initiated a rolling restart from the cluster master and the first indexer began to restart. It then showed a status of batch adding, and after that the Indexer Clustering: Master Node page showed that the indexer failed to restart:
[Mon Feb 2 12:47:52 2020] Failed to restart peer=<GUID> peer_name=<hostname>. Moving to failed peer group and continuing.
[Mon Feb 2 12:47:52 2020] Failing peer=<GUID> peer_name=<hostname> timed out while trying to restart.
I pinged the indexer from the CM and it responded fine. Connectivity was not an issue before the rolling restart, and the network still appears to be working fine.
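In case it helps anyone checking the same thing, peer status can also be confirmed from the CM itself rather than just with ping. A rough sketch (the /opt/splunk path is just my install location; adjust to your $SPLUNK_HOME):

# Run on the cluster master; lists each peer with its current status (Up, Restarting, etc.)
/opt/splunk/bin/splunk show cluster-status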
Hi, you are probably looking at the restart timeout setting on the CM's server.conf:
[clustering]
restart_timeout = time_in_sec
# Default is 60. It's usually a good idea to increase this (to keep the cluster from
# going into fixup mode), but keep it close to the time an indexer actually takes to
# restart. Something like 3600 effectively avoids the timeout firing, but if an indexer
# crashes in the middle of its restart, it will take that much longer to detect.
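If you want to confirm the value the CM is actually running with (and which file it comes from), btool is handy. A quick sketch, assuming a default /opt/splunk install path:

# On the cluster master: show the effective [clustering] settings and their source files
/opt/splunk/bin/splunk btool server list clustering --debug | grep -i restart_timeout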
From the error, and assuming the indexer did eventually come back up without manual intervention, my guess is that its restart took longer than the restart_timeout defined in the cluster master's server.conf. By default this is 60 seconds, and I have seen indexers take much longer than that to restart.
Can you see from splunkd.log on the indexer how long the restart actually took? If it's longer than 60 seconds, then you might want to extend your restart_timeout (https://docs.splunk.com/Documentation/Splunk/7.3.3/Indexer/Userollingrestart#Handle_slow_restarts)
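One rough way to estimate the restart duration from splunkd.log is to compare the last timestamp logged before shutdown with the first one after startup. The exact log strings differ between versions, so treat these greps as a sketch rather than exact matches (path assumes a default /opt/splunk install):

# On the indexer that failed: last shutdown-related line vs. most recent startup-related line
grep -i "shutting down" /opt/splunk/var/log/splunk/splunkd.log | tail -1
grep -i "starting" /opt/splunk/var/log/splunk/splunkd.log | tail -1
# The gap between those two timestamps is roughly how long the restart took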
Most indexers were taking 15-20 mins. I will try adjusting the restart_timeout value, but this is the first time I've seen these errors, and I have restarted this cluster many times before with each restart taking 15-20 mins just like always. That's what prompted me to ask about this issue.
So this setting needs to be changed on the CM's server.conf, not the indexers themselves?
I will try adjusting this. Each idx takes on average 15-20 mins, and my current timeout setting is 15 mins, so maybe I should just expand it to 30 mins to be safe?
Yes, it is a CM setting. 30 min (1800 s) seems appropriate for your environment.
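For reference, the change on the CM would look something like this (assuming you keep the clustering stanza in system/local; the exact file location depends on how your CM is configured):

# On the cluster master, e.g. $SPLUNK_HOME/etc/system/local/server.conf
[clustering]
restart_timeout = 1800

I believe there is also a CLI equivalent along the lines of splunk edit cluster-config -restart_timeout 1800, but double-check that flag against the 7.3 docs before relying on it. Either way, the value lives on the CM, not on the indexers.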
Just finished a rolling restart and no errors anymore after increasing the timeout to 30mins. Thank you both for the assistance!