Using Splunk 7.3.3 on a multi-site indexer cluster, I initiated a rolling restart from the cluster master and the first indexer began to restart. It then showed a status of batch adding, and after that the Indexer Clustering: Master Node page showed that the indexer failed to restart:
[Mon Feb 2 12:47:52 2020] Failed to restart peer=<GUID> peer_name=<hostname>. Moving to failed peer group and continuing.
[Mon Feb 2 12:47:52 2020] Failing peer=<GUID> peer_name=<hostname> timed out while trying to restart.
I pinged the indexer from the CM and it responded fine. Connectivity was not an issue before the rolling restart, and the network still appears to be working fine.
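In case it helps anyone checking the same thing, peer status can also be confirmed from the CM itself rather than just with ping. A rough sketch (the /opt/splunk path is just my install location; adjust to your $SPLUNK_HOME):

# Run on the cluster master; lists each peer with its current status (Up, Restarting, etc.)
/opt/splunk/bin/splunk show cluster-status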
Hi, you are probably looking at the restart timeout setting on the CM's server.conf:
[clustering]
restart_timeout = time_in_sec
# Default is 60. It's usually a good idea to increase this (to keep the cluster from
# going into fixup mode), but keep it close to the time an indexer actually takes to
# restart. Something like 3600 effectively avoids the timeout firing, but if an indexer
# crashes in the middle of its restart, it will take that much longer to detect.
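If you want to confirm the value the CM is actually running with (and which file it comes from), btool is handy. A quick sketch, assuming a default /opt/splunk install path:

# On the cluster master: show the effective [clustering] settings and their source files
/opt/splunk/bin/splunk btool server list clustering --debug | grep -i restart_timeout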
From the error, and assuming the indexer did eventually come back up without manual intervention, my guess is that its restart took longer than the restart_timeout defined in the cluster master's server.conf. By default this is 60 seconds, and I have seen indexers take much longer than that to restart.
Can you see from splunkd.log on the indexer how long the restart actually took? If it's longer than 60 seconds, then you might want to extend your restart_timeout (https://docs.splunk.com/Documentation/Splunk/7.3.3/Indexer/Userollingrestart#Handle_slow_restarts)
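One rough way to estimate the restart duration from splunkd.log is to compare the last timestamp logged before shutdown with the first one after startup. The exact log strings differ between versions, so treat these greps as a sketch rather than exact matches (path assumes a default /opt/splunk install):

# On the indexer that failed: last shutdown-related line vs. most recent startup-related line
grep -i "shutting down" /opt/splunk/var/log/splunk/splunkd.log | tail -1
grep -i "starting" /opt/splunk/var/log/splunk/splunkd.log | tail -1
# The gap between those two timestamps is roughly how long the restart took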
Most indexers were taking 15-20 mins. I will try adjusting the restart_timeout value, but this is the first time I've seen these errors, and I have restarted this cluster many times before with each restart taking 15-20 mins just like always. That's what prompted me to ask about this issue.
So this setting needs to be changed on the CM's server.conf, not the indexers themselves?
I will try adjusting this. Each idx takes on average 15-20 mins, and my current timeout setting is 15 mins, so maybe I should just expand it to 30 mins to be safe?
Yes, it is a CM setting. 30 min (1800 s) seems appropriate for your environment.
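For reference, the change on the CM would look something like this (assuming you keep the clustering stanza in system/local; the exact file location depends on how your CM is configured):

# On the cluster master, e.g. $SPLUNK_HOME/etc/system/local/server.conf
[clustering]
restart_timeout = 1800

I believe there is also a CLI equivalent along the lines of splunk edit cluster-config -restart_timeout 1800, but double-check that flag against the 7.3 docs before relying on it. Either way, the value lives on the CM, not on the indexers.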
Just finished a rolling restart and no errors anymore after increasing the timeout to 30mins. Thank you both for the assistance!