Splunk Enterprise

Cluster Master - Indexer rolling restart: some peers (indexers) stay stuck in status Restarting

edoardo_vicendo
Contributor

Hello,

I am running Splunk Enterprise 9.0.2 on a Multi Site Indexer Cluster.

In the Cluster Master, under Settings >> Indexer Clustering, I started a searchable indexer rolling restart (no "Force" flag, no "Site Order" flag), and some of my indexers got stuck with status Restarting. This has never happened before.

Below are some logs. As you can see, the first indexer (Site 2 - IDX03) restarted automatically, then the second indexer (Site 2 - IDX02) got stuck.

After some time I restarted it manually from the CLI. The same happened to the third indexer (Site 2 - IDX01); for the remaining ones the issue did not occur.

Site 2 - IDX03
04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1

Site 2 - IDX02 - Stuck
04-03-2023 14:52:28.612 +0200 INFO  CMSlave [29466 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690526348.612353 searchable_flag=1

Site 2 - IDX01 - Stuck
04-03-2023 15:46:33.294 +0200 INFO  CMSlave [40062 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690529593.294619 searchable_flag=1

Site 1 - IDX01
04-03-2023 16:14:08.911 +0200 INFO  CMSlave [19756 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531248.911129 searchable_flag=1

Site 1 - IDX03
04-03-2023 16:17:37.570 +0200 INFO  CMSlave [4829 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531457.570617 searchable_flag=1

Site 1 - IDX02
04-03-2023 16:22:32.841 +0200 INFO  CMSlave [30733 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531752.841733 searchable_flag=1

 

What looks strange to me are these values in the logs:

- restartTimeout=10000000

- threshold=1690525855
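
The two values do appear to be related: in each log line, threshold minus the log timestamp (converted to epoch seconds) equals the logged restartTimeout, so threshold looks like "time the instruction was logged + restartTimeout". Here is a minimal Python sketch that checks this against the first log line. The helper name check_threshold is mine, and the timestamp format string is an assumption based on how these lines look (month-day-year, which matches the +0200 summer-time offset).

```python
import re
from datetime import datetime

# First log line from Site 2 - IDX03, copied from the logs above.
LOG_LINE = (
    "04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - "
    "Cluster manager has instructed peer to restart, "
    "restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1"
)


def check_threshold(line):
    """Return (threshold - log time in epoch seconds, restartTimeout)."""
    # The first three whitespace-separated fields are date, time, UTC offset.
    stamp = " ".join(line.split()[:3])
    # Assumed format: month-day-year, fractional seconds, numeric offset.
    logged_at = datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S.%f %z").timestamp()
    # Collect the key=value pairs at the end of the message.
    fields = dict(re.findall(r"(\w+)=([\d.]+)", line))
    return float(fields["threshold"]) - logged_at, float(fields["restartTimeout"])


delta, timeout = check_threshold(LOG_LINE)
print(round(delta), round(timeout))  # prints: 10000000 10000000
```

So the threshold value itself is consistent with the logged timeout; the open question is only where 10000000 seconds (roughly 115 days) comes from.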

 

I checked the documentation here:

https://docs.splunk.com/Documentation/Splunk/9.0.2/Indexer/Userollingrestart

 

I have the default values on my Cluster Master:

server.conf

[clustering]
heartbeat_timeout = 60
restart_timeout = 60
decommission_search_jobs_wait_secs = 180

limits.conf

[search]
search_retry = 0

 

Do you know what could have caused the indexers to get stuck, and why the logs show such a high restartTimeout value, which does not match what I have in my configuration?

 

Thanks a lot,

Edoardo
