Hello,
I am running Splunk Enterprise 9.0.2 on a multisite indexer cluster.
On the Cluster Master, under Settings >> Indexer Clustering, I started a searchable rolling restart of the indexers (no "Force" flag, no "Site Order" flag), and some of my indexers got stuck in the Restarting status. This has never happened before.
Below are some logs. As you can see, the first indexer (Site 2 - IDX03) restarted automatically, but the second indexer (Site 2 - IDX02) got stuck.
After some time I restarted it manually from the CLI. The same happened with the third indexer (Site 2 - IDX01); the remaining ones restarted without the issue.
Site 2 - IDX03
04-03-2023 14:44:15.266 +0200 INFO CMSlave [2862 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1
Site 2 - IDX02 - Stuck
04-03-2023 14:52:28.612 +0200 INFO CMSlave [29466 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690526348.612353 searchable_flag=1
Site 2 - IDX01 - Stuck
04-03-2023 15:46:33.294 +0200 INFO CMSlave [40062 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690529593.294619 searchable_flag=1
Site 1 - IDX01
04-03-2023 16:14:08.911 +0200 INFO CMSlave [19756 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531248.911129 searchable_flag=1
Site 1 - IDX03
04-03-2023 16:17:37.570 +0200 INFO CMSlave [4829 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531457.570617 searchable_flag=1
Site 1 - IDX02
04-03-2023 16:22:32.841 +0200 INFO CMSlave [30733 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531752.841733 searchable_flag=1
What looks strange to me are these values in the logs:
- restartTimeout=10000000
- threshold=1690525855
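For what it's worth, the two numbers seem to be related: subtracting restartTimeout from threshold gives back the time of the log line itself, so threshold looks like "now + restartTimeout" expressed as a Unix epoch. This is only my inference from the numbers, not something I found documented. A small Python check, assuming restartTimeout is in seconds and threshold is a Unix timestamp (values copied from the Site 2 - IDX03 line):

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "Site 2 - IDX03" log line.
# The log prefix is MM-DD-YYYY with a +0200 offset.
log_time = datetime(2023, 4, 3, 14, 44, 15, 266000,
                    tzinfo=timezone(timedelta(hours=2)))
restart_timeout = 10_000_000      # restartTimeout; ASSUMED to be seconds
threshold = 1690525855.266419     # threshold; ASSUMED to be a Unix epoch


def implied_start(threshold_epoch: float, timeout_secs: int) -> datetime:
    """Return the instant at which the timeout window would have started."""
    return datetime.fromtimestamp(threshold_epoch - timeout_secs,
                                  tz=timezone.utc)


start = implied_start(threshold, restart_timeout)

# Difference between the implied start and the log timestamp, in seconds.
print(abs((start - log_time).total_seconds()))

# The timeout expressed in days, to show how far it is from 60 seconds.
print(round(restart_timeout / 86400, 1))
```

If the assumptions hold, the first value is well under a millisecond, and the timeout works out to roughly 115.7 days instead of the 60 seconds I have configured.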
I checked the documentation here:
https://docs.splunk.com/Documentation/Splunk/9.0.2/Indexer/Userollingrestart
I have the default values on my Cluster Master:
server.conf
[clustering]
heartbeat_timeout = 60
restart_timeout = 60
decommission_search_jobs_wait_secs = 180
limits.conf
[search]
search_retry = 0
Do you know what could have caused the indexers to get stuck, and why the logs show such high restartTimeout values, which do not match my configuration?
Thanks a lot,
Edoardo