Splunk Enterprise

Cluster Master - Indexer rolling restart: some peers (indexers) stay stuck in status Restarting

edoardo_vicendo
Contributor

Hello,

I am running Splunk Enterprise 9.0.2 on a Multi Site Indexer Cluster.

On the Cluster Master, under Settings >> Indexer Clustering, I started a searchable indexer rolling restart (no "Force" flag, no "Site Order" flag), and some of my indexers got stuck with status Restarting. This has never happened before.

Below are some logs. As you can see, the first indexer (Site 2 - IDX03) restarted automatically, then the second indexer (Site 2 - IDX02) got stuck.

After some time I restarted it manually from the CLI. The same happened to the third indexer (Site 2 - IDX01); for the remaining ones the issue did not occur.

Site 2 - IDX03
04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1

Site 2 - IDX02 - Stuck
04-03-2023 14:52:28.612 +0200 INFO  CMSlave [29466 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690526348.612353 searchable_flag=1

Site 2 - IDX01 - Stuck
04-03-2023 15:46:33.294 +0200 INFO  CMSlave [40062 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690529593.294619 searchable_flag=1

Site 1 - IDX01
04-03-2023 16:14:08.911 +0200 INFO  CMSlave [19756 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531248.911129 searchable_flag=1

Site 1 - IDX03
04-03-2023 16:17:37.570 +0200 INFO  CMSlave [4829 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531457.570617 searchable_flag=1

Site 1 - IDX02
04-03-2023 16:22:32.841 +0200 INFO  CMSlave [30733 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531752.841733 searchable_flag=1
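
To put numbers on the delays, here is a minimal Python sketch that measures the gap between one "instructed peer to restart" message and the next. The peer labels and timestamps are copied from the log lines above; the variable names are only illustrative. The gap is a rough proxy for how long the manager waited on each peer, and for the stuck peers it also includes the time until I restarted them manually.

```python
from datetime import datetime

# (peer, timestamp) pairs copied from the CMSlave log lines above.
EVENTS = [
    ("Site 2 - IDX03", "04-03-2023 14:44:15.266 +0200"),
    ("Site 2 - IDX02", "04-03-2023 14:52:28.612 +0200"),
    ("Site 2 - IDX01", "04-03-2023 15:46:33.294 +0200"),
    ("Site 1 - IDX01", "04-03-2023 16:14:08.911 +0200"),
    ("Site 1 - IDX03", "04-03-2023 16:17:37.570 +0200"),
    ("Site 1 - IDX02", "04-03-2023 16:22:32.841 +0200"),
]

# splunkd.log timestamps are month-day-year with milliseconds and a UTC offset.
FMT = "%m-%d-%Y %H:%M:%S.%f %z"
parsed = [(peer, datetime.strptime(ts, FMT)) for peer, ts in EVENTS]

# Seconds between the restart instruction for a peer and the one for the next peer.
gaps = [
    (peer, (next_time - time).total_seconds())
    for (peer, time), (_, next_time) in zip(parsed, parsed[1:])
]

for peer, seconds in gaps:
    print(f"{peer}: {seconds / 60:.1f} min until the next peer was instructed")
```

The two peers that got stuck stand out with the longest gaps, while the other peers were followed by the next instruction within a few minutes.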

 

What seems strange to me are the values I see in the logs:

- restartTimeout=10000000

- threshold=1690525855
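
A quick arithmetic check on these two values, as a minimal Python sketch. It assumes that threshold is a Unix epoch timestamp in seconds, which is my reading of the numbers and not something I found documented. Under that assumption, threshold minus restartTimeout lands on the timestamp of the log line itself, so threshold looks like "now + restartTimeout".

```python
from datetime import datetime, timedelta, timezone

# Values copied from the Site 2 - IDX03 log line.
tz = timezone(timedelta(hours=2))                      # +0200 from the log
log_time = datetime(2023, 4, 3, 14, 44, 15, 266000, tzinfo=tz)
restart_timeout = 10_000_000                           # restartTimeout=10000000
threshold = 1690525855.266419                          # threshold=...

# Assumption: threshold is seconds since the Unix epoch.
implied_start = datetime.fromtimestamp(threshold - restart_timeout, tz=tz)
offset = abs((implied_start - log_time).total_seconds())

print("threshold - restartTimeout:", implied_start.isoformat())
print(f"difference from the log timestamp: {offset:.6f} s")
print(f"restartTimeout expressed in days: {restart_timeout / 86400:.1f}")
```

If the assumption holds, the timeout applied to the peer is about 115.7 days, far from the 60 seconds I expected from restart_timeout.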

 

I checked the documentation here:

https://docs.splunk.com/Documentation/Splunk/9.0.2/Indexer/Userollingrestart

 

I have the default values on my Cluster Master:

server.conf

[clustering]
heartbeat_timeout = 60
restart_timeout = 60
decommission_search_jobs_wait_secs = 180

limits.conf

[search]
search_retry = 0

 

Do you know what could have caused the peers to get stuck, and why I see such a high restartTimeout value in the logs, which does not reflect what I have in my configuration?

 

Thanks a lot,

Edoardo
