Splunk Enterprise

Cluster Master - Indexer rolling restart: some peers (indexers) stay stuck in status Restarting

edoardo_vicendo
Contributor

Hello,

I am running Splunk Enterprise 9.0.2 on a multisite indexer cluster.

On the Cluster Master, under Settings > Indexer Clustering, I started a searchable indexer rolling restart (no "Force" flag, no "Site Order" flag), and some of my indexers got stuck in status Restarting. This has never happened before.

Below are some logs. As you can see, the first indexer (Site 2 - IDX03) restarted automatically, then the second indexer (Site 2 - IDX02) got stuck.

After some time I manually restarted it from the CLI. The same happened to the third indexer (Site 2 - IDX01); for the remaining ones the issue didn't occur.

Site 2 - IDX03
04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1

Site 2 - IDX02 - Stuck
04-03-2023 14:52:28.612 +0200 INFO  CMSlave [29466 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690526348.612353 searchable_flag=1

Site 2 - IDX01 - Stuck
04-03-2023 15:46:33.294 +0200 INFO  CMSlave [40062 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690529593.294619 searchable_flag=1

Site 1 - IDX01
04-03-2023 16:14:08.911 +0200 INFO  CMSlave [19756 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531248.911129 searchable_flag=1

Site 1 - IDX03
04-03-2023 16:17:37.570 +0200 INFO  CMSlave [4829 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531457.570617 searchable_flag=1

Site 1 - IDX02
04-03-2023 16:22:32.841 +0200 INFO  CMSlave [30733 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531752.841733 searchable_flag=1

 

What seems strange to me are the values I see in the logs:

- restartTimeout=10000000

- threshold=1690525855
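
One observation on these two values: the threshold looks like a Unix epoch timestamp, and in each log line above it appears to equal the log time plus restartTimeout. The Python sketch below checks that arithmetic against the six lines I pasted. It assumes the log timestamps are in MM-DD-YYYY format and that both values are in seconds; the names (LOG_SAMPLES, log_time_to_epoch) are mine, not anything from Splunk. It does not explain why restartTimeout is 10000000 instead of the configured 60.

```python
from datetime import datetime

# (log timestamp, logged threshold) pairs copied from the CMSlave lines above.
LOG_SAMPLES = [
    ("04-03-2023 14:44:15.266 +0200", 1690525855.266419),  # Site 2 - IDX03
    ("04-03-2023 14:52:28.612 +0200", 1690526348.612353),  # Site 2 - IDX02
    ("04-03-2023 15:46:33.294 +0200", 1690529593.294619),  # Site 2 - IDX01
    ("04-03-2023 16:14:08.911 +0200", 1690531248.911129),  # Site 1 - IDX01
    ("04-03-2023 16:17:37.570 +0200", 1690531457.570617),  # Site 1 - IDX03
    ("04-03-2023 16:22:32.841 +0200", 1690531752.841733),  # Site 1 - IDX02
]
RESTART_TIMEOUT = 10_000_000  # restartTimeout as logged; unit assumed to be seconds


def log_time_to_epoch(stamp: str) -> float:
    """Convert an 'MM-DD-YYYY HH:MM:SS.mmm +ZZZZ' stamp (assumed format) to epoch seconds."""
    return datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S.%f %z").timestamp()


# Difference between the logged threshold and the time the line was written.
differences = [threshold - log_time_to_epoch(stamp) for stamp, threshold in LOG_SAMPLES]

for (stamp, _), diff in zip(LOG_SAMPLES, differences):
    print(f"{stamp}: threshold - log time = {diff:.3f} s")
```

If the assumption holds, every difference comes out at roughly 10,000,000 seconds (about 115 days), so the threshold would simply be the deadline computed from the logged restartTimeout rather than an independent setting.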

 

I checked the documentation here:

https://docs.splunk.com/Documentation/Splunk/9.0.2/Indexer/Userollingrestart

 

I have the default values on my Cluster Master:

server.conf

[clustering]
heartbeat_timeout = 60
restart_timeout = 60
decommission_search_jobs_wait_secs = 180

limits.conf

[search]
search_retry = 0

 

Do you know what could have caused the indexers to get stuck, and why the logs show such a high restartTimeout value, which does not reflect what I have in my configuration?

 

Thanks a lot,

Edoardo
