
Cluster Master - Indexer rolling restart: some peers (indexers) stay stuck in status Restarting

edoardo_vicendo
Contributor

Hello,

I am running Splunk Enterprise 9.0.2 on a multisite indexer cluster.

On the Cluster Master, under Settings >> Indexer Clustering, I started a searchable indexer rolling restart (no "Force" flag, no "Site Order" flag), and some of my indexers got stuck in status Restarting. This has never happened before.

Below are some logs. As you can see, the first indexer (Site 2 - IDX03) restarted automatically, then the second indexer (Site 2 - IDX02) got stuck.

After some time I restarted it manually from the CLI. The same happened with the third indexer (Site 2 - IDX01); the remaining ones restarted without the issue.

Site 2 - IDX03
04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690525855.266419 searchable_flag=1

Site 2 - IDX02 - Stuck
04-03-2023 14:52:28.612 +0200 INFO  CMSlave [29466 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690526348.612353 searchable_flag=1

Site 2 - IDX01 - Stuck
04-03-2023 15:46:33.294 +0200 INFO  CMSlave [40062 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690529593.294619 searchable_flag=1

Site 1 - IDX01
04-03-2023 16:14:08.911 +0200 INFO  CMSlave [19756 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531248.911129 searchable_flag=1

Site 1 - IDX03
04-03-2023 16:17:37.570 +0200 INFO  CMSlave [4829 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531457.570617 searchable_flag=1

Site 1 - IDX02
04-03-2023 16:22:32.841 +0200 INFO  CMSlave [30733 CMHeartbeatThread] - Cluster manager has instructed peer to restart, restartTimeout=10000000 threshold=1690531752.841733 searchable_flag=1

 

What looks strange to me are these values in the logs:

- restartTimeout=10000000

- threshold=1690525855
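
As a quick sanity check on those two numbers, here is a minimal Python sketch that parses one of the log lines above and compares the threshold with the time the line was logged. The regular expression and the function name parse_restart_line are my own, written only for this post. In these lines the threshold works out to the log time plus restartTimeout, and 10000000 seconds is roughly 115.7 days. That is only an observation from my logs, not documented behaviour.

```python
import re
from datetime import datetime

# One of the log lines quoted above (Site 2 - IDX03).
LINE = (
    "04-03-2023 14:44:15.266 +0200 INFO  CMSlave [2862 CMHeartbeatThread] - "
    "Cluster manager has instructed peer to restart, restartTimeout=10000000 "
    "threshold=1690525855.266419 searchable_flag=1"
)

# Hypothetical pattern for this one message type; not an official format spec.
PATTERN = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}) .*"
    r"restartTimeout=(?P<timeout>\d+) threshold=(?P<threshold>[\d.]+)"
)


def parse_restart_line(line):
    """Extract the log time (as epoch seconds), restartTimeout and threshold."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a restart instruction line")
    # The log timestamp is month-day-year with a UTC offset.
    logged = datetime.strptime(m.group("ts"), "%m-%d-%Y %H:%M:%S.%f %z")
    return {
        "logged_epoch": logged.timestamp(),
        "restart_timeout": int(m.group("timeout")),
        "threshold": float(m.group("threshold")),
    }


info = parse_restart_line(LINE)
gap = info["threshold"] - info["logged_epoch"]
print(f"threshold - log time = {gap:.3f} s")
print(f"restartTimeout in days = {info['restart_timeout'] / 86400:.1f}")
```

So the threshold looks like an epoch deadline derived from restartTimeout, which still leaves open where the 10000000 comes from.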

 

I checked the documentation here:

https://docs.splunk.com/Documentation/Splunk/9.0.2/Indexer/Userollingrestart

 

I have the default values on my Cluster Master:

server.conf

[clustering]
heartbeat_timeout = 60
restart_timeout = 60
decommission_search_jobs_wait_secs = 180

limits.conf

[search]
search_retry = 0

 

Do you know what could have caused the indexers to get stuck, and why I see such high restartTimeout values in the logs, which do not reflect what I have in my configuration?

 

Thanks a lot,

Edoardo
