Search Heads Health Report Alerting on Decommissioned Indexer

crickel · ‎10-04-2023

We are in the process of a full hardware upgrade of all our indexers in our distributed environment. We have three standalone search heads connected to a cluster of many indexers. In the process, we are proceeding one at a time:

1. Loading up a new indexer
2. Integrating it into the cluster
3. Taking an old indexer offline, enforcing counts

When the decommissioning process finishes and the old indexers are gracefully shut down, an alert appears on our search heads in the Splunk Health Report: "The search head lost connection to the following peers: <decommissioned peer>. If there are unstable peers, confirm that the timeout (connectionTimeout and authTokenConnectionTimeout) settings in distsearch.conf are at appropriate values."

I cannot figure out why we are seeing this alert. My conclusion is that we must be missing a step somewhere.

To decommission a server, we do the following:
1. On the indexer: splunk offline --enforce-counts
2. On the cluster master: splunk remove cluster-peers -peers <GUID>
3. On the indexer: Completely uninstall Splunk.
4. On the cluster master: Rebalance indexes.
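As a rough sketch of the sequence above, the following Python helper only builds and prints the commands in order (a dry run); it does not execute anything. The function names and the example GUID are hypothetical. The flags `--enforce-counts` and `-peers`, and the use of `splunk rebalance cluster-data -action start` for the rebalance step, are written as I understand the Splunk CLI syntax, so check them against the documentation for your version.

```python
def decommission_plan(peer_guid, splunk_bin="splunk"):
    """Return ordered (role, command) steps for retiring one indexer.

    Dry run only: nothing is executed. The uninstall step is site-specific,
    so it is left as a manual note rather than a command.
    """
    if not peer_guid:
        raise ValueError("peer_guid is required")
    return [
        ("indexer", f"{splunk_bin} offline --enforce-counts"),
        ("cluster master", f"{splunk_bin} remove cluster-peers -peers {peer_guid}"),
        ("indexer", "(manual) completely uninstall Splunk"),
        # Assumption: "Rebalance indexes" maps to the data rebalance command.
        ("cluster master", f"{splunk_bin} rebalance cluster-data -action start"),
    ]


def render(plan):
    """Format each step as '[role] command' for review before running by hand."""
    return [f"[{role}] {command}" for role, command in plan]


if __name__ == "__main__":
    # "EXAMPLE-GUID" is a placeholder, not a real peer GUID.
    for line in render(decommission_plan("EXAMPLE-GUID")):
        print(line)
```

Keeping the role next to each command makes it harder to run a step on the wrong host when several indexers are being retired one after another.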

We have also tried reloading the health.conf configuration by running '|rest /services/configs/conf-health.conf/_reload' on the search heads, to no effect.

We cannot figure out where the health report is retaining this old data from, and the _internal logs clearly show that the moment of the GracefulShutdown transition on the Cluster Master is where the PeriodicHealthReporter component on the Search Heads begins to alert. The indexers in question are no longer listed as search peers on the search heads, and they're not listed as search peers on the cluster master either. The monitoring console looks fine. What could we be missing?

richgalloway · ‎10-04-2023

If you're using Indexer Discovery then nothing else should need to be done. Otherwise, go to each SH and remove the indexer from the Search Peers list (Settings->Distributed search) prior to shutting down the indexer.


crickel · ‎10-04-2023

From my understanding, Indexer Discovery is used on Forwarders to send data to Splunk, not on Search Heads. We don't have it enabled there.

The indexers in question are not currently present in the Search Peers list on the Search Heads under Settings -> Distributed Search -> Search Peers - we were under the impression that the cluster manager manages that list and should take care of all of the items there when servers are decommissioned. We'll definitely try removing them from the list beforehand to see if that makes a difference.

isoutamo · ‎10-04-2023

Hi

If you have normal SHs without any additional components like the MC, then your steps should be enough. But if you have, e.g., the MC configured in distributed mode with those individual nodes (you shouldn't), then you need to remove them from the distributed search list. So check your distributed search list definition and update it if needed.

r. Ismo

crickel · ‎10-09-2023

So to clarify:

We have a distributed environment, with a cluster of indexers being managed by a Cluster Master. We have the Search Heads configured as standalone search heads. The Search Peers are not configured in distsearch.conf on the search heads - they just connect to the cluster master for the list of indexers.

We attempted to remove the peers from the list of Search Peers in Distributed Search in Settings, and got an error stating, "Cannot remove peer... This peer is a part of a cluster." That is what you would expect in a clustered environment.

We were able to delete the peers from the Cluster Master, but deleting the peers there is what causes the Search Heads to complain about losing connection to search peers, as it appears the Cluster Master doesn't inform the Search Heads about the change in the search peer list.

We were also able to find a window with no scheduled searches running, during which we could restart the search heads. Restarting the search heads caused them to reload the list of search peers from the cluster master, and they stopped reporting the error.

Is there another way to force search heads to refresh this cached list of search peers from the Cluster Master without restarting them?

isoutamo · ‎10-13-2023

If you have defined all peers by adding the cluster as a search target, then running "splunk remove cluster-peers -peers <GUID>" on the CM should be enough to remove the peer from the CM's search peer list, once you have removed that peer from the cluster. If this didn't work, then you should open a support case with Splunk.

Of course, if you have manually configured something extra in your health report, then you probably need to update it. See https://docs.splunk.com/Documentation/Splunk/9.1.1/DMC/Configurefeaturemonitoring#:~:text=Log%20in%2...

crickel · ‎10-13-2023

That was one of our steps in the decommissioning process we were using. Removing the host from the cluster peers didn't remove them from whatever list the Health Reporter component is using on the search heads. They were definitely removed - looking at Settings -> Distributed Search -> Search Peers clearly shows them not being present.

Yet the Health Reporter alert still complains about a lack of connectivity to the decommissioned Search Peer. It appears the only way to reload whatever list the Health Reporter holds internally is to restart the Splunk service on the Search Head, or to disable the Health Reporter component for Search Peer connectivity entirely. There are no half measures or custom lists in the health.conf file.
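To confirm that an alert really refers to a peer that is no longer in the search peer list, the message text can be compared against the current list. The sketch below is a small, offline Python check. It assumes the message wording quoted earlier in this thread, and it assumes that multiple peers in the message are separated by commas or whitespace, which the thread does not confirm. The hostnames are made up, and the current peer list is something you supply yourself, for example copied from Settings -> Distributed Search -> Search Peers.

```python
import re

# Message prefix as quoted from the Health Report in this thread.
PREFIX = "The search head lost connection to the following peers:"


def peers_in_alert(message):
    """Extract the peer names listed in a 'lost connection' health message.

    The list is taken to end at the first period followed by whitespace
    (or end of text), so dots inside hostnames and IPs are kept.
    """
    match = re.search(re.escape(PREFIX) + r"\s*(.+?)\.(?:\s|$)", message)
    if not match:
        return []
    # Assumption: peers are separated by commas and/or whitespace.
    return [p for p in re.split(r"[,\s]+", match.group(1)) if p]


def stale_peers(message, current_peers):
    """Return peers named in the alert that are not in the current peer list."""
    current = {p.lower() for p in current_peers}
    return [p for p in peers_in_alert(message) if p.lower() not in current]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    alert = (
        "The search head lost connection to the following peers: "
        "idx-old-01.example.com. If there are unstable peers, confirm that "
        "the timeout (connectionTimeout and authTokenConnectionTimeout) "
        "settings in distsearch.conf are at appropriate values."
    )
    print(stale_peers(alert, ["idx-new-01.example.com", "idx-new-02.example.com"]))
```

If every peer named in the alert comes back as stale, that supports the reading that the Health Reporter is working from a cached list rather than from the current search peers.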
