Deployment Architecture

OLD Indexers still give errors on SH-CLUSTER restart

verbal_666
Builder

Hi there.
This morning I did an SHC restart and found something very strange on the SHC members:

WARN  DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#1:8089 Authentication Failed
WARN  DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#2:8089 Authentication Failed
WARN  DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#3:8089 Authentication Failed
WARN  DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#4:8089 Authentication Failed
GetRemoteAuthToken [1964778 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#1:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetBundleListTransaction [1964778 DistributedPeerMonitorThread] - Unable to get bundle list from peer: https://OLDIDX#2:8089 due to: Connect Timeout; exceeded 60000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#3:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#4:8089 due to: Connect Timeout; exceeded 5000 milliseconds


All the OLDIDX#* hosts are old servers, turned off and decommissioned!

None of the SHC members has OLDIDX#* in its distsearch.conf 🙄

I recently upgraded the infrastructure from v7 to v8.

I also searched every .conf file for the IPs of OLDIDX#*; none of them was found.
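For completeness, a quick check along these lines (just a sketch: /opt/splunk stands in for my $SPLUNK_HOME and OLDIDX for the real hostnames/IPs) covers both the raw files and the merged config that splunkd actually resolves:

# Any leftover reference to the old indexers in the raw .conf files
grep -r --include='*.conf' -l 'OLDIDX' /opt/splunk/etc

# Merged view of distsearch.conf as btool resolves it
/opt/splunk/bin/splunk btool distsearch list --debug | grep -i 'OLDIDX'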

WHERE are those "artifacts" stored?

Is there something in the "raft" of the new SHC? Do I need to remove all the SHC configuration and rebuild it from scratch?

 

These messages appear in splunkd.log ONLY DURING the restart of the SHC.
During normal day-to-day use of the SHC I never had, and still don't have, any similar message.
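A quick grep of splunkd.log on a member is an easy way to confirm the timing (the path assumes a default install; OLDIDX stands for the real hosts):

# Show when the stale-peer warnings were logged
grep -E 'DistributedPeer|GetRemoteAuthToken|GetBundleListTransaction' /opt/splunk/var/log/splunk/splunkd.log | grep 'OLDIDX'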


Thanks.

1 Solution

verbal_666
Builder

Found the problem, and fixed it.

INFO  KeyManagerSearchPeers [601811 TcpChannelThread] - Sending SHC_NODE_HOSTNAME public key to search peer: https://OLDIDX:8089
ERROR SHCMasterPeerHandler [601811 TcpChannelThread] - Could not send public key to peer=https://OLDIDX:8089 for server=SHC_NODE_HOSTNAME (reason='')


Inside the SHC there was a node to which, probably some time ago, I had copied "distsearch.conf" manually, without deleting all the previous peers in the UI or restarting with a clean, empty "distsearch.conf". So the previous peers remained behind as "artifacts" (inside a system KV table?), and splunkd treated them as active even though they are not present or visible in "distsearch.conf" or in the UI Distributed Search panel.

Simple solution, from the UI of one SHC node (a REST equivalent is sketched after the list):

  • Delete all peers, one by one (the deletion syncs to the other nodes)
  • Add all peers back, one by one (the addition syncs to the other nodes)
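For anyone who prefers the management port over the UI, the same delete/re-add can be done against a member's REST API. This is only a sketch (localhost:8089, the credentials and the OLDIDX/NEWIDX names are placeholders); I actually did it from the UI:

# Remove a stale peer (entity name is <host>:<mgmt_port>, colon URL-encoded)
curl -k -u admin:changeme -X DELETE "https://localhost:8089/services/search/distributed/peers/OLDIDX%3A8089"

# Add a peer back
curl -k -u admin:changeme "https://localhost:8089/services/search/distributed/peers" -d name=NEWIDX:8089 -d remoteUsername=admin -d remotePassword=changeme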

 

After a clean restart, the WARN messages about the old indexers/peers went away.

So it really was an artifact, I presume inside a system KV table, since no .conf on the filesystem contains them!!! 🤷‍♂️




verbal_666
Builder

The SHC is perfectly in sync!!!
I have no errors at all while all the nodes are running; they only appear when I restart the SHC.
I will try restarting one node at a time and monitor the logs.

It's only out of curiosity, since the SHC works perfectly 🤷‍♀️🤷‍♀️🤷‍♀️
IMO it's some kind of "artifact" left over from previous versions, over which I did upgrades (6 to 7 to 8 [where we also moved the indexers to new nodes/servers]).
I'm quite sure that resetting the raft and rebuilding the SHC would make the "issue" go away.
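If I ever go down that road, the raft reset is roughly the procedure below; this is just a sketch from memory (sh1/sh2/sh3 and the credentials are placeholders), so double-check the docs for your version before running it:

# On every SHC member
/opt/splunk/bin/splunk stop
/opt/splunk/bin/splunk clean raft
/opt/splunk/bin/splunk start

# Then, on one member only, re-bootstrap the captain
/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "https://sh1:8089,https://sh2:8089,https://sh3:8089" -auth admin:changeme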


dural_yyz
Builder

Since you have an SHC, try this search on each individual SH to see if there is a config mismatch (I'm thinking you may have grown from a single instance to a cluster):

| rest splunk_server=local /services/search/distributed/peers

The output should help you determine whether one of the SHs is out of sync.
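If it's easier from the shell, the same endpoint can be hit on each member with curl and the peer lists compared side by side (hostnames and credentials are placeholders, and jq is assumed to be available):

for SH in sh01 sh02 sh03; do
  echo "== $SH =="
  # List the distributed search peers this member actually has configured
  curl -sk -u admin:changeme "https://$SH:8089/services/search/distributed/peers?output_mode=json" | jq -r '.entry[].name'
done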

Other than that, is your SHC set up for indexer discovery via the CM, which may still have those entries?
