Hi there.
This morning I did an SHC restart and found something very strange on the SHC members:
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#1:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#2:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#3:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#4:8089 Authentication Failed
GetRemoteAuthToken [1964778 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#1:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetBundleListTransaction [1964778 DistributedPeerMonitorThread] - Unable to get bundle list from peer: https://OLDIDX#2:8089 due to: Connect Timeout; exceeded 60000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#3:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#4:8089 due to: Connect Timeout; exceeded 5000 milliseconds
All the OLDIDX#* hosts are old servers, turned off and shut down!
None of the SHC members has OLDIDX#* in distsearch.conf 🙄
I recently upgraded the infrastructure from v7 to v8.
I also searched every .conf file for the IPs of OLDIDX#*; none of them was found.
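In case it helps anyone double-checking the same thing, the merged view of every distsearch.conf on disk can be inspected with btool (this assumes the default $SPLUNK_HOME):
$SPLUNK_HOME/bin/splunk btool distsearch list --debug | grep -i oldidx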
WHERE are those "artifacts" stored?
Is there something in the "raft" of the new SHC? Do I need to remove all the SHC configuration and rebuild it from scratch?
These messages appear in splunkd.log ONLY DURING the restart of the SHC.
During normal day-to-day use of the SHC I never had, and still don't have, any similar message.
Thanks.
Found the problem, and fixed it.
INFO KeyManagerSearchPeers [601811 TcpChannelThread] - Sending SHC_NODE_HOSTNAME public key to search peer: https://OLDIDX:8089
ERROR SHCMasterPeerHandler [601811 TcpChannelThread] - Could not send public key to peer=https://OLDIDX:8089 for server=SHC_NODE_HOSTNAME (reason='')
Among the SHC nodes there was one to which, probably some time ago, I had copied distsearch.conf manually, without first deleting all the previous peers in the UI or restarting with a clean, empty distsearch.conf. Those previous peers remained behind as "artifacts" (inside an internal KV table?), and splunkd still treated them as active even though they were neither present nor visible in distsearch.conf or in the UI's Distributed Search panel.
Simple solution: from an SHC node UI, delete the old peers from the Distributed Search panel.
After a clean restart, the WARN messages about the old indexers/peers went away.
So it really was an artifact, I presume inside an internal KV table, since no .conf file on the filesystem contains them!!! 🤷♂️
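For anyone who prefers not to click through the UI, the same stale entries should also be removable through the management port; this is only a sketch, assuming admin credentials, the default port 8089, and a peer registered as host:port (special characters such as "#" would need URL encoding):
curl -k -u admin:changeme -X DELETE https://localhost:8089/services/search/distributed/peers/OLDIDX:8089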
The SHC is perfectly in sync!!!
I have no errors at all while all nodes are running, only when I restart the SHC.
I will try restarting one node at a time and monitor the logs.
It's only out of curiosity, since the SHC works perfectly 🤷♀️🤷♀️🤷♀️
IMO it's some kind of "artifact" left over from previous versions, over which I did upgrades (6 to 7 to 8 [where we moved the indexers to new nodes/servers]).
I'm quite sure that resetting the raft and rebuilding the SHC would clear the "issue".
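For reference, the usual raft reset sequence is roughly the following (only a sketch: run the first three commands on every member, then bootstrap the captain on one member only, adjusting the server list and credentials for your environment):
splunk stop
splunk clean raft
splunk start
splunk bootstrap shcluster-captain -servers_list "<mgmt URIs of all members>" -auth admin:changeme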
Since you have an SHC, try this search on each individual SH to see if there is a config mismatch (I'm wondering whether you maybe grew from a single instance to a cluster).
| rest splunk_server=local /services/search/distributed/peers
The output should help you determine if one of the SHs is out of sync.
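For example, narrowing it down to a few fields (field names assumed from the peers endpoint) makes any mismatch easier to eyeball:
| rest splunk_server=local /services/search/distributed/peers | table splunk_server title status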
Other than that, is your SHC set up to get its indexers via the cluster manager (CM), which may still have those entries?
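If you want to check that quickly (assuming this means the SHs are configured as cluster search heads in server.conf, and a default $SPLUNK_HOME), the relevant settings show up with:
$SPLUNK_HOME/bin/splunk btool server list clustering --debug
Look for mode = searchhead and the manager/master URI there.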