Hi there.
This morning I did an SHC restart and found something very strange on the SHC members:
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#1:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#2:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#3:8089 Authentication Failed
WARN DistributedPeer [1964778 DistributedPeerMonitorThread] - Peer:https://OLDIDX#4:8089 Authentication Failed
GetRemoteAuthToken [1964778 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#1:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetBundleListTransaction [1964778 DistributedPeerMonitorThread] - Unable to get bundle list from peer: https://OLDIDX#2:8089 due to: Connect Timeout; exceeded 60000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#3:8089 due to: Connect Timeout; exceeded 5000 milliseconds
GetRemoteAuthToken [2212932 DistributedPeerMonitorThread] - Unable to get auth token from peer: https://OLDIDX#4:8089 due to: Connect Timeout; exceeded 5000 milliseconds
All the OLDIDX#* hosts are old servers, turned off and shut down!
None of the SHC members has OLDIDX#* in distsearch.conf 🙄
I recently upgraded the infrastructure from v7 to v8.
I also searched every .conf file for the IPs of OLDIDX#*; none of them was found.
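In case it helps anyone double-checking the same thing, the merged view of every distsearch.conf on disk can be inspected with btool (this assumes the default $SPLUNK_HOME):
$SPLUNK_HOME/bin/splunk btool distsearch list --debug | grep -i oldidx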
WHERE are those "artifacts" stored?
Is there something in the "raft" of the new SHC? Do I need to remove all the SHC configuration and rebuild it from scratch?
These messages appear in splunkd.log ONLY DURING the restart of the SHC.
During normal day-to-day use of the SHC I never had, and still don't have, any similar message.
Thanks.
Found the problem, and fixed it.
INFO KeyManagerSearchPeers [601811 TcpChannelThread] - Sending SHC_NODE_HOSTNAME public key to search peer: https://OLDIDX:8089
ERROR SHCMasterPeerHandler [601811 TcpChannelThread] - Could not send public key to peer=https://OLDIDX:8089 for server=SHC_NODE_HOSTNAME (reason='')
Among the SHC nodes there was one to which, probably some time ago, I had copied distsearch.conf manually, without first deleting all the previous peers in the UI or restarting with a clean, empty distsearch.conf. Those previous peers remained behind as "artifacts" (inside an internal KV table?), and splunkd still treated them as active even though they were neither present nor visible in distsearch.conf or in the UI's Distributed Search panel.
Simple solution: from an SHC node UI, delete the old peers from the Distributed Search panel.
After a clean restart, the WARN messages about the old indexers/peers went away.
So it really was an artifact, I presume inside an internal KV table, since no .conf file on the filesystem contains them!!! 🤷♂️
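For anyone who prefers not to click through the UI, the same stale entries should also be removable through the management port; this is only a sketch, assuming admin credentials, the default port 8089, and a peer registered as host:port (special characters such as "#" would need URL encoding):
curl -k -u admin:changeme -X DELETE https://localhost:8089/services/search/distributed/peers/OLDIDX:8089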
The SHC is perfectly in sync!!!
I have no errors at all while all nodes are running, only when I restart the SHC.
I will try restarting one node at a time and monitor the logs.
It's only out of curiosity, since the SHC works perfectly 🤷♀️🤷♀️🤷♀️
IMO it's some kind of "artifact" left over from previous versions, over which I did upgrades (6 to 7 to 8 [where we moved the indexers to new nodes/servers]).
I'm quite sure that resetting the raft and rebuilding the SHC would clear the "issue".
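For reference, the usual raft reset sequence is roughly the following (only a sketch: run the first three commands on every member, then bootstrap the captain on one member only, adjusting the server list and credentials for your environment):
splunk stop
splunk clean raft
splunk start
splunk bootstrap shcluster-captain -servers_list "<mgmt URIs of all members>" -auth admin:changeme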
Since you have an SHC, try this search on each individual SH to see if there is a config mismatch (I'm wondering whether you maybe grew from a single instance to a cluster).
| rest splunk_server=local /services/search/distributed/peers
The output should help you determine if one of the SHs is out of sync.
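For example, narrowing it down to a few fields (field names assumed from the peers endpoint) makes any mismatch easier to eyeball:
| rest splunk_server=local /services/search/distributed/peers | table splunk_server title status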
Other than that, is your SHC set up to get its indexers via the cluster manager (CM), which may still have those entries?
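If you want to check that quickly (assuming this means the SHs are configured as cluster search heads in server.conf, and a default $SPLUNK_HOME), the relevant settings show up with:
$SPLUNK_HOME/bin/splunk btool server list clustering --debug
Look for mode = searchhead and the manager/master URI there.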