The DMC alert "Search peer not responding" produces false positives. Has anyone addressed this issue with a better, modified search?
We have been seeing these false positives lately too. With the help of the following search, we found out that our peers ran into authTokenConnectionTimeout, which defaults to 5 seconds:
index=_internal (GetRemoteAuthToken OR DistributedPeer OR DistributedPeerManager) source!="/opt/splunk/var/log/splunk/remote_searches.log"
| rex field=_raw "Peer:(?<peer>\S+)"
| rex field=_raw "peer: (?<peer>\S+)"
| rex field=_raw "uri=(?<peer>\S+)"
| eval peer = replace(peer, "https://", "")
| rex field=_raw "\d+-\d+-\d+\s+\d+:\d+:\d+\.\d+\s+\S+\s+(?<loglevel>\S+)\s+(?<process>\S+)"
| rex field=_raw "\] - (?<logMsg>.+)"
| reverse
| eval time=strftime(_time, "%d.%m.%Y %H:%M:%S.%Q")
| bin span=1d _time
| stats list(*) as * by peer _time
| table peer time loglevel process logMsg
Have you made this change, and what value (in seconds) would you suggest for statusTimeout? Are there any negative effects of increasing statusTimeout?
Try increasing statusTimeout in distsearch.conf on the DMC. This gives the search peers more slack while the DMC gathers each peer's info, which in turn should result in fewer peers showing up as "Down" in /services/search/distributed/peers/.
statusTimeout = <int, in seconds>
* Set connection timeout when gathering a search peer's basic info (/services/server/info).
* Note: Read/write timeouts are automatically set to twice this value.
* Defaults to 10.
You can do this from Settings >> Distributed search >> Distributed search >> Timeout settings by changing the Status timeout (in seconds) from the default value of 10 to something larger that suits your environment.
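If you prefer to edit the file directly, the same change could look roughly like the sketch below in distsearch.conf on the DMC. The values 30 and 10 are only illustrative assumptions, not recommendations, so pick numbers that fit your environment. The authTokenConnectionTimeout line is optional and only relevant if your peers hit that timeout, as described in the earlier reply.

```
# distsearch.conf on the DMC (example values are assumptions, tune for your environment)
[distributedSearch]
# Connection timeout when gathering a search peer's basic info; default is 10
statusTimeout = 30
# Optional: connection timeout when fetching a peer's auth token; default is 5
authTokenConnectionTimeout = 10
```

Raising these values trades faster failure detection for fewer false alarms, so a peer that is really down may take longer to be reported.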