DMC Alert - Search Peer Not Responding is great for getting notifications when a Splunk instance is having issues, but I find that it fires off false positives throughout the day. My suspicion is that either the Distributed Management Console or its search peers are busy when a status check is initiated, and the check times out.
I'd be interested in increasing this timeout as a way of troubleshooting the issue, but am not quite sure which configuration setting controls it. The alert in question:
| rest splunk_server=local /services/search/distributed/peers/
| where status!="Up"
| fields peerName, status
| rename peerName as Instance, status as Status
And when I read the distsearch.conf spec:
statusTimeout = <int, in seconds>
* Set connection timeout when gathering a search peer's basic info (/services/server/info).
* Note: Read/write timeouts are automatically set to twice this value.
* Defaults to 10.
My expectation is that increasing the statusTimeout on the DMC will give the search peers more slack as the DMC tries to get each peer's info, which in turn will result in fewer peers showing up as "Down" in /services/search/distributed/peers/.
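For reference, this is roughly what I'm considering trying on the DMC. The stanza name and value below are my own guesses, not something I've confirmed works (and I believe distsearch.conf changes need a splunkd restart to take effect):

```
# $SPLUNK_HOME/etc/system/local/distsearch.conf on the DMC
# Value of 30 is illustrative — tune based on how long peers actually take to respond.
[distributedSearch]
# Connection timeout (seconds) for gathering a peer's basic info; defaults to 10.
# Per the spec, read/write timeouts are automatically set to twice this value.
statusTimeout = 30
```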
Has anybody done anything along these lines? Is there anything I am missing or should look into more? Thanks for any advice!
Ultimately, this depends on the exact nature of the failure that leads the distributed search framework on the DMC to declare the status of some of your peers as "down" intermittently.
I think that statusTimeout is a good guess if you need to pick one timeout setting to extend. That said, it would be better to review the details of the peer failure in splunkd.log: understand which timeout led to the peer being declared down (the search head should log this) and what was happening on the peer itself at the time.
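As a starting point for that review, you can search the indexed copies of splunkd.log. A sketch of such a search — the component name here is an assumption on my part, so you may need to broaden or adjust it for your version:

```
index=_internal sourcetype=splunkd component=DistributedPeerManager (log_level=WARN OR log_level=ERROR)
| table _time, host, _raw
```

Correlating the timestamps of these messages with CPU, memory, and search concurrency on the affected peers should tell you whether the peers were genuinely overloaded or the status check is simply too aggressive.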