I'm getting quite a few "Unable to distribute to peer..." messages when searching in Splunk.
The reasons given tend to be '...because peer has status = "Down".' or Authentication Failed.
Sometimes just reloading the page will get through to the search peers. Sometimes it gives me that error a number of times in a row. But I've verified that the peer is not down, and I can connect to it from the search head with no problems.
The splunk servers are in different datacenters, and all I can think of is that there's a little bit of network lag and the connections aren't being made quickly enough?
Is there a config option to alter whatever timeout there is for this? Am I on the right track, or can someone suggest what else to look at?
There is indeed. Have a look at distsearch.conf (http://www.splunk.com/base/Documentation/latest/Admin/Distsearchconf ), particularly the following parameters:
connectionTimeout = <int, in seconds>
* Amount of time in seconds to use as a timeout during search peer connection establishment.
sendTimeout = <int, in seconds>
* Amount of time in seconds to use as a timeout while trying to write/send data to a search peer.
receiveTimeout = <int, in seconds>
* Amount of time in seconds to use as a timeout while trying to read/receive data from a search peer.
The defaults for these (and other) settings are set in
After some further inquiries with our Dev team, I've learned that the timeout settings in distsearch.conf will not actually have any effect on the problem.
It seems that we are timing out while trying to read the auth token from the peer (Unable to connect to peer uri...). The httpclient timeouts that affect this behavior are hardcoded and NOT configurable:
connectionTimeout = 5;
sendTimeout = 10;
rcvTimeout = 10;
There is no exposed setting that you could use to control these timeouts.
The timeout settings for the authentication token exchange between search-head and peers are exposed now as configurable values in distsearch.conf (since v4.3.6):
authTokenConnectionTimeout
* Maximum number of seconds to connect to a remote search peer, when getting its auth token
* Default is 5
authTokenSendTimeout
* Maximum number of seconds to send a request to the remote peer, when getting its auth token
* Default is 10
authTokenReceiveTimeout
* Maximum number of seconds to receive a response from a remote peer, when getting its auth token
* Default is 10
If you don't see any offending requests on the peer and the auth status is still failed, then the request is not reaching the peer at all. In that case, investigate general connectivity to the peer and consider adjusting authTokenConnectionTimeout and authTokenSendTimeout.
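One rough way to check general connectivity is to time a plain TCP connect from the search head to the peer's management port and compare it with the 5 second default of authTokenConnectionTimeout. The sketch below is only an illustration, not a Splunk tool: the function names are made up for this example, port 8089 is assumed from the URI in the error message, and it measures the TCP connect only (no SSL handshake or HTTP request), so the real auth token exchange will take longer than what it reports.

```python
import socket
import time


def time_tcp_connect(host, port=8089, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    The 5.0 second default mirrors the default authTokenConnectionTimeout;
    8089 is assumed to be the peer's management port.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def summarize(host, port=8089, attempts=5, timeout=5.0):
    """Try several connects and report how many failed and the slowest success."""
    results = [time_tcp_connect(host, port, timeout) for _ in range(attempts)]
    ok = [r for r in results if r is not None]
    return {
        "attempts": attempts,
        "failures": attempts - len(ok),
        "slowest": max(ok) if ok else None,
    }
```

Running summarize("your-peer-hostname") from the search head a few times during a period when the errors appear should show whether connects are intermittently slow or failing. If the slowest connect is already close to 5 seconds, raising authTokenConnectionTimeout is a reasonable next step.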
For failed connections, check splunkd.log on the search head for WARN messages from the UserManagerPro component:
WARN UserManagerPro - Unable to connect to peeruri=
Additionally, splunkd_access.log on the indexing peer will show the POST requests to this endpoint: /services/admin/auth-tokens
If these requests are taking longer than 10000ms then you are hitting the default timeout (authTokenReceiveTimeout).
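To find those slow requests without reading the log by eye, something like the following could be run against splunkd_access.log on the peer. This is a minimal sketch under an assumption about the log layout: it expects each line to contain the quoted request ("POST /services/admin/auth-tokens HTTP/1.1"), followed by the status code, and to end with the request duration written as a number followed by "ms". If your splunkd_access.log lines look different, the regular expression will need adjusting. The function name and the threshold parameter are made up for this example.

```python
import re

# Assumed line shape (not taken from Splunk documentation):
#   ... "POST /services/admin/auth-tokens HTTP/1.1" <status> ... <duration>ms
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) .*?(?P<ms>\d+)ms\s*$'
)

AUTH_TOKEN_ENDPOINT = "/services/admin/auth-tokens"


def slow_auth_token_requests(lines, threshold_ms=10000):
    """Return (duration_ms, line) pairs for auth-token POSTs slower than the threshold.

    The 10000 ms default corresponds to the default authTokenReceiveTimeout.
    """
    slow = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # line does not fit the assumed layout
        if match.group("method") != "POST":
            continue
        if AUTH_TOKEN_ENDPOINT not in match.group("path"):
            continue
        duration_ms = int(match.group("ms"))
        if duration_ms > threshold_ms:
            slow.append((duration_ms, line.rstrip("\n")))
    return slow
```

A typical use would be to open the log file and pass the file object as lines, then print each returned pair. Any hits mean the peer is taking longer than the default authTokenReceiveTimeout to answer, which points at raising that setting or at load on the peer.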
Where else would we see these authToken related messages in the log? The indexers are still intermittently down and I cannot figure out why.
authTokenConnectionTimeout = 20
authTokenReceiveTimeout = 30
authTokenSendTimeout = 30
I still see this error after a minute or so:
Unable to distribute to peer named BLAH at uri https://BLAH:8089 because replication was unsuccessful. replicationStatus Failed