Recently we've been noticing that a lot of searches are getting connection timeouts when querying our indexer cluster.
We keep getting the message:
2 errors occurred while the search was executing. Therefore, search results might be incomplete. Error connecting: Connect Timeout Timeout error. Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.
Delving into the search.log, we see that we are getting 502 Bad Gateway from the indexer cluster:
06-28-2021 12:45:14.663 ERROR SearchResultTransaction - Got status 502 from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630
06-28-2021 12:45:14.663 ERROR SearchResultParser - HTTP error status message from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630: Error connecting: Connect Timeout
06-28-2021 12:45:14.663 WARN SearchResultCollator - Failure received on retry collector. _unresolvedRetries=1
06-28-2021 12:45:14.663 WARN SearchResultParserExecutor - Error connecting: Connect Timeout Timeout error. for collector=searchpeer01
06-28-2021 12:45:14.663 ERROR DispatchThread - sid:scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630 Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.
Considering receiveTimeout is already 600 seconds, I don't think increasing it will change anything. I'm not sure where these 502 errors are coming from or what to do about them.
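For reference, this is the timeout the error message is pointing at; it is set on the search head in distsearch.conf. A minimal sketch of the stanza as I understand it (the 600-second value matches what we already run, so raising it further seems unlikely to help):

```ini
# $SPLUNK_HOME/etc/system/local/distsearch.conf on the search head
[distributedSearch]
# Seconds to wait for search results from a peer before timing out.
receiveTimeout = 600
```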
Does anyone have any insight into what may be happening? Running version 8.1.3 on the search head and 7.3.3 on the indexer cluster (though planning to upgrade to 8.1.4 as soon as we are able to).
I see the exact same problems on an 8.0.4 indexer cluster and search head cluster. We have sporadic errors and timeouts.
Servers are dual-socket with 80 cores, 386 GB RAM, all-SSD storage, and a fiber network. Ping is around 1 ms between all servers. We also have no ingestion errors or other network-related errors; it is ONLY searches that are affected.
I also see many of these errors (though only logged as warnings?) in splunkd.log:
09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "IPaddress":47270 while accessing /services/streams/search: Broken pipe
on all indexers. When we see many of these, we also see several searches whose search.log shows the exact same errors as posted above, i.e. searches failing to retrieve correct results.
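To get a feel for how often this happens on an indexer, a quick grep over splunkd.log is enough. A self-contained sketch follows; the log path and sample lines are fabricated for illustration, so point LOG at your real $SPLUNK_HOME/var/log/splunk/splunkd.log instead:

```shell
# Illustrative only: build a fake splunkd.log so the snippet runs as-is.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "10.0.0.5":47270 while accessing /services/streams/search: Broken pipe
09-10-2021 12:41:11.004 +0200 WARN HttpListener - Socket error from "10.0.0.6":51234 while accessing /services/streams/search: Broken pipe
09-10-2021 12:45:00.100 +0200 INFO  Metrics - group=thruput, series="total"
EOF
# Count broken-pipe socket errors; a rising count correlates with failed searches.
count=$(grep -c 'Broken pipe' "$LOG")
echo "broken pipe warnings: $count"
rm -f "$LOG"
```

Running this per indexer (or via your usual log aggregation) lets you correlate spikes with the failing search times.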
Have any of you had any luck mitigating this? Or should the next step be a support case?
We have seen this "broken pipe" error in our environments as well. Not to a great extent, but we still see it, and we have to rerun the affected searches. Not sure what the cause is.
For various reasons (a merging of sites and a staggered roll-forward schedule), we had different versions of Enterprise servers running. Because of these issues, we pushed to move everything onto the same version, and this resolved most of them.
We still have other issues because we have multiple sites, some with lots of latency, but this isn't one of them.
I would probably recommend a support case, or an upgrade to the latest 8.1.x release.
FYI, 8.0.x reaches EOL next month.
Adding my 2 cents here: we have the exact same error messages. Also a multisite cluster and a search head cluster, all hardware-based.
Since we updated to 8.2.2, these issues started to occur. We have timeouts on our search head cluster members:
"Timed out waiting for peer [XXX]. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased."
And we also see the broken pipe events on our indexers. Splunk Support so far couldn't help; their last resort was to look at the network and OS level.
Before we updated we had no issues; they only started after the upgrade...
Have you checked network latency between your SHC nodes and the indexers? A simple ping is a good place to start...
Just an update on my end on this.
An upgrade fixed the problem. It looks to have been related to an internal Splunk setting around SSL compression.
The new version (8.2.2) has this setting set to false; it was true in the old version we ran (8.1.3).
In server.conf on both the search heads (search head cluster) and the indexers (indexer cluster):
useClientSSLCompression = false
I saw that this fixed the same problems for another customer on 8.1.4 (I think).
useClientSSLCompression defaults to true in older versions; it is false in the new ones.
If you run an older version of Splunk with a search head cluster (I have not seen this with a single search head and indexer cluster), you could try the setting above to see if it helps.
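If you want to try it, here is a minimal sketch of the change. To my understanding the setting lives under the [sslConfig] stanza of server.conf (you can verify the effective value with `splunk btool server list sslConfig`); a restart is needed after changing it:

```ini
# $SPLUNK_HOME/etc/system/local/server.conf on search heads and indexers
[sslConfig]
# Newer releases (8.2.x) default this to false; older ones defaulted to true.
useClientSSLCompression = false
```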