Recently we've been noticing a lot of searches have been getting connection timeouts when trying to query our indexer cluster.
We keep getting the message:
2 errors occurred while the search was executing. Therefore, search results might be incomplete.
Error connecting: Connect Timeout Timeout error.
Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.
Delving into the search.log, we see that we are getting 502 Bad Gateway from the indexer cluster:
06-28-2021 12:45:14.663 ERROR SearchResultTransaction - Got status 502 from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630
06-28-2021 12:45:14.663 ERROR SearchResultParser - HTTP error status message from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630: Error connecting: Connect Timeout
06-28-2021 12:45:14.663 WARN SearchResultCollator - Failure received on retry collector. _unresolvedRetries=1
06-28-2021 12:45:14.663 WARN SearchResultParserExecutor - Error connecting: Connect Timeout Timeout error. for collector=searchpeer01
06-28-2021 12:45:14.663 ERROR DispatchThread - sid:scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630 Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.
Our receiveTimeout is already 600 seconds, so I don't think raising it will change anything. I'm not sure where these 502 errors are coming from or what to do about them.
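For reference, receiveTimeout lives in the [distributedSearch] stanza of distsearch.conf on the search head, alongside the related connection timeouts. Since the error above is "Error connecting: Connect Timeout" (a failure to establish the connection) rather than a receive timeout, connectionTimeout may be the more relevant knob. A minimal sketch; the values shown are the shipped defaults, not recommendations:

```ini
# distsearch.conf on the search head -- illustrative, values are defaults
[distributedSearch]
# Seconds to wait for the initial connection to a search peer
connectionTimeout = 10

# Seconds to wait while sending a request to a peer
sendTimeout = 600

# Seconds to wait for data back from a peer before giving up
receiveTimeout = 600
```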
Does anyone have any insight into what may be happening? Running version 8.1.3 on the search head and 7.3.3 on the indexer cluster (though planning to upgrade to 8.1.4 as soon as we are able to).
I see the exact same problems on an 8.0.4 indexer cluster and search head cluster. We have sporadic errors and timeouts.
Servers are 80-core dual-socket, 386 GB RAM, all SSD, on a fiber network. Ping is around 1 ms between all servers. We also have no ingestion errors or other network-related errors; it is ONLY searches that are affected.
Also, I see many of these types of errors (though only logged as a warning?) in the splunkd.log:
09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "IPaddress":47270 while accessing /services/streams/search: Broken pipe
on all indexers. When we see many of these, we also see several searches whose search.log contains the exact same errors as posted above, i.e. searches failing to retrieve correct results.
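One way to see whether the broken-pipe warnings cluster in time with the failed searches is to bucket them by minute. A rough sketch, assuming splunkd.log lines follow the format shown above (the sample lines here are made up for illustration):

```python
import re
from collections import Counter

# Hypothetical splunkd.log excerpt; in practice, read the real file.
sample = """\
09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "10.0.0.12":47270 while accessing /services/streams/search: Broken pipe
09-10-2021 12:39:04.101 +0200 WARN HttpListener - Socket error from "10.0.0.12":47311 while accessing /services/streams/search: Broken pipe
09-10-2021 12:40:17.554 +0200 INFO Metrics - group=thruput
"""

# Capture the timestamp down to the minute on lines mentioning "Broken pipe".
pattern = re.compile(r'^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}).*Broken pipe')

per_minute = Counter()
for line in sample.splitlines():
    m = pattern.match(line)
    if m:
        per_minute[m.group(1)] += 1

for minute, count in sorted(per_minute.items()):
    print(f"{minute}  {count} broken-pipe warnings")
```

Spikes in that count that line up with the timestamps of failing searches would support the theory that the two symptoms share one underlying network or load issue.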
Have any of you had any luck in mitigating this? Or should the next step be a support case?
We have seen this "broken pipe" error in our environments as well. Not to a great extent, but we still see it, and we have to rerun the affected searches. Not sure what the cause of this is.
We had, for various reasons, different Enterprise server versions due to a merging of sites and a staggered roll-forward schedule. Because of these issues, we pushed to move everything onto the same version, and this resolved most of the problems.
We still have other issues because we have multiple sites, some with lots of latency, but this isn't one of them.
I would probably recommend a support case or an upgrade to the latest 8.1.X
FYI 8.0.X is EOL from next month.
Have you checked network latency between your SHC nodes and the indexers? A simple ping is a good place to start...
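Beyond ping, it may be worth timing the actual TCP connect to the peers' management port (8089), since that is the connection the "Connect Timeout" error refers to. A minimal sketch; the peer names below are placeholders for your own indexers:

```python
import socket
import time

def tcp_connect_ms(host, port, timeout=5.0):
    """Return the TCP connect latency in milliseconds, or None if the
    connection fails (refused, timed out, or unresolvable)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    # Placeholder peer list -- substitute your indexers.
    peers = ["searchpeer01", "searchpeer02"]
    for peer in peers:
        latency = tcp_connect_ms(peer, 8089, timeout=2.0)
        status = f"{latency:.1f} ms" if latency is not None else "UNREACHABLE"
        print(f"{peer}:8089 -> {status}")
```

Consistently slow or intermittently failing connects here, while ICMP ping stays at 1 ms, would point at something on the TCP path (firewall state tables, conntrack limits, or splunkd itself being slow to accept) rather than raw network latency.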