Deployment Architecture

Sporadic "Timed out waiting for peer" messages when querying search peers / indexer cluster

althomas
Communicator

Recently we've been noticing a lot of searches have been getting connection timeouts when trying to query our indexer cluster.

We keep getting the message:

2 errors occurred while the search was executing. Therefore, search results might be incomplete.
Error connecting: Connect Timeout Timeout error.
Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.


Delving into the search.log, we see that we are getting 502 Bad Gateway from the indexer cluster:

06-28-2021 12:45:14.663 ERROR SearchResultTransaction - Got status 502 from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630
06-28-2021 12:45:14.663 ERROR SearchResultParser - HTTP error status message from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630: Error connecting: Connect Timeout
06-28-2021 12:45:14.663 WARN  SearchResultCollator - Failure received on retry collector. _unresolvedRetries=1
06-28-2021 12:45:14.663 WARN  SearchResultParserExecutor - Error connecting: Connect Timeout Timeout error. for collector=searchpeer01
06-28-2021 12:45:14.663 ERROR DispatchThread - sid:scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630 Timed out waiting for peer searchpeer01.  Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.

Considering that receiveTimeout is already 600 seconds, I don't think increasing it will change anything. I'm not sure where these 502 errors are coming from or what to do about them.
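
To check whether the 502s cluster on particular peers, the "Got status" lines in search.log can be tallied per peer and status code. This is only a minimal sketch: the function name `tally_peer_errors` and the regex are illustrative assumptions based on the log format shown above, not anything Splunk ships.

```python
import re
from collections import Counter

# Assumed line format, taken from the search.log excerpt in this thread:
# 06-28-2021 12:45:14.663 ERROR SearchResultTransaction - Got status 502 from https://10.0.0.43:8089/services/...
STATUS_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})\s+ERROR\s+"
    r"SearchResultTransaction - Got status (?P<status>\d{3}) from "
    r"https?://(?P<peer>[^/\s]+)/"
)


def tally_peer_errors(lines):
    """Count (peer, HTTP status) pairs across an iterable of search.log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.match(line)
        if match:
            counts[(match.group("peer"), int(match.group("status")))] += 1
    return counts


# Hypothetical usage against a search.log copied out of the dispatch directory:
# with open("search.log", encoding="utf-8", errors="replace") as fh:
#     for (peer, status), n in tally_peer_errors(fh).most_common():
#         print(peer, status, n)
```

Run against the excerpt above, this would count a single 502 for 10.0.0.43:8089. Across many failed searches it may show whether one peer is responsible or the errors are spread evenly.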

Does anyone have any insight into what may be happening? We are running version 8.1.3 on the search head and 7.3.3 on the indexer cluster (though we plan to upgrade to 8.1.4 as soon as we are able to).

 

Thanks!


agneticdk
Path Finder

Hi

I see the exact same problems on an 8.0.4 indexer cluster and search head cluster. We have sporadic errors and timeouts.

The servers are 80-core dual-socket machines with 386 GB RAM, all SSD, on a fiber network. Ping is around 1 ms between all servers. We also have no ingestion errors or other network-related errors; it is ONLY searches that are affected.

I also see many errors of this type (though only logged as warnings?) in splunkd.log:

09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "IPaddress":47270 while accessing /services/streams/search: Broken pipe

These appear on all indexers. When we see many of them, we also see several searches whose search.log shows the exact same errors as posted above, i.e. searches failing to retrieve correct results.

Have any of you had any luck mitigating this? Or should the next step be a support case?
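
One way to check that correlation is to bucket the "Broken pipe" warnings by minute and compare the busy minutes with the timestamps of the failing searches. This is a rough sketch only: `broken_pipes_per_minute` is a hypothetical helper, and the regex assumes the splunkd.log line format quoted above (including the timezone offset field).

```python
import re
from collections import Counter

# Assumed line format, taken from the splunkd.log excerpt in this thread:
# 09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from ... while accessing /services/streams/search: Broken pipe
BROKEN_PIPE_RE = re.compile(
    r"^(?P<minute>\d{2}-\d{2}-\d{4} \d{2}:\d{2}):\d{2}\.\d{3} \S+\s+WARN\s+"
    r"HttpListener - Socket error from .* while accessing "
    r"/services/streams/search: Broken pipe"
)


def broken_pipes_per_minute(lines):
    """Count HttpListener 'Broken pipe' warnings per minute (MM-DD-YYYY HH:MM)."""
    counts = Counter()
    for line in lines:
        match = BROKEN_PIPE_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts
```

Feeding it the open splunkd.log file from each indexer and printing `most_common(10)` would list the ten worst minutes per indexer.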


Terpz
New Member

We're seeing the same issue on 8.2.1. We're also not seeing any hardware or network issues, and the server is heavily spec'ed.


ktatrifork
Loves-to-Learn

We have seen this "broken pipe" error in our environments as well. Not to a great extent, but we still see it, and we have to rerun the affected searches. We're not sure what causes it.


althomas
Communicator

We had, for various reasons, different versions of enterprise servers due to a merging of sites and a stilted roll-forward schedule. Because of these issues, we pushed to move everything onto the same version and this resolved most of the issues.  

We still have other issues because we have multiple sites, some with lots of latency, but this isn't one of them.  

I would probably recommend a support case or an upgrade to the latest 8.1.X.

FYI 8.0.X is EOL from next month.


agneticdk
Path Finder

OK, thank you.

Yes, an upgrade is definitely also in the works. We might do that before raising a ticket.

 

André

codebuilder
Influencer

Have you checked network latency between your SHC nodes and the indexers? A simple ping is a good place to start...


althomas
Communicator

It's on the same network -- ping is 0-1 ms.
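
Ping only measures ICMP round trips, while the failing requests in search.log are HTTPS connections to port 8089, so timing the TCP connect itself may be more telling. Below is a minimal sketch using only the Python standard library. `time_tcp_connect` is a hypothetical helper, and the 10-second default timeout is an arbitrary choice for illustration.

```python
import socket
import time


def time_tcp_connect(host, port=8089, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    Port 8089 is the management port seen in the search.log URLs above.
    """
    start = time.monotonic()
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling this in a loop from the search head against each indexer, and logging any `None` or unusually slow result, might show whether connects stall intermittently even though ping looks healthy.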
