Deployment Architecture

Sporadic "Timed out waiting for peer" messages when querying search peers / indexer cluster

althomas
Communicator

Recently we've been noticing a lot of searches have been getting connection timeouts when trying to query our indexer cluster.

We keep getting the message:

2 errors occurred while the search was executing. Therefore, search results might be incomplete.
Error connecting: Connect Timeout Timeout error.
Timed out waiting for peer searchpeer01. Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.


Delving into the search.log, we see that we are getting 502 Bad Gateway from the indexer cluster:

06-28-2021 12:45:14.663 ERROR SearchResultTransaction - Got status 502 from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630
06-28-2021 12:45:14.663 ERROR SearchResultParser - HTTP error status message from https://10.0.0.43:8089/services/streams/search?sh_sid=scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630: Error connecting: Connect Timeout
06-28-2021 12:45:14.663 WARN  SearchResultCollator - Failure received on retry collector. _unresolvedRetries=1
06-28-2021 12:45:14.663 WARN  SearchResultParserExecutor - Error connecting: Connect Timeout Timeout error. for collector=searchpeer01
06-28-2021 12:45:14.663 ERROR DispatchThread - sid:scheduler__username_aW52X2NpdF9zbm93X3NlYXJjaA__RMD565f4e7f87d23277d_at_1624880700_38630 Timed out waiting for peer searchpeer01.  Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased.

Considering the receiveTimeout is already 600 seconds, I don't think increasing it will change anything. I'm not sure where these 502 errors are coming from or what to do about them.
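
To see whether the 502s and timeouts cluster on particular peers, a small tally over search.log can help. This is only a minimal Python sketch: the two patterns are modeled on the log lines quoted above, and the function name `summarize` is made up for illustration. Other Splunk versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the search.log lines quoted in this post (assumption:
# other versions may phrase these messages differently).
STATUS_RE = re.compile(r"Got status (\d{3}) from https?://([^:/\s]+)")
PEER_RE = re.compile(r"Timed out waiting for peer (\S+?)\.(?:\s|$)")


def summarize(lines):
    """Count HTTP error statuses per peer address and timeouts per peer name."""
    statuses = Counter()
    timeouts = Counter()
    for line in lines:
        status_match = STATUS_RE.search(line)
        if status_match:
            host = status_match.group(2)
            code = int(status_match.group(1))
            statuses[(host, code)] += 1
        peer_match = PEER_RE.search(line)
        if peer_match:
            timeouts[peer_match.group(1)] += 1
    return statuses, timeouts
```

Passing an open file handle for search.log to `summarize` returns two counters, one keyed by (address, status) and one keyed by peer name.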

Does anyone have any insight into what may be happening? Running version 8.1.3 on the search head and 7.3.3 on the indexer cluster (though planning to upgrade to 8.1.4 as soon as we are able to).

 

Thanks!

agneticdk
Path Finder

Hi

I see the exact same problems on an 8.0.4 indexer cluster and search head cluster. We have sporadic errors and timeouts.

The servers are 80-core dual-socket machines with 386 GB RAM, all-SSD storage, and a fiber network. Ping is around 1 ms between all servers. We also have no ingestion errors or other network-related errors; it ONLY affects searches.

I also see many errors of this type (though only logged as warnings?) in splunkd.log:

09-10-2021 12:39:03.296 +0200 WARN HttpListener - Socket error from "IPaddress":47270 while accessing /services/streams/search: Broken pipe

on all indexers. When we see many of these, we also see several searches whose search.log contains the exact same errors as posted above, i.e. searches failing to retrieve the correct results.

Have any of you had any luck mitigating this? Or should the next step be a support case?
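
One way to line these warnings up against the failing searches is to bucket them by minute. The following is a rough Python sketch, not a tested tool: the pattern is modeled on the single splunkd.log line quoted above, and `broken_pipes_per_minute` is a made-up helper name.

```python
import re
from collections import Counter

# Modeled on the HttpListener warning quoted above; the client address part is
# matched loosely because its exact format is an assumption.
BROKEN_PIPE_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}):\d{2}\.\d{3}.*WARN\s+HttpListener - "
    r"Socket error from .* while accessing (\S+): Broken pipe"
)


def broken_pipes_per_minute(lines):
    """Count 'Broken pipe' warnings per (minute, endpoint) pair."""
    counts = Counter()
    for line in lines:
        match = BROKEN_PIPE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Minutes with a high count can then be compared with the timestamps of the searches that reported "Timed out waiting for peer".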

Terpz
Loves-to-Learn Lots

We're seeing the same issue on 8.2.1. We're also not seeing any hardware or network issues, and the servers are heavily spec'ed.

ktatrifork
Loves-to-Learn

We have seen this "broken pipe" error in our environments as well. Not to a great extent, but we still see it, and we have to rerun the affected searches. Not sure what the cause is.

althomas
Communicator

We had, for various reasons, different versions of enterprise servers due to a merging of sites and a stilted roll-forward schedule. Because of these issues, we pushed to move everything onto the same version and this resolved most of the issues.  

We still have other issues because we have multiple sites, some with lots of latency, but this isn't one of them.  

I would probably recommend a support case or an upgrade to the latest 8.1.X.

FYI 8.0.X is EOL from next month.

agneticdk
Path Finder

OK, thank you.

Yes, an upgrade is definitely also in the works. Might do that before raising a ticket.

 

André

DanielAmlung
Path Finder

Adding my 2 cents here - we have the exact same error messages. Also a multisite Cluster and a Search Head Cluster - all Hardware based.

Since we updated to 8.2.2, these issues have started to occur. We get timeouts on our search head cluster members:

"Timed out waiting for peer [XXX] . Search results might be incomplete! If this occurs frequently, receiveTimeout in distsearch.conf might need to be increased."

And we also have the broken pipe events on our indexers. Splunk Support couldn't help so far. Their last resort was to look at the network and OS level.

Before we updated we had no issues, now they started...

codebuilder
Influencer

Have you checked network latency between your SHC nodes and the indexers? A simple ping is a good place to start...
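
Since the error in search.log is a connect timeout against port 8089, timing a TCP connect to that port may tell you a bit more than an ICMP ping. A minimal Python sketch, assuming the peers listen on the 8089 management port seen in the URLs above; `connect_time_ms` is just an illustrative name, and this measures only the TCP handshake, not TLS or the search request itself.

```python
import socket
import time


def connect_time_ms(host, port=8089, timeout=5.0):
    """Time one TCP connect to host:port; return milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return None
```

Calling `connect_time_ms("searchpeer01")` repeatedly from a search head, and logging any `None` or unusually slow results, could show whether connects fail intermittently even when ping looks fine.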

althomas
Communicator

It's on the same network -- ping is 0-1 ms.

agneticdk
Path Finder

Just an update on my end on this. 

An upgrade fixed the problem. I think the cause was an internal Splunk setting for SSL compression.

The new version 8.2.2 has this setting set to false; it was true in the old version we ran (8.1.3).

In server.conf on both search heads (search head cluster) and indexers (indexer cluster):

[sslConfig]
useClientSSLCompression = false
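
To check quickly whether a given server.conf already carries this override, something like the following Python sketch could be used. It assumes the file is close enough to INI syntax for `configparser` to read, which may not hold for every Splunk .conf file, and `client_ssl_compression` is a made-up helper name. It also looks at the text of a single file only, so it does not show the effective value after Splunk merges its configuration layers.

```python
import configparser


def client_ssl_compression(conf_text):
    """Return the useClientSSLCompression value from [sslConfig], or None if unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(conf_text)
    if not parser.has_section("sslConfig"):
        return None
    return parser.get("sslConfig", "useClientSSLCompression", fallback=None)
```

A return value of `None` means the file does not set the option, so the version's default applies unless another configuration layer overrides it.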

 

I saw that this fixed the same problems for another customer on 8.1.4 (I think).

useClientSSLCompression defaults to true in older versions and to false in the new ones.

If you run an older version of Splunk with a search head cluster (I have not seen the problem with a single search head and an indexer cluster), you could try the above to see if it works.

 

Regards

André
