Deployment Architecture

Search Heads are unable to distribute to Indexers

tlam_splunk
Splunk Employee

The search heads frequently log an error saying they cannot connect to an indexer:

"Unable to distribute to peer named xx.xx.xx.xx:8089 at uri=xx.xx.xx.xx:8089 using the uri-scheme=https because peer has status=Down"

It happens intermittently and on every search head. The connection problem can affect any indexer in the cluster; it is not restricted to a particular one. In the worst case, a search head reports the error for all indexers, which causes a service outage.

After some time, without any intervention, the service returns to normal.

CPU and memory usage stay normal even while the errors are occurring.
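
To see whether the errors really are spread across all indexers, it can help to count them per peer. The following is a minimal sketch, not a Splunk tool: the regular expression is built only from the error text quoted above, and the function name `count_peer_down` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted in this post.
PEER_DOWN = re.compile(
    r"Unable to distribute to peer named (?P<peer>\S+) at uri=\S+ .*status=Down"
)


def count_peer_down(lines):
    """Return a Counter mapping peer name to the number of status=Down errors."""
    counts = Counter()
    for line in lines:
        match = PEER_DOWN.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


# Example use (the log path is an assumption; adjust it for your install):
#   with open("/opt/splunk/var/log/splunk/splunkd.log", errors="replace") as f:
#       for peer, hits in count_peer_down(f).most_common():
#           print(peer, hits)
```

A roughly even count across peers would support the observation that the problem is not tied to one indexer.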


tlam_splunk
Splunk Employee

1) Increase the timeouts in distsearch.conf on the search head (SH):

[distributedSearch]
statusTimeout = 120
connectionTimeout = 120
authTokenConnectionTimeout = 120
authTokenSendTimeout = 120
authTokenReceiveTimeout = 120

[replicationSettings]
connectionTimeout = 120
sendRcvTimeout = 120

On the indexers, in distsearch.conf:

[replicationSettings]
connectionTimeout = 120
sendRcvTimeout = 120

and in server.conf:

[httpServer]
busyKeepAliveIdleTimeout = 120

The change helped slightly, but the error message still appears from time to time.

2) The pstack output shows that the I/O thread is busy with SSL handshakes. The slow SSL operations delay processing and cause the timeouts.

3) SSL operations (especially compression) are CPU-intensive routines. They are invoked within the main I/O thread (management port), so I/O operations slow down.

4) This is a known limitation of the OpenSSL design: compression is done during the write operation, which blocks I/O.
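
The pstack check in step 2 can be roughly automated. This is only a sketch under assumptions: it expects gdb-style pstack output where each thread starts with a line beginning `Thread N `, and it treats any frame containing `SSL_` or `deflate` as SSL or compression work. The helper name `threads_in_ssl` and the marker strings are illustrative choices, not something taken from Splunk.

```python
import re

# Each thread section in pstack output is assumed to start like:
#   Thread 2 (Thread 0x7f... (LWP 1234)):
THREAD_HEADER = re.compile(r"^Thread \d+ ")


def threads_in_ssl(pstack_text, markers=("SSL_", "deflate")):
    """Return (threads_with_marker_frames, total_threads) for one pstack dump."""
    blocks = []
    for line in pstack_text.splitlines():
        if THREAD_HEADER.match(line):
            blocks.append([])          # start collecting frames for a new thread
        elif blocks:
            blocks[-1].append(line)    # frame line belonging to the current thread
    hits = sum(
        1 for frames in blocks
        if any(marker in frame for frame in frames for marker in markers)
    )
    return hits, len(blocks)
```

Taking several dumps a few seconds apart and seeing the same thread in SSL frames each time would be consistent with the diagnosis above; a single dump proves little.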

5) Disable SSL client compression on the search head, in server.conf:

[sslConfig]
useClientSSLCompression = false

6) The system returned to normal after SSL client compression was disabled on the search head.
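
To confirm that the setting from step 5 is actually present in a given server.conf, something like the following could be used. It is a hedged sketch: it parses one file with Python's `configparser`, which handles simple .conf files but does not reproduce Splunk's layered configuration merging, and the function name `check_conf` is made up here.

```python
import configparser

# (stanza, key, expected value) taken from step 5 of this thread.
EXPECTED = [("sslConfig", "useClientSSLCompression", "false")]


def check_conf(text, expected=EXPECTED):
    """Return a list of (stanza, key, found_value) for settings that differ."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive, as Splunk does
    parser.read_string(text)
    problems = []
    for stanza, key, want in expected:
        got = parser.get(stanza, key, fallback=None)
        if got is None or got.strip().lower() != want:
            problems.append((stanza, key, got))
    return problems


# Example use (the path is an assumption; adjust it for your install):
#   with open("/opt/splunk/etc/system/local/server.conf") as f:
#       print(check_conf(f.read()) or "settings match")
```

Because the effective value can come from any app or system directory, `splunk btool server list sslConfig` remains the more reliable way to see what splunkd will actually use.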


PavelP
Motivator

Hello @tlam,

Great analysis!

  • 5 - Does disabling SSL compression increase the network transfer time, and does that require increasing the timeouts even further?

  • 3 - Can you please check whether the crypto in use is hardware accelerated?

    openssl engine -t -c

    cat /proc/crypto

Can you please post /proc/cpuinfo and /etc/*elease*?

I cannot find a good reference right now; can you please check https://openwrt.org/docs/techref/hardware/cryptographic.hardware.accelerators
