
Slow indexer/receiver detection capability

hrawat_splunk
Splunk Employee

From 9.1.3/9.2.1 onwards, the slow indexer/receiver detection capability is fully functional (SPL-248188, SPL-248140).
 
https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/Fixedissues
You can enable it on the forwarding side in outputs.conf:

maxSendQSize = <integer>
* The size of the tcpout client send buffer, in bytes.
  If the tcpout client (indexer/receiver connection) send buffer is full,
  a new indexer is randomly selected from the list of indexers provided
  in the 'server' setting of the target group stanza.
* This setting allows the forwarder to switch to a new indexer/receiver if the
  current indexer/receiver is slow.
* A non-zero value means that a maximum send buffer size is set.
* 0 means no limit on the maximum send buffer size.
* Default: 0
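
For instance, a minimal forwarder-side sketch might look like the following (the group name, server list, and 2 MB value are illustrative assumptions, not values from this post):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Hypothetical value: switch away from a receiver once ~2 MB is pending on its connection
maxSendQSize = 2000000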

Additionally, 9.1.3/9.2.1 and above will correctly log the target IP address causing tcpout blocking, for example:

 

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

 


Note: This config works correctly starting with 9.1.3/9.2.1. Do not use it with 9.2.0/9.1.0/9.1.1/9.1.2 (there is an incorrect calculation; see https://community.splunk.com/t5/Getting-Data-In/Current-dest-host-connection-is-using-18446603427033...).


gjanders
SplunkTrust

This setting definitely looks useful for slow receivers, but how would I determine when to use it, and an appropriate value?

For example you have mentioned:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

I note that you have Warningcount=20; a quick check in my environment shows Warningcount=1. If I'm just seeing the occasional warning, I'm assuming tweaking this setting would be of minimal benefit?

 

Furthermore, how would I appropriately set the bytes value?

I'm assuming it's per-pipeline, and the variables involved might relate to volume per second per pipeline. Any other variables?

Any example of how this would be tuned and when?

 

Thanks


hrawat_splunk
Splunk Employee

If the warning count is 1, then it's not a big issue.
What it indicates is that, out of the maxQueueSize bytes of tcpout queue, one connection has occupied a large share, so TcpOutputProcessor will get pauses. maxQueueSize is per pipeline and is shared by all target connections of that pipeline.
You may want to increase maxQueueSize (double the size).
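
As a rough sketch, assuming the queue was previously 50MB, doubling it in the forwarder's outputs.conf could look like this (the group name is hypothetical):

[tcpout:primary_indexers]
# Hypothetical: previous value was 50MB, doubled to 100MB
maxQueueSize = 100MB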


gjanders
SplunkTrust

Thanks, I'll review the maxQueueSize

If the warning count were higher, such as 20 in your example, what would be the best way to determine a good value (in bytes) for maxSendQSize to avoid the slow indexer scenario?


hrawat_splunk
Splunk Employee

If Warningcount is high, then I would like to see if the target receiver/indexer is applying back-pressure. Check whether queues are blocked on the target. If queues are not blocked, check on the target using netstat:

netstat -an | grep <splunktcp port>

and see whether Recv-Q is high. If the receiver's queues are not blocked, but netstat shows Recv-Q is full, then the receiver needs additional pipelines.
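
As a quick sketch, assuming the usual netstat column layout (Recv-Q is the second field) and 9997 as the splunktcp port, something like this lists only connections with a non-empty Recv-Q:

netstat -an | grep 9997 | awk '$2 > 0'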

If Warningcount is high because there was a rolling restart at the indexing tier, then set maxSendQSize to roughly 5% of maxQueueSize.
Example:

maxSendQSize=2000000
maxQueueSize=50MB
If using autoLBVolume, then have:

maxQueueSize > 5 x autoLBVolume
autoLBVolume > maxSendQSize

Example:

maxQueueSize=50MB
autoLBVolume=5000000
maxSendQSize=2000000
maxSendQSize is the total outstanding raw size of events/chunks in a connection's queue that still needs to be sent to the TCP Send-Q. This generally builds up when the TCP Send-Q is already full.

autoLBVolume is the minimum total raw size of events/chunks to be sent to a connection.
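
Putting the relationships above together, a sketch of a single forwarder-side stanza (the group name and server list are hypothetical; the values mirror the example figures above) could be:

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Per pipeline, shared by all target connections; kept > 5 x autoLBVolume
maxQueueSize = 50MB
# Minimum raw bytes sent to a connection before load-balancing away (~5 MB)
autoLBVolume = 5000000
# Pending bytes on a connection that trigger switching to another receiver (~2 MB, < autoLBVolume)
maxSendQSize = 2000000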

gjanders
SplunkTrust

One minor request: if this logging is ever enhanced, can it please include the output group name?

05-16-2024 03:18:05.992 +0000 WARN AutoLoadBalancedConnectionStrategy [85268 TcpOutEloop] - Current dest host connection <ip address>:9997, oneTimeClient=0, _events.size()=56156, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Thu May 16 03:18:03 2024 is using 31477941 bytes. Total tcpout queue size is 31457280. Warningcount=1001

That warning is helpful; however, the destination IP happens to be Istio (a K8s software load balancer), and I have 3 indexer clusters with different DNS names on the same IP/port (the incoming DNS name determines which backend gets used). So my only way to "guess" the outputs.conf stanza involved is to set a unique queue size for each one so I can determine which indexer cluster / output stanza is having the high warning count.

If it had tcpout=<stanzaname> or similar in the warning, that would be very helpful for me.

Thanks


gjanders
SplunkTrust

Thank you very much for the detailed reply; that gives me enough to act on now.

I appreciate the contributions to the community in this way.
