
Slow indexer/receiver detection capability

hrawat_splunk
Splunk Employee

From 9.1.3/9.2.1 onwards, the slow indexer/receiver detection capability is fully functional (SPL-248188, SPL-248140).
 
https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/Fixedissues
You can enable it on the forwarding side in outputs.conf:

maxSendQSize = <integer>
* The size of the tcpout client send buffer, in bytes.
  If the tcpout client (indexer/receiver connection) send buffer is full,
  a new indexer is randomly selected from the list of indexers provided
  in the 'server' setting of the target group stanza.
* This setting allows the forwarder to switch to a new indexer/receiver if the
  current indexer/receiver is slow.
* A non-zero value means that a maximum send buffer size is set.
* 0 means no limit on the maximum send buffer size.
* Default: 0
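
For instance, a minimal forwarder-side sketch might look like the following (the group name, server list, and 2 MB value are illustrative assumptions, not values from this post):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Hypothetical value: switch away from a receiver once ~2 MB is pending on its connection
maxSendQSize = 2000000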

Additionally, 9.1.3/9.2.1 and above will correctly log the target IP address causing tcpout blocking, for example:

 

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

 


Note: This config works correctly starting with 9.1.3/9.2.1. Do not use it with 9.2.0/9.1.0/9.1.1/9.1.2 (there is an incorrect calculation; see https://community.splunk.com/t5/Getting-Data-In/Current-dest-host-connection-is-using-18446603427033...).


gjanders
SplunkTrust

This setting definitely looks useful for slow receivers, but how would I determine when to use it, and an appropriate value?

For example you have mentioned:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

I note that you have Warningcount=20; a quick check in my environment shows Warningcount=1. If I'm just seeing the occasional warning, I'm assuming tweaking this setting would be of minimal benefit?

 

Furthermore, how would I appropriately set the bytes value?

I'm assuming it's per-pipeline, and the variables involved might relate to volume per second per pipeline. Any other variables?

Any example of how this would be tuned and when?

 

Thanks


hrawat_splunk
Splunk Employee

If the warning count is 1, then it's not a big issue.
What it indicates is that, out of the maxQueueSize bytes of tcpout queue, one connection has occupied a large share, so TcpOutputProcessor will get pauses. maxQueueSize is per pipeline and is shared by all target connections of that pipeline.
You may want to increase maxQueueSize (double the size).
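
As a rough sketch, assuming the queue was previously 50MB, doubling it in the forwarder's outputs.conf could look like this (the group name is hypothetical):

[tcpout:primary_indexers]
# Hypothetical: previous value was 50MB, doubled to 100MB
maxQueueSize = 100MB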


gjanders
SplunkTrust

Thanks, I'll review the maxQueueSize

If the warning count were higher, such as 20 in your example, what would be the best way to determine a good value (in bytes) for maxSendQSize to avoid the slow indexer scenario?


hrawat_splunk
Splunk Employee

If Warningcount is high, then I would like to see if the target receiver/indexer is applying back-pressure. Check whether queues are blocked on the target. If queues are not blocked, check on the target using netstat:

netstat -an | grep <splunktcp port>

and see whether Recv-Q is high. If the receiver's queues are not blocked, but netstat shows Recv-Q is full, then the receiver needs additional pipelines.
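
As a quick sketch, assuming the usual netstat column layout (Recv-Q is the second field) and 9997 as the splunktcp port, something like this lists only connections with a non-empty Recv-Q:

netstat -an | grep 9997 | awk '$2 > 0'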

If Warningcount is high because there was a rolling restart at the indexing tier, then set maxSendQSize to roughly 5% of maxQueueSize.
Example:

maxSendQSize=2000000
maxQueueSize=50MB
If using autoLBVolume, then have:

maxQueueSize > 5 x autoLBVolume
autoLBVolume > maxSendQSize

Example:

maxQueueSize=50MB
autoLBVolume=5000000
maxSendQSize=2000000
maxSendQSize is the total outstanding raw size of events/chunks in a connection's queue that still needs to be sent to the TCP Send-Q. This generally builds up when the TCP Send-Q is already full.

autoLBVolume is the minimum total raw size of events/chunks to be sent to a connection.
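
Putting the relationships above together, a sketch of a single forwarder-side stanza (the group name and server list are hypothetical; the values mirror the example figures above) could be:

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Per pipeline, shared by all target connections; kept > 5 x autoLBVolume
maxQueueSize = 50MB
# Minimum raw bytes sent to a connection before load-balancing away (~5 MB)
autoLBVolume = 5000000
# Pending bytes on a connection that trigger switching to another receiver (~2 MB, < autoLBVolume)
maxSendQSize = 2000000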

gjanders
SplunkTrust

One minor request: if this logging is ever enhanced, can it please include the output group name?

05-16-2024 03:18:05.992 +0000 WARN AutoLoadBalancedConnectionStrategy [85268 TcpOutEloop] - Current dest host connection <ip address>:9997, oneTimeClient=0, _events.size()=56156, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Thu May 16 03:18:03 2024 is using 31477941 bytes. Total tcpout queue size is 31457280. Warningcount=1001

That warning is helpful; however, the destination IP happens to be Istio (a K8s software load balancer), and I have 3 indexer clusters with different DNS names on the same IP/port (the incoming DNS name determines which backend gets used). So my only way to "guess" the outputs.conf stanza involved is to set a unique queue size for each one so I can determine which indexer cluster / output stanza is having the high warning count.

If it had tcpout=<stanzaname> or similar in the warning, that would be very helpful for me.

Thanks


gjanders
SplunkTrust

Thank you very much for the detailed reply; that gives me enough to act on now.

I appreciate the contributions to the community in this way.
