
Slow indexer/receiver detection capability

hrawat_splunk
Splunk Employee

From 9.1.3/9.2.1 onwards, the slow indexer/receiver detection capability is fully functional (SPL-248188, SPL-248140).
 
https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/Fixedissues
You can enable it on the forwarding side in outputs.conf:

maxSendQSize = <integer>
* The size of the tcpout client send buffer, in bytes.
  If the tcpout client (indexer/receiver connection) send buffer is full,
  a new indexer is randomly selected from the list of indexers provided
  in the server setting of the target group stanza.
* This setting allows the forwarder to switch to a new indexer/receiver if the
  current indexer/receiver is slow.
* A non-zero value means that max send buffer size is set.
* 0 means no limit on max send buffer size.
* Default: 0
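
For illustration, a minimal outputs.conf sketch of how this might be enabled on a forwarder. The group name, server list, and sizes below are hypothetical and should be sized for your own environment (roughly 5% of maxQueueSize is suggested later in this thread).

# outputs.conf on the forwarder (illustrative group name, servers, and values)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
maxQueueSize = 50MB
# Switch to another indexer once this connection's send buffer holds roughly 2.5 MB
maxSendQSize = 2500000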

Additionally, 9.1.3/9.2.1 and above correctly log the target IP address causing tcpout blocking.

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

 


Note: This config works correctly starting with 9.1.3/9.2.1. Do not use it with 9.2.0/9.1.0/9.1.1/9.1.2 (there is an incorrect calculation; see https://community.splunk.com/t5/Getting-Data-In/Current-dest-host-connection-is-using-18446603427033...).


gjanders
SplunkTrust

This setting definitely looks useful for slow receivers, but how would I determine when to use it, and an appropriate value?

For example, you have mentioned:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

I note that you have Warningcount=20, whereas a quick check in my environment shows Warningcount=1. If I'm just seeing the occasional warning, I'm assuming tweaking this setting would be of minimal benefit?

 

Furthermore, how would I appropriately set the bytes value?

I'm assuming it's per pipeline, and the variables involved might relate to volume per second per pipeline; are there any other variables?

Any example of how this would be tuned and when?

 

Thanks


hrawat_splunk
Splunk Employee

If the warning count is 1, then it's not a big issue.
What it indicates is that, out of the maxQueueSize bytes of tcpout queue, one connection has occupied a large share, so the TcpOutputProcessor will experience pauses. maxQueueSize is per pipeline and is shared by all target connections in that pipeline.
You may want to increase maxQueueSize (for example, double it).
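
As a hedged illustration only (the group name and sizes are hypothetical), doubling maxQueueSize in the relevant outputs.conf stanza might look like this:

# outputs.conf (hypothetical group and sizes)
[tcpout:primary_indexers]
# previously: maxQueueSize = 25MB
maxQueueSize = 50MB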


gjanders
SplunkTrust

Thanks, I'll review the maxQueueSize.

If the warning count were higher, such as 20 in your example, what would be the best way to determine a good value (in bytes) for maxSendQSize to avoid the slow indexer scenario?


hrawat_splunk
Splunk Employee

If Warningcount is high, then I would check whether the target receiver/indexer is applying back-pressure. Check whether queues are blocked on the target. If the queues are not blocked, check on the target using netstat:

netstat -an|grep <splunktcp port>

and check whether the Recv-Q is high. If the receiver's queues are not blocked but netstat shows the Recv-Q is full, then the receiver needs additional pipelines.
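
If it comes to adding pipelines on the receiving indexer, that is typically done in server.conf; this is only a minimal sketch, and the value of 2 is an illustrative starting point, not a recommendation:

# server.conf on the receiving indexer (illustrative)
[general]
parallelIngestionPipelines = 2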

If Warningcount is high because there was a rolling restart at the indexing tier, then set maxSendQSize to roughly 5% of maxQueueSize.
Example:

maxSendQSize=2000000
maxQueueSize=50MB

If using autoLBVolume, then ensure that:

maxQueueSize > 5 x autoLBVolume
autoLBVolume > maxSendQSize
Example:

maxQueueSize=50MB
autoLBVolume=5000000
maxSendQSize=2000000

maxSendQSize is the total outstanding raw size of events/chunks in a connection's queue that still need to be handed to the TCP Send-Q; this generally builds up when the TCP Send-Q is already full.

autoLBVolume is the minimum total raw size of events/chunks to be sent to a connection before the forwarder switches to a new one.
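
Putting these together, a minimal outputs.conf sketch that satisfies both relationships above; the group name, server list, and sizes are illustrative assumptions, not recommendations:

# outputs.conf (illustrative values)
# Relationships: maxQueueSize > 5 x autoLBVolume, and autoLBVolume > maxSendQSize
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
maxQueueSize = 50MB
autoLBVolume = 5000000
maxSendQSize = 2000000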

gjanders
SplunkTrust

One minor request: if this logging is ever enhanced, can it please include the output group name?

05-16-2024 03:18:05.992 +0000 WARN AutoLoadBalancedConnectionStrategy [85268 TcpOutEloop] - Current dest host connection <ip address>:9997, oneTimeClient=0, _events.size()=56156, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Thu May 16 03:18:03 2024 is using 31477941 bytes. Total tcpout queue size is 31457280. Warningcount=1001

This is helpful; however, the destination IP happens to be Istio (a K8s software load balancer), and I have 3 indexer clusters with different DNS names on the same IP/port (the incoming DNS name determines which backend gets used). So my only way to "guess" the outputs.conf stanza involved is to set a unique queue size for each one, so I can determine which indexer cluster / output stanza is having the high warning count.
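
For reference, that workaround might look something like the sketch below; the group names, servers, and sizes are hypothetical, chosen only so that each cluster can be told apart by the "Total tcpout queue size" value in the WARN message:

# outputs.conf (hypothetical groups; a distinct maxQueueSize per cluster)
[tcpout:cluster_a]
server = cluster-a.example.com:9997
maxQueueSize = 150MB

[tcpout:cluster_b]
server = cluster-b.example.com:9997
maxQueueSize = 151MB

[tcpout:cluster_c]
server = cluster-c.example.com:9997
maxQueueSize = 152MB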

If it had tcpout=<stanzaname> or similar in the warning, that would be very helpful for me.

Thanks


hrawat_splunk
Splunk Employee

Is this the actual WARN log message you found?

If yes, what was the reason for the back-pressure?


gjanders
SplunkTrust

Yes, that's the actual WARN message. The worst I've seen is a warning count of 9001 with a 150MB queue; the forwarder itself forwards a peak of over 100MB/s.

05-21-2024 18:48:47.099 +1000 WARN AutoLoadBalancedConnectionStrategy [264180 TcpOutEloop] - Current dest host connection 10.x.x.x:9997, oneTimeClient=0, _events.size()=131822, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Tue May 21 18:48:36 2024 is using 157278423 bytes. Total tcpout queue size is 157286400. Warningcount=9001

That went from Warningcount=1 at 18:48:38.538 to Warningcount=1001 at 18:48:38.771.
Then 18:48:38.90 had 2001
18:48:39.033 had 3001
18:48:39.134 had 4001
18:48:39.200 had 5001
18:48:39.336 had 6001
18:48:39.553 had 7001
18:48:46.500 had 8001 and finally
18:48:47.099 had 9001

I suspect the backpressure is caused by an Istio pod failure in K8s. I haven't tracked down the cause, but I've seen some cases where the Istio ingress gateway pods in K8s are in a "not ready" state; however, I suspect they were alive enough to take on traffic.

During this time period I will sometimes see higher-than-normal Warningcount= entries *and*, often around the same time, my website availability checks start failing for DNS names that point to Istio pods.

My current suspicion is that it's not just Splunk-level backpressure, but I'll keep investigating (at the time, the indexing tier showed the most-utilised TCP input queues at 67%, using a max() measurement on their metrics.log).

The vast majority of my Warningcount= entries on this forwarder show a value of 1.

The configuration for this instance is:

maxQueueSize = 150MB
autoLBVolume = 10485760
autoLBFrequency = 1

dnsResolutionInterval = 259200
# connectionsPerTarget tuned to 2 * approx number of indexers
connectionsPerTarget = 96
# As per NLB tuning
heartbeatFrequency = 10
connectionTTL = 75
connectionTimeout = 10

autoLBFrequency = 1
maxSendQSize = 400000

# default 30 seconds, we can retry more quickly with istio as we should move to a new instance if it goes down
backoffOnFailure = 5

 

The maxSendQSize was tuned for a much lower-volume forwarder, and I forgot to update it for this instance, so I will increase that. This instance also appears to have grown from 30-50MB/s to closer to 100MB/s, so I'll increase the autoLBVolume setting as well.
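
As a sketch of what that adjustment might look like, keeping the relationships described earlier in the thread (maxQueueSize > 5 x autoLBVolume, autoLBVolume > maxSendQSize, maxSendQSize around 5% of maxQueueSize); the numbers are illustrative assumptions, not recommendations:

# outputs.conf (illustrative adjustment for a ~100MB/s forwarder)
maxQueueSize = 150MB
# 150MB = 157286400 bytes, which is greater than 5 x autoLBVolume (104857600)
autoLBVolume = 20971520
# roughly 5% of maxQueueSize, and below autoLBVolume
maxSendQSize = 7500000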


hrawat_splunk
Splunk Employee

That's great feedback. We will add the output group name.

gjanders
SplunkTrust

Thank you very much for the detailed reply; that gives me enough to act on now.

I appreciate the contributions to the community in this way.
