Slow indexer/receiver detection capability

hrawat_splunk · ‎04-09-2024

From 9.1.3/9.2.1 onwards, the slow indexer/receiver detection capability is fully functional (SPL-248188, SPL-248140).

https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/Fixedissues
You can enable it on the forwarding side in outputs.conf:

maxSendQSize = <integer>
* The size of the tcpout client send buffer, in bytes.
  If tcpout client(indexer/receiver connection) send buffer is full,
  a new indexer is randomly selected from the list of indexers provided
  in the server setting of the target group stanza.
* This setting allows forwarder to switch to new indexer/receiver if current
  indexer/receiver is slow.
* A non-zero value means that max send buffer size is set.
* 0 means no limit on max send buffer size.
* Default: 0

Additionally, 9.1.3/9.2.1 and above correctly log the target IP address that is causing tcpout blocking:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20
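To track these warnings over time, it can help to pull the numbers out of the message. The following is a minimal sketch, assuming the message layout is exactly as in the sample line above; the regular expression, the function name and the dictionary keys are illustrative choices, not part of Splunk.

```python
import re

# Pattern for the tcpout WARN line quoted above. The group names are our own labels.
WARN_RE = re.compile(
    r"Current dest host connection (?P<dest>[^,]+), .*?"
    r"is using (?P<used>\d+) bytes\. "
    r"Total tcpout queue size is (?P<total>\d+)\. "
    r"Warningcount=(?P<count>\d+)"
)


def parse_tcpout_warning(line):
    """Return the destination, byte counts and warning count, or None if no match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    used = int(m.group("used"))
    total = int(m.group("total"))
    return {
        "dest": m.group("dest"),
        "used_bytes": used,
        "total_bytes": total,
        "warning_count": int(m.group("count")),
        # Share of the per-pipeline tcpout queue held by this one connection.
        "share": used / total,
    }


sample = (
    "WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host "
    "connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, "
    "_waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 "
    "is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20"
)

parsed = parse_tcpout_warning(sample)
print(parsed)
```

For the sample line, the single connection holds roughly 77% of the 25MB tcpout queue.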

Note: This config works correctly starting with 9.1.3/9.2.1. Do not use it with 9.1.0/9.1.1/9.1.2/9.2.0 (the calculation is incorrect in those versions: https://community.splunk.com/t5/Getting-Data-In/Current-dest-host-connection-is-using-18446603427033...).

gjanders · ‎05-08-2024

This setting definitely looks useful for slow receivers, but how would I determine when to use it, and an appropriate value?

For example you have mentioned:

WARN AutoLoadBalancedConnectionStrategy [xxxx TcpOutEloop] - Current dest host connection nn.nn.nn.nnn:9997, oneTimeClient=0, _events.size()=20, _refCount=2, _waitingAckQ.size()=4, _supportsACK=1, _lastHBRecvTime=Thu Jan 20 11:07:43 2024 is using 20214400 bytes. Total tcpout queue size is 26214400. Warningcount=20

I note that you have Warningcount=20; a quick check in my environment shows Warningcount=1. If I'm just seeing the occasional warning, I'm assuming tweaking this setting would be of minimal benefit?

Furthermore, how would I appropriately set the bytes value?

I'm assuming it's per-pipeline, and the variables involved might relate to volume per second per pipeline. Are there any other variables?

Any example of how this would be tuned and when?

Thanks

-
Alerts for Splunk Admins, Version Control for Splunk, Decrypt2 VersionControl For SplunkCloud

hrawat_splunk · ‎05-08-2024

If the warning count is 1, it's not a big issue.
It indicates that, out of the maxQueueSize bytes of the tcpout queue, one connection has occupied a large share, so TcpOutputProcessor will pause. maxQueueSize is per pipeline and is shared by all target connections in that pipeline.
You may want to increase maxQueueSize (for example, double it).

gjanders · ‎05-08-2024

Thanks, I'll review the maxQueueSize

If the warning count were higher, such as 20 in your example, what would be the best way to determine a good value (in bytes) for maxSendQSize to avoid the slow indexer scenario?


hrawat_splunk · ‎05-09-2024

If Warningcount is high, I would first check whether the target receiver/indexer is applying back-pressure. Check whether queues are blocked on the target. If they are not blocked, check on the target using netstat:

netstat -an|grep <splunktcp port>

and see whether Recv-Q is high. If the receiver queues are not blocked but netstat shows Recv-Q is full, then the receiver needs additional pipelines.
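The netstat check above can be scripted. The sketch below assumes the Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name and the sample addresses are made up for illustration, and port 9997 is taken from the log lines in this thread. It only lists established connections on the receiving port, sorted by Recv-Q; deciding what counts as "high" is left to you.

```python
def recv_q_for_port(netstat_output, port):
    """Return (recv_q, local, foreign) for established TCP sockets on a local port."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, local, foreign, state = fields[:6]
        if state != "ESTABLISHED":
            continue
        if local.rsplit(":", 1)[-1] == str(port):
            rows.append((int(recv_q), local, foreign))
    # Largest Recv-Q first, so backed-up connections are at the top.
    return sorted(rows, reverse=True)


# Made-up sample of `netstat -an` output on a receiver.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:9997            0.0.0.0:*               LISTEN
tcp   523776      0 10.0.0.5:9997           10.0.0.21:51234         ESTABLISHED
tcp        0      0 10.0.0.5:9997           10.0.0.22:40112         ESTABLISHED
tcp        0      0 10.0.0.5:8089           10.0.0.30:55555         ESTABLISHED
"""

result = recv_q_for_port(SAMPLE, 9997)
for recv_q, local, foreign in result:
    print(recv_q, local, foreign)
```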

If Warningcount is high because there was a rolling restart at the indexing tier, set maxSendQSize to roughly 5% of maxQueueSize.
Example:

maxSendQSize=2000000
maxQueueSize=50MB

If using autoLBVolume, make sure that:

maxQueueSize > 5 x autoLBVolume
autoLBVolume > maxSendQSize
Example:

maxQueueSize=50MB
autoLBVolume=5000000
maxSendQSize=2000000

maxSendQSize is the total outstanding raw size of events/chunks in the connection queue that still need to be sent to the TCP Send-Q. Such a backlog generally builds up when the TCP Send-Q is already full.

autoLBVolume is the minimum total raw size of events/chunks to be sent to a connection.
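The sizing guidelines above can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the byte value of "MB" (1024 x 1024) is inferred from the log lines in this thread, where a 150MB queue is reported as 157286400 bytes.

```python
MB = 1024 * 1024  # inferred from the thread: 150MB is logged as 157286400 bytes


def check_tcpout_sizing(max_queue_size, auto_lb_volume, max_send_q_size):
    """Return a list of violated guidelines (empty if both hold). All values in bytes."""
    problems = []
    if not max_queue_size > 5 * auto_lb_volume:
        problems.append("maxQueueSize should be > 5 x autoLBVolume")
    if not auto_lb_volume > max_send_q_size:
        problems.append("autoLBVolume should be > maxSendQSize")
    return problems


def suggested_max_send_q_size(max_queue_size, percent=5):
    """Roughly 5% of maxQueueSize, using integer arithmetic."""
    return max_queue_size * percent // 100


# Values from the example above.
print(check_tcpout_sizing(50 * MB, 5_000_000, 2_000_000))  # no violations
print(suggested_max_send_q_size(50 * MB))
```

For a 50MB queue, exactly 5% is 2621440 bytes. The 2000000 used in the example is a little under that, which is consistent with the "roughly 5%" guidance.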

gjanders · ‎05-16-2024

One minor request: if this logging is ever enhanced, can it please include the output group name?

05-16-2024 03:18:05.992 +0000 WARN AutoLoadBalancedConnectionStrategy [85268 TcpOutEloop] - Current dest host connection <ip address>:9997, oneTimeClient=0, _events.size()=56156, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Thu May 16 03:18:03 2024 is using 31477941 bytes. Total tcpout queue size is 31457280. Warningcount=1001

This is helpful; however, the destination IP happens to be istio (a K8s software load balancer), and I have 3 indexer clusters with different DNS names on the same IP/port (the incoming DNS name determines which backend gets used). So my only way to "guess" the outputs.conf stanza involved is to set a unique queue size for each one, so I can determine which indexer cluster / output stanza is having the high warning count.

If it had tcpout=<stanzaname> or similar in the warning that would be very helpful for me.
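Until the stanza name is logged, the unique-queue-size workaround described above can be automated along these lines. This is a sketch under assumptions: the stanza names and the 40MB/50MB sizes are hypothetical, and only the 30MB value (31457280 bytes) comes from the log line above.

```python
import re

MB = 1024 * 1024

# Hypothetical mapping: each tcpout stanza is given a distinct maxQueueSize
# so that the "Total tcpout queue size" value identifies the stanza.
QUEUE_SIZE_TO_STANZA = {
    30 * MB: "tcpout:cluster_a",
    40 * MB: "tcpout:cluster_b",
    50 * MB: "tcpout:cluster_c",
}

TOTAL_RE = re.compile(r"Total tcpout queue size is (\d+)\.")


def stanza_for_warning(line):
    """Return the stanza whose queue size matches the WARN line, or None."""
    m = TOTAL_RE.search(line)
    if m is None:
        return None
    return QUEUE_SIZE_TO_STANZA.get(int(m.group(1)))


line = "is using 31477941 bytes. Total tcpout queue size is 31457280. Warningcount=1001"
print(stanza_for_warning(line))
```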

Thanks


hrawat_splunk · ‎05-21-2024

Is this the actual WARN log message you found?

If yes, what was the reason for the back-pressure?

gjanders · ‎05-21-2024

Yes, that's the actual WARN message. The worst I've seen is a warning count of 9001 with a 150MB queue; the forwarder itself forwards a peak of over 100MB/s.

05-21-2024 18:48:47.099 +1000 WARN AutoLoadBalancedConnectionStrategy [264180 TcpOutEloop] - Current dest host connection 10.x.x.x:9997, oneTimeClient=0, _events.size()=131822, _refCount=1, _waitingAckQ.size()=0, _supportsACK=0, _lastHBRecvTime=Tue May 21 18:48:36 2024 is using 157278423 bytes. Total tcpout queue size is 157286400. Warningcount=9001

That went from Warningcount=1 at 18:48:38.538 to Warningcount=1001 at 18:48:38.771
Then 18:48:38.90 has 2001
18:48:39.033 has 3001
18:48:39.134 has 4001
18:48:39.200 has 5001
18:48:39.336 has 6001
18:48:39.553 has 7001
18:48:46.500 has 8001 and finally:
18:48:47.099 has 9001

I suspect the backpressure is caused by an istio pod failure in K8s. I haven't tracked down the cause, but I've seen some cases where the istio ingress gateway pods in K8s are in a "not ready" state; however, I suspect they were alive enough to take on traffic.

During this time period I will sometimes see higher than normal Warningcount= entries *and* often around the same time my website availability checks start failing to DNS names that are pointed to istio pods.

My current suspicion is that it's not just Splunk-level backpressure, but I'll keep investigating (at the time, the indexing tier showed the most utilised TCP input queues at 67%, using a max() measurement on their metrics.log).

The vast majority of my Warningcount= entries on this forwarder show a value of 1.

The configuration for this instance is:

maxQueueSize = 150MB
autoLBVolume = 10485760
autoLBFrequency = 1

dnsResolutionInterval = 259200
# tweaks the connectionsPerTarget = 2 * approx number of indexers
connectionsPerTarget = 96
# As per NLB tuning
heartbeatFrequency = 10
connectionTTL = 75
connectionTimeout = 10

autoLBFrequency = 1
maxSendQSize = 400000

# default 30 seconds, we can retry more quickly with istio as we should move to a new instance if it goes down
backoffOnFailure = 5

The maxSendQSize was tuned for a much lower-volume forwarder and I forgot to update it for this instance, so I will increase it. This instance also appears to have increased from 30-50MB/s to closer to 100MB/s, so I'll increase the autoLBVolume setting as well.


hrawat_splunk · ‎05-21-2024

That's great feedback. We will add output group.

gjanders · ‎05-09-2024

Thank you very much for the detailed reply; that gives me enough to action now.

I appreciate the contributions to the community in this way.

