
Why does the Packet Queue Size increase continuously in the Stream App?

swbaik
New Member

Hello,

The Stream config is HTTP with no aggregation, and it has the src, dest, content, and time_taken fields.

The packet queue size chart on the Stream Forwarder Metrics dashboard has accumulated to a large value, and the streamfwd process on the Independent Stream Forwarder runs out of memory.

What does the packet queuing in the packet queue size chart represent?
How does the Stream App analyze the TCP flows of a TCP session?
How does it handle packet loss, long response delays, abnormal TCP session closes, and so on?
Are there any packet flow types that the Stream App does not analyze?

Thanks.


mdickey_splunk
Splunk Employee

How much traffic are you trying to capture, what are your system specs, and how many processingThreads do you have configured? Typically, this is caused by not assigning enough threads (corresponding to CPU cores) to process the packets coming in.
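For reference, that setting lives in streamfwd.conf. A minimal sketch (the thread count here is illustrative; size it to the cores you can dedicate):

[streamfwd]
# roughly one processing thread per CPU core available for packet processing
processingThreads = 16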


swbaik
New Member

Hello mdickey,

Independent Stream Forwarder: 20 cores, 64 GB memory, CentOS 7.1, Splunk 6.5.5, Stream 7.1.1
Traffic: 2 Gbps (average)
streamfwd.conf:
[streamfwd]
processingThreads = 8
maxEventQueueSize = 1000000000
maxPacketQueueSize = 268435456
maxTcpReassemblyPacketCount = 500000000
tcpConnectionTimeout = 120
maxEventAttributes = 2000

The packets come from an aggregator that merges and filters them.
I am wondering why the packets in the packet queue are delayed.

Thanks for your support.
Sungwook


mdickey_splunk
Splunk Employee

Wow, those are really big numbers! I would definitely not recommend setting those so big. It's hard to say exactly what the issue is without digging further into logs or pcaps (best to do that with your SE, if necessary), but here are three possibilities:

  1. You are running Stream as a modular input via Splunk. There is a pretty low performance bottleneck with this architecture, which you would likely hit at 2 Gbps. It will create back-pressure in the event queue, which normally would just cause events to drop with errors. But with these settings you're going to blow out memory, slow everything down, and cause everything to start failing pretty quickly. The only way to make this work is to do more filtering/aggregation at the edge (so far fewer events go to Splunk) OR to use an independent agent configuration. We've tested the latter upwards of 10 Gbps, and it is absolutely a requirement to scale Stream.
  2. Your aggregator isn't sending all the packets necessary. For example, it may be dropping FIN packets, causing connections to never close out in reassembly. Lowering tcpConnectionTimeout (and the corresponding UDP timeout value) may help work around this. Unless you really can't tolerate premature termination of those flows, I'd recommend lowering those regardless. Values as low as 10 are perfectly reasonable, since this is an inactivity timeout. Another thing I've seen a lot is configs that only forward ingress packets, or only forward egress packets, so the flows sit indefinitely waiting for the other side of the "conversation" to arrive. This is easy to check/diagnose using something like tcpdump (see the sketch after this list).
  3. You are doing a lot of decryption or something else that requires heavy processing early in the pipeline, and need more processingThreads. Since you have 20 cores and only 8 threads configured, you could try increasing that value to see if things improve.
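On point 2, here is a quick way to check whether both directions of a flow actually reach the forwarder. A rough sketch (eth0 and 10.0.0.5 are placeholders for your capture NIC and a busy server):

# capture a short sample of TCP traffic to and from the server
tcpdump -i eth0 -nn -c 2000 'tcp and host 10.0.0.5' -w sample.pcap
# replay it one direction at a time; if either command prints nothing,
# the aggregator is only forwarding one side of the conversation
tcpdump -nn -r sample.pcap 'src host 10.0.0.5' | head
tcpdump -nn -r sample.pcap 'dst host 10.0.0.5' | head

And the timeout work-around would look like this in streamfwd.conf (udpConnectionTimeout assumed here as the matching UDP setting):

[streamfwd]
# inactivity timeouts, in seconds
tcpConnectionTimeout = 10
udpConnectionTimeout = 10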

swbaik
New Member

I am wondering about the tcpConnectionTimeout option.
If the tcpConnectionTimeout value is changed to 10, won't the number of events be higher than before?
I think that if packets arrive more than 10 seconds apart within a TCP session, new events will be generated with time_taken and bytes_in values of 0, because there is no matching request packet.
Is that right?

Thanks,
Sungwook


mdickey_splunk
Splunk Employee

That is correct; any connection with no activity/packets for longer than tcpConnectionTimeout would likely result in multiple events being generated for the same connection.
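If you need per-connection totals after lowering the timeout, the split events can be stitched back together at search time. A minimal SPL sketch, assuming the default Stream HTTP field names (src_ip, src_port, dest_ip, dest_port, bytes_in, bytes_out, time_taken):

sourcetype=stream:http
| stats sum(bytes_in) as bytes_in, sum(bytes_out) as bytes_out, sum(time_taken) as time_taken by src_ip, src_port, dest_ip, dest_port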


swbaik
New Member

If so (tcpConnectionTimeout = 10), do fragmented packets disappear from the packet queue?
Setting system performance aside, could packets that are not reassembled still accumulate in the packet queue?

Thanks for your reply.
Sungwook
