
Splunk Stream: Why do I receive the error "Unable to decode flow set data" when forwarding high volume NetFlow data?

Communicator

Hi

When I forward some of our NetFlow traffic to Splunk Stream (a dedicated streamfwd instance) at around 100-300k flows/s, I see the following error message:

NetFlowDecoder::decodeFlow Unable to decode flow set data. No template definition received with id 256 from source 2 . Dropping flow data set of size 416

I assume this is because some of the templates are missing. After a few minutes, I also get the following messages:

agentMode: 1
level: ERROR
message: Netflow processing queues are full for NetflowReceiver #2. Dropped 274671 packets

This problem shows up when I forward some of our high-volume IPFIX NetFlow data.
It does not show up with the lower-volume NetFlow v9 traffic, which only produces the template definition errors, and those vanish after some time.

CPU does not seem to be the problem, since the load is not yet maxing out the cores (~700-750%).

I assume there are some buffering issues; there were apparently similar issues when using nfsen.
Is there a way to increase the buffers?

My streamfwd.conf

[streamfwd]
ipAddr = 0.0.0.0
processingThreads = 32
dedicatedCaptureMode = 0
httpRequestSenderThreads=4
httpRequestSenderConnections=40

#netflowReceiver.0.port = 3000
#netflowReceiver.0.protocol = udp
#netflowReceiver.0.ip = 192.168.20.5
#netflowReceiver.0.decoder = netflow

netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow

netflowReceiver.1.port = 3002
netflowReceiver.1.protocol = udp
netflowReceiver.1.ip = 192.168.20.5
netflowReceiver.1.decoder = netflow

netflowReceiver.2.port = 3011
netflowReceiver.2.protocol = udp
netflowReceiver.2.ip = 192.168.20.5
netflowReceiver.2.decoder = netflow

netflowReceiver.3.port = 3012
netflowReceiver.3.protocol = udp
netflowReceiver.3.ip = 192.168.20.5
netflowReceiver.3.decoder = netflow

netflowReceiver.4.port = 3013
netflowReceiver.4.protocol = udp
netflowReceiver.4.ip = 192.168.20.5
netflowReceiver.4.decoder = netflow

netflowReceiver.5.port = 3014
netflowReceiver.5.protocol = udp
netflowReceiver.5.ip = 192.168.20.5
netflowReceiver.5.decoder = netflow

netflowReceiver.6.port = 3021
netflowReceiver.6.protocol = udp
netflowReceiver.6.ip = 192.168.20.5
netflowReceiver.6.decoder = netflow

netflowReceiver.7.port = 3022
netflowReceiver.7.protocol = udp
netflowReceiver.7.ip = 192.168.20.5
netflowReceiver.7.decoder = netflow

netflowReceiver.8.port = 3023
netflowReceiver.8.protocol = udp
netflowReceiver.8.ip = 192.168.20.5
netflowReceiver.8.decoder = netflow

Splunk Employee

Configuring processing threads in Stream is a bit complicated: the processingThreads parameter sets the number of "regular" passive packet processing/deep packet inspection threads, while NetFlow decoding threads are configured with a separate parameter, netflowReceiver.<N>.decodingThreads = NN. (I added it to the documentation: http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/ConfigureFlowcollector#Configur...)
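As an illustration, a minimal receiver stanza with that parameter might look like the sketch below. The IP and port come from the asker's config; the thread count of 12 is a placeholder assumption, not a recommendation:

```ini
[streamfwd]
# Deep-packet-inspection worker threads (NOT used for NetFlow decoding)
processingThreads = 32

# NetFlow decoding threads are configured per receiver:
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.decoder = netflow
netflowReceiver.0.decodingThreads = 12
```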

Sorry about causing this confusion.

On a side note, a single NetFlow listening socket with a sufficient number of decoding threads should be able to handle 100K-300K NetFlow records/sec, so I believe you should not need to configure nine listening sockets. I'd also recommend adding a load balancer between the Stream forwarder and your HEC-enabled indexers to fan out the Stream NetFlow events.

Communicator

@HEC output
Yeah, this would be another issue, but we currently forward data for indexing only very selectively; the volume is just one of the issues.
Licensing aside, based on Splunk reference hardware (100 GB/day), this would require ~200 indexers (i.e., 2400 cores) without redundancy, so Splunk Enterprise might not be the 'right' solution for this.
It would be totally awesome though ..

Reality aside: are the events not automatically load balanced when I configure multiple receiving indexers in Distributed Forwarder Management?
Or is it more like round robin, i.e., a DoS on indexer 1, then on indexer 2, and so on, instead of spreading events across all indexers all the time?

@fanning out incoming netflow traffic
That is a pretty easy thing for us to do, as we receive all the NetFlow traffic on one instance in our cloud first and then distribute it to different applications. We do this for several reasons; if you are interested, I can elaborate.
We use our UDP samplicator https://github.com/sleinen/samplicator
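For readers unfamiliar with the pattern: samplicator essentially re-sends every incoming UDP datagram to a list of downstream receivers. A minimal Python sketch of the same idea (illustrative only; the real tool also supports source-address spoofing and sampling, which this sketch omits):

```python
import socket

def replicate(datagram: bytes, destinations, sock=None):
    """Re-send one UDP datagram to every (host, port) destination."""
    own_sock = sock is None
    if own_sock:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for dest in destinations:
            sock.sendto(datagram, dest)
    finally:
        if own_sock:
            sock.close()

def serve(listen_port: int, destinations, max_packets=None):
    """Receive NetFlow/IPFIX datagrams on listen_port and fan them out."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("0.0.0.0", listen_port))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    count = 0
    while max_packets is None or count < max_packets:
        data, _src = rx.recvfrom(65535)  # max UDP payload size
        replicate(data, destinations, tx)
        count += 1
    rx.close()
    tx.close()
```

A Python loop like this would not keep up with 100-300k flows/s; at that rate the native samplicator (or kernel-level replication) is the right tool, which is why we use it.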


Communicator

I have now upgraded the instance to 16 cores and 16 GB RAM.

@netflowReceiver.MM.decodingThreads = NN
Thanks for documenting the parameter. I had already discovered it in http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/Performancetestresults but probably did not use it correctly.

It seems to be working; I no longer see any drop messages from that host.

  • Decoding occupies 10-14 cores
  • Estimated indexed volume: 18 TB/day

I then reduced the configuration to one socket, leading to the following observations:

  • Decoding needs only 9-12 cores
  • Estimated indexed volume: 14 TB/day

I suspect that we are actually losing data, which seems to be confirmed when I check

cat /proc/net/udp

I can see an increasing number of UDP drops on the receiving interface.
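The per-socket drop counter is the last column of /proc/net/udp, so this check is easy to script. A small sketch (Linux-only; the helper name is mine):

```python
def udp_drops(path: str = "/proc/net/udp") -> int:
    """Sum the per-socket 'drops' counter (last column of /proc/net/udp)."""
    with open(path) as f:
        next(f)  # skip the header line
        return sum(int(line.split()[-1]) for line in f if line.strip())
```

Sampling this twice, a few seconds apart, shows whether drops are still increasing.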

Increasing the UDP buffers

sudo sysctl -w net.core.rmem_max=<buffer size>
sudo sysctl -w net.core.rmem_default=<buffer size>
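Why both sysctls matter: an application can request a larger per-socket buffer via SO_RCVBUF, but the kernel silently caps the request at net.core.rmem_max, and sockets that never call setsockopt get net.core.rmem_default. A quick way to see the cap in effect (the helper is mine; on Linux the granted value is internally doubled for bookkeeping):

```python
import socket

def effective_rcvbuf(requested: int) -> int:
    """Request a UDP receive buffer and return what the kernel actually
    granted (capped at net.core.rmem_max, doubled on Linux)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    s.close()
    return granted
```

If effective_rcvbuf(1 << 30) comes back far below the requested size, the rmem_max ceiling is still in the way.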

Single Socket

  • buffer size = 524288000, some loss
  • buffer size = 1073741824, very little to no loss
  • CPU utilization 800 - 1600 %, generally around 1300 %
  • estimated indexed volume 19 TB/day, measured over 15 min (16:15-16:30)

Multiple Sockets

  • buffer size = 104857600, very little to no loss
  • CPU utilization 1000 - 1400%, generally around 1300%
  • estimated indexed volume 17.5 TB/day, measured over 15 min (17:05-17:20)

It looks like, with a large enough buffer, it is all the same. I attribute the lower estimated indexed volume to the time of day; traffic usually declines a bit by then.

I will leave it running over a few days to see how it keeps up.
