Hi
When I forward some of our NetFlow traffic to Splunk Stream (a dedicated streamfwd) at around 100 - 300k flows/s, I see the following error message:
NetFlowDecoder::decodeFlow Unable to decode flow set data. No template definition received with id 256 from source 2 . Dropping flow data set of size 416
I assume this is because some of the templates are missing. After a few minutes I get the following messages:
agentMode: 1
level: ERROR
message: Netflow processing queues are full for NetflowReceiver #2. Dropped 274671 packets
This problem shows up when I forward some of our high-volume IPFIX NetFlow data.
This problem does not show up when I forward the lower-volume NetFlow v9 data; that traffic just shows the template definition errors, which vanish after some time.
CPU does not seem to be the problem, since the load is not maxing out the cores yet (~700 - 750%).
I assume that there are some buffering issues. There were apparently similar issues when using nfsen.
Is there a way to increase the buffers?
My streamfwd.conf
[streamfwd]
ipAddr = 0.0.0.0
processingThreads = 32
dedicatedCaptureMode = 0
httpRequestSenderThreads=4
httpRequestSenderConnections=40
#netflowReceiver.0.port = 3000
#netflowReceiver.0.protocol = udp
#netflowReceiver.0.ip = 192.168.20.5
#netflowReceiver.0.decoder = netflow
netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow
netflowReceiver.1.port = 3002
netflowReceiver.1.protocol = udp
netflowReceiver.1.ip = 192.168.20.5
netflowReceiver.1.decoder = netflow
netflowReceiver.2.port = 3011
netflowReceiver.2.protocol = udp
netflowReceiver.2.ip = 192.168.20.5
netflowReceiver.2.decoder = netflow
netflowReceiver.3.port = 3012
netflowReceiver.3.protocol = udp
netflowReceiver.3.ip = 192.168.20.5
netflowReceiver.3.decoder = netflow
netflowReceiver.4.port = 3013
netflowReceiver.4.protocol = udp
netflowReceiver.4.ip = 192.168.20.5
netflowReceiver.4.decoder = netflow
netflowReceiver.5.port = 3014
netflowReceiver.5.protocol = udp
netflowReceiver.5.ip = 192.168.20.5
netflowReceiver.5.decoder = netflow
netflowReceiver.6.port = 3021
netflowReceiver.6.protocol = udp
netflowReceiver.6.ip = 192.168.20.5
netflowReceiver.6.decoder = netflow
netflowReceiver.7.port = 3022
netflowReceiver.7.protocol = udp
netflowReceiver.7.ip = 192.168.20.5
netflowReceiver.7.decoder = netflow
netflowReceiver.8.port = 3023
netflowReceiver.8.protocol = udp
netflowReceiver.8.ip = 192.168.20.5
netflowReceiver.8.decoder = netflow
@mathiask, in which log file did you see the following message?
agentMode: 1
level: ERROR
message: Netflow processing queues are full for NetflowReceiver #2. Dropped 274671 packets
Configuring processing threads in Stream is a bit complicated: the processingThreads parameter sets the number of "regular" Stream passive packet processing / deep packet inspection threads, while NetFlow processing threads are configured using a different parameter:
netflowReceiver.0.decodingThreads = NN
(I added it to the documentation: http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/ConfigureFlowcollector#Configur...)
Sorry about causing this confusion.
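For example, enabling it on the first receiver would look something like this in streamfwd.conf (the thread count of 8 is just an illustration, not a sizing recommendation):
netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow
# illustrative value - tune the decoding thread count to your core count and flow rate
netflowReceiver.0.decodingThreads = 8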
On a side note, a single netflow listening socket with a sufficient number of decoding threads should be able to handle 100K-300K netflow records/sec, so I believe you should not need to configure 9 listening sockets. I'd also recommend adding a load balancer between the Stream forwarder and your HEC-enabled indexers to fan out Stream netflow events.
@HEC output
Yeah, this would be another issue. But we currently forward data for indexing only very selectively; the volume is just one of the issues.
Licensing aside, based on the Splunk reference hardware (100 GB/day) this would require ~200 indexers (i.e. 2400 cores) without redundancy, so Splunk Enterprise might not be the 'right' solution for this.
It would be totally awesome though ..
Reality aside: Are the events not automatically load balanced when I configure multiple receiving indexers in the Distributed Forwarder Management?
Or is it more like round robin, i.e. a DoS on indexer 1, then on indexer 2, and so on, instead of spreading events across all indexers all the time?
@fanning out incoming netflow traffic
It is a pretty easy thing for us to do, as we receive all the netflow traffic on one instance in our cloud first and then distribute it to different applications. We do this for several reasons; if you are interested, I can elaborate.
We use our UDP samplicator https://github.com/sleinen/samplicator
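For illustration, a minimal samplicator invocation that listens on one UDP port and replicates every datagram to two downstream collectors could look roughly like this (hosts, ports, and flags here are placeholders; see the README in the repo above for the exact options of your version):
# listen on UDP 2000 and copy each received packet to two receivers
samplicate -p 2000 192.168.20.5/3001 192.168.20.6/3001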
I now upgraded the instance to 16 Cores and 16 GB RAM
@netflowReceiver.MM.decodingThreads = NN
Thanks for documenting the parameter. I had already discovered it in http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/Performancetestresults
but probably I did not use it correctly...
It seems to be working now; I do not see any drop messages from that host any more.
I have now reduced the configuration to one socket (a rough sketch is below), leading to the following observations.
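Roughly, the single-socket setup looks like this (port and thread count are illustrative, not the exact values):
[streamfwd]
ipAddr = 0.0.0.0
netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow
# illustrative thread count, sized against the 16 cores mentioned above
netflowReceiver.0.decodingThreads = 12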
I suspect that we actually lose data, which seems to be true when I check:
cat /proc/net/udp
I can see an increasing number of UDP drops on the receiving interface.
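For anyone checking the same thing: the per-socket drop counter is the last column of /proc/net/udp and the local port is shown in hex, so something like this (port 3001 = 0x0BB9 here, purely as an example) shows whether it keeps climbing:
# 3001 decimal = 0BB9 hex; the last field of the matching line is the drop count
watch -n 5 'grep :0BB9 /proc/net/udp'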
Increasing the UDP buffers:
sudo sysctl -w net.core.rmem_max=<buffer size>
sudo sysctl -w net.core.rmem_default=<buffer size>
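For illustration, with a placeholder value of 32 MB (pick a size appropriate for your flow rate; the file name 90-netflow.conf is arbitrary), this would be:
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.rmem_default=33554432
# persist across reboots (assuming a sysctl.d-style system)
echo 'net.core.rmem_max=33554432' | sudo tee -a /etc/sysctl.d/90-netflow.conf
echo 'net.core.rmem_default=33554432' | sudo tee -a /etc/sysctl.d/90-netflow.conf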
[Results compared: single socket vs. multiple sockets]
It looks like with a large enough buffer ... it is all the same. The lower estimated indexed volume I attribute to the time of day; it usually declines a bit.
I will leave it running over a few days to see how it keeps up.