When I forward some of our netflow traffic to Splunk Stream (a dedicated streamfwd instance) at around 100-300k flows/s, I see the following error message:
NetFlowDecoder::decodeFlow Unable to decode flow set data. No template definition received with id 256 from source 2 . Dropping flow data set of size 416
I assume this is because some of the templates are missing. After a few minutes I also get the following messages:
agentMode: 1 level: ERROR message: Netflow processing queues are full for NetflowReceiver #2. Dropped 274671 packets
This problem shows up when I forward some of our high-volume IPFIX netflow data.
It does not show up when I forward the lower-volume NetFlow v9 data; those just produce the template definition errors, which vanish after some time.
CPU does not seem to be the problem: the load is around 700-750% and not yet maxing out the cores.
I assume that there are some buffering issues. There were apparently similar issues when using nfsen.
Is there a way to increase the buffers?
[streamfwd]
ipAddr = 0.0.0.0
processingThreads = 32
dedicatedCaptureMode = 0
httpRequestSenderThreads = 4
httpRequestSenderConnections = 40

#netflowReceiver.0.port = 3000
#netflowReceiver.0.protocol = udp
#netflowReceiver.0.ip = 192.168.20.5
#netflowReceiver.0.decoder = netflow

netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow

netflowReceiver.1.port = 3002
netflowReceiver.1.protocol = udp
netflowReceiver.1.ip = 192.168.20.5
netflowReceiver.1.decoder = netflow

netflowReceiver.2.port = 3011
netflowReceiver.2.protocol = udp
netflowReceiver.2.ip = 192.168.20.5
netflowReceiver.2.decoder = netflow

netflowReceiver.3.port = 3012
netflowReceiver.3.protocol = udp
netflowReceiver.3.ip = 192.168.20.5
netflowReceiver.3.decoder = netflow

netflowReceiver.4.port = 3013
netflowReceiver.4.protocol = udp
netflowReceiver.4.ip = 192.168.20.5
netflowReceiver.4.decoder = netflow

netflowReceiver.5.port = 3014
netflowReceiver.5.protocol = udp
netflowReceiver.5.ip = 192.168.20.5
netflowReceiver.5.decoder = netflow

netflowReceiver.6.port = 3021
netflowReceiver.6.protocol = udp
netflowReceiver.6.ip = 192.168.20.5
netflowReceiver.6.decoder = netflow

netflowReceiver.7.port = 3022
netflowReceiver.7.protocol = udp
netflowReceiver.7.ip = 192.168.20.5
netflowReceiver.7.decoder = netflow

netflowReceiver.8.port = 3023
netflowReceiver.8.protocol = udp
netflowReceiver.8.ip = 192.168.20.5
netflowReceiver.8.decoder = netflow
Configuring processing threads in Stream is a bit complicated:
The processingThreads parameter sets the number of "regular" Stream passive packet processing/deep packet inspection threads, while netflow processing threads are configured with a different parameter:
netflowReceiver.0.decodingThreads = NN (I added it to the documentation http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/ConfigureFlowcollector#Configur...)
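For concreteness, a minimal single-receiver stanza using that parameter might look like the sketch below; the port and IP are taken from the config above, while the thread count of 8 is purely an illustrative value, not a sizing recommendation:

```ini
[streamfwd]
ipAddr = 0.0.0.0

netflowReceiver.0.port = 3001
netflowReceiver.0.protocol = udp
netflowReceiver.0.ip = 192.168.20.5
netflowReceiver.0.decoder = netflow
# Netflow decoding threads for this receiver -- separate from
# the global processingThreads (DPI) setting:
netflowReceiver.0.decodingThreads = 8
```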
Sorry about causing this confusion.
On a side note, a single netflow listening socket with a sufficient number of decoding threads should be able to handle 100K-300K netflow records/sec, so I believe you should not need to configure 9 listening sockets. I'd also recommend adding a load balancer between the Stream forwarder and your HEC-enabled indexers to fan out Stream netflow events.
Yeah, that would be another issue. But we currently forward data for indexing only very selectively; the volume is just one of the issues.
Licensing aside: based on the Splunk reference hardware (100 GB/day), this would require ~200 indexers (i.e. 2400 cores) without redundancy, so Splunk Enterprise might not be the 'right' solution for this.
It would be totally awesome though ..
Reality aside: are the events not automatically load balanced when I configure multiple receiving indexers in Distributed Forwarder Management?
Or is it more like round robin, i.e. a DoS on indexer 1, then on indexer 2, and so on, instead of spreading events across all indexers all the time?
@fanning out incoming netflow traffic
That is a pretty easy thing for us to do, as we receive all the netflow traffic on one instance in our cloud first and then distribute it to different applications. We do this for several reasons; if you are interested, I can elaborate.
We use our UDP samplicator https://github.com/sleinen/samplicator
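The core fan-out idea samplicator implements (receive each datagram once, resend it verbatim to N destinations) can be sketched in a few lines of Python; the addresses and the run-limit parameter below are illustrative and this is not samplicator's actual code or interface:

```python
import socket

def fan_out(listen_port, destinations, count=1):
    """Receive UDP datagrams on 127.0.0.1:listen_port and copy each
    one, unmodified, to every (host, port) pair in destinations.
    samplicator does the same thing in C, with spoofing and
    sampling options on top."""
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(("127.0.0.1", listen_port))
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for _ in range(count):  # samplicator loops forever; we stop after `count`
        data, _addr = recv.recvfrom(65535)
        for dest in destinations:
            send.sendto(data, dest)
    recv.close()
    send.close()
```

A real deployment would of course use the C tool; the sketch just shows why the replication step is cheap relative to decoding.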
I have now upgraded the instance to 16 cores and 16 GB RAM.
@netflowReceiver.MM.decodingThreads = NN
Thanks for documenting the parameter; I had already discovered it in http://docs.splunk.com/Documentation/StreamApp/7.0.1/DeployStreamApp/Performancetestresults
But probably I did not use it correctly...
It seems to be working, I do not see any drop messages any more from that host.
I have now reduced the configuration to a single socket, which led to the following observation:
I suspect that we actually lose data, which seems to be confirmed when I check:
I can see an increasing number of UDP drops on the receiving interface.
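Those drop counters can be read from the kernel's /proc/net/snmp on Linux; a small parsing sketch (the two-line "Udp:" header/values layout is the standard Linux format, and RcvbufErrors is the counter that rises when a socket receive buffer overflows):

```python
def parse_udp_counters(snmp_text):
    """Extract the kernel's UDP counters from /proc/net/snmp content.

    The file contains a 'Udp:' header line naming the counters and a
    second 'Udp:' line with their values; we zip them into a dict.
    RcvbufErrors counts datagrams dropped because a socket receive
    buffer was full -- the kind of drop suspected above."""
    lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = lines[0][1:], lines[1][1:]
    return dict(zip(header, map(int, values)))

# On a live Linux host:
# with open("/proc/net/snmp") as f:
#     print(parse_udp_counters(f.read())["RcvbufErrors"])
```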
Increasing the UDP buffers:

sudo sysctl -w net.core.rmem_max=<buffer size>
sudo sysctl -w net.core.rmem_default=<buffer size>
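To keep the larger buffers across reboots, the same values can go into a sysctl drop-in file; the file name and the 32 MB value below are purely illustrative, not a tuning recommendation from this thread:

```ini
# /etc/sysctl.d/99-netflow-udp-buffers.conf
# Example value only -- pick a size appropriate for your flow rate.
net.core.rmem_max = 33554432
net.core.rmem_default = 33554432
```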
It looks like with a large enough buffer ... it is all the same. I attribute the slightly lower estimated indexed volume to the time of day; it usually declines a bit.
I will leave it running over a few days to see how it keeps up.