I have two identical Linux appliances with "capture software A" installed on appliance 1 and Splunk Stream installed on appliance 2.
Both are connected in an identical way to datacenter switch TAP ports to monitor and capture HTTP traffic.
What I found is that "capture software A" captures packets: 1,2,3,4,5
while Splunk Stream captures packets 1,2,4,5
I am not excluding the possibility that packet "3" never arrived at appliance 2, or was dropped by the Ethernet interface before reaching the Splunk Stream layer.
Is there a way to verify that the network interface is working as expected, or whether it is being pushed to its limits and dropping packets?
It would be great to have some way to diagnose the reliability and completeness of the capture process, as we're dealing with an online banking portal here.
Gleb
Approximately how much traffic (bps) are you capturing, and do you see any WARN or ERROR messages in index=_internal sourcetype=stream:log, such as "Max packet queue size exceeded"?
Stream tracks an internal metric called DroppedPackets that it records to index=_internal sourcetype=stream:stats. This represents the number of packets received by the network interface but not processed. You can get a report on this using the following search:
index=_internal sourcetype=stream:stats
| spath output=DroppedPackets path=sniffer{}.captures{}.droppedPackets
| eventstats sum(DroppedPackets) by _cd
| rename sum(DroppedPackets) as SumDroppedPackets
| streamstats current=t global=f window=2 earliest(SumDroppedPackets) as prev latest(SumDroppedPackets) as curr by host
| eval delta=curr-prev
| eval absdelta=case(delta<=0, 0, delta>0, delta)
| timechart sum(absdelta) as delta by host
If this seems high, try upgrading to 6.1.1, as it fixes most issues related to DroppedPackets.
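On the OS side of the question (whether the interface itself is dropping frames before Stream ever sees them), the kernel exposes per-interface counters under /sys/class/net/<iface>/statistics/, and `ethtool -S <iface>` shows additional driver-level counters. Here is a minimal sketch that reads the relevant sysfs counters; "lo" is just a safe placeholder interface name, so substitute your actual capture interface:

```python
from pathlib import Path

def nic_drop_counters(iface: str) -> dict:
    """Read per-interface RX error/drop counters from sysfs."""
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    wanted = ("rx_dropped", "rx_errors", "rx_fifo_errors", "rx_missed_errors")
    # Each counter file contains a single integer; skip any the driver omits.
    return {name: int((stats_dir / name).read_text())
            for name in wanted if (stats_dir / name).exists()}

if __name__ == "__main__":
    # "lo" is a placeholder; point this at the capture interface (e.g. eth1).
    for name, value in nic_drop_counters("lo").items():
        print(f"{name}: {value}")
```

If these counters grow while you capture, packets are being lost at the NIC/driver level before any capture software is involved, which would explain missing packets in both tools, not just Stream.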
Thank you so much!
This was a great help.
I just briefly looked into that and see dropped packets in the hundreds of thousands (out of approximately 1.5M events per day total).
I will try an upgrade and see if there is an improvement.
I also noticed that a few times my real-time alerts were not triggered, even though the data was indexed. When I run the alert query manually, it finds alertable results for which no alerts were triggered. Not sure if this is related, but it seems some sort of congestion is going on.
It could be. If your indexers are having problems, they can create a backlog that blocks Stream, which in turn would result in lots of missing packets. You really should not be seeing ANY DroppedPackets at all.
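One way to check for that kind of indexer-side congestion is to look for blocked queues in splunkd's metrics.log. This is a standard splunkd health search rather than anything Stream-specific, and it assumes the usual metrics.log fields (group, name, blocked) are present:

```
index=_internal source=*metrics.log* group=queue blocked=true
| timechart count by name
```

Spikes here during the same windows as your missed alerts would support the congestion theory.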