Hi,
We had the same problem, but with latencies ranging from minutes up to 15 hours, depending on the traffic load on the Check Point side (this was pre-upgrade, on R77.30). At more than 41,000 events per minute we saw latency build up. Performance and resource usage of the heavy forwarder (HF), the indexers, and the Check Point management server were all fine: no obvious culprit, and no errors or warnings in opseclea:log:modinput.
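The build-up above the ~41,000 events/minute mark is just queueing arithmetic: once the arrival rate exceeds what the collector can pull per minute, the surplus queues up and latency grows without bound. A minimal sketch, with assumed numbers chosen only to mirror that threshold (they are not measurements from our environment):

```python
# Illustrative queueing sketch: why latency grows once arrivals exceed
# the collector's effective throughput. All numbers are assumptions.
ARRIVAL_EPM = 45_000    # assumed events/minute arriving at the log source
CAPACITY_EPM = 41_000   # assumed max events/minute the collector can pull

backlog = 0                                # events waiting to be collected
for minute in range(60):                   # simulate one hour of traffic
    backlog += ARRIVAL_EPM - CAPACITY_EPM  # the surplus queues up each minute

# Extra latency = time needed to drain the backlog at full capacity
latency_minutes = backlog / CAPACITY_EPM
print(f"backlog after 1h: {backlog} events, "
      f"added latency: {latency_minutes:.1f} minutes")
```

With a sustained surplus of 4,000 events/minute, one hour of traffic already adds roughly six minutes of latency, and it keeps climbing as long as the surplus lasts, which matches the multi-hour delays we saw.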
We have NMON running on the HF, so inspecting CPU and memory usage was easy. splunkd didn't do much (0.35 cores on average), and lea_lograbber used close to nothing (0.03). Python, however, continuously used 1.6-1.7 CPU cores. That was the maximum I observed, and it always coincided with more than 41,000 events per minute; below that rate the process used fewer resources.
After the upgrade to R80.10 it got even worse, with latency climbing past 22 hours.
(I think) I managed to solve this simply by setting the log level to INFO instead of DEBUG (which I had assumed was necessary for "debugging" this very problem...). The DEBUG level alone generated half a million _internal debug events per minute...
After changing the level, and setting the starttime on each input to a point a couple of hours back (thus skipping most of the 22 hours of latency), Python's CPU usage dropped to only 0.7 cores, while fw1_loggrabber and splunkd spiked to levels not seen before (about 2 CPU cores each). Around the same time, the metrics log reported indexing of 1.3 million events per minute for a few minutes. It seems the DEBUG log level severely limited the maximum number of events that python/lea_loggrabber could retrieve and send to Splunk.
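For reference, the two changes amount to something like the fragment below. The stanza and key names here are placeholders, not the add-on's documented ones (they vary by add-on version), so check your version's README or the Splunk docs for the exact spelling before copying anything:

```ini
# Hypothetical sketch of the two changes described above.
[opsec_input_example]
# 1. Drop the modular input's own logging from DEBUG to INFO
loglevel = INFO
# 2. Re-anchor collection a few hours back instead of replaying the
#    whole latency backlog (the value here is purely illustrative)
starttime = <a timestamp a couple of hours in the past>
```

The starttime change is what lets you skip the stale backlog; the loglevel change is what actually restores throughput.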
Some time later the resource usage came down: splunkd and lea_loggrabber both run at around 0.1 cores, and Python at 0.03, at roughly 70,000 events per minute.
I'll keep monitoring closely what happens under heavier load, but for now it seems all right.