I am seeing an issue with logs going missing when the forwarder agent is busy, which is blocking us from investigating a serious production issue.
This is happening environment-wide, on all of our servers providing this service.
It looks as if the forwarder is hitting the limit of what it is able to forward. This is clear cut: I see gaps of approx. 30 seconds in one source while events from another source arrive in their place. We can verify locally on the machine that these log files are in fact being written to during the gaps, which confirms that events are being missed, and we can also see that they never arrive at the indexers, which points to the forwarder as the source of the problem. We are using useACK.
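For reference, useACK is enabled on the forwarders in outputs.conf roughly as below (the group name and server addresses here are placeholders, not our real values):

    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = idx1.example.com:9997, idx2.example.com:9997
    useACK = true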
Unfortunately, as far as I can see, there is no mention in the forwarder's logs of any limits being hit or of items being removed from the queue or simply dropped. Naturally there are no errors visible on the Splunk servers either, since they are unaware of the problem.
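In case it helps anyone checking the same thing: assuming the forwarder's internal logs reach the indexers, a search along these lines against metrics.log should show whether any forwarder queue is filling or blocking (the host value is a placeholder):

    index=_internal source=*metrics.log* group=queue host=my-forwarder-host
    | timechart span=1m max(current_size_kb) by name

A quicker variant is to look only for queues that report themselves as blocked:

    index=_internal source=*metrics.log* group=queue blocked=true host=my-forwarder-host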
One of these logs rotates daily and is currently about 1.8 GB; the other logs currently roll approximately every 20 minutes, though this can be as often as every 3 minutes. The main source that is missing events only rolls daily at midnight, so rotation is not the issue for that source.
There were some connection issues and limits being hit, but these were resolved last week and the corresponding warnings are now absent from the logs.
How does Splunk behave when the forwarder is busy? Would it simply drop events without logging anything? Is there any way the performance can be tuned, or any configuration that may be causing this?
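On the tuning side, the two settings I suspect are the forwarder's throughput cap in limits.conf and the output queue size in outputs.conf. A sketch of what raising them might look like (the values are illustrative, not recommendations):

    # limits.conf on the forwarder
    [thruput]
    # Universal Forwarders default to 256 KB/s; 0 removes the cap
    maxKBps = 0

    # outputs.conf on the forwarder
    [tcpout]
    # let more events buffer in memory while the indexers are busy
    maxQueueSize = 10MB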
To add some environment context: we are running a cluster of 8 indexers with 1 active search head. The cluster peers are 6.1.1 Linux 64-bit, while the UFs are 6.1.3 Linux 64-bit hosts.
The issue has been fixed, and the logs were in fact not lost, which is consistent with useACK being enabled and no further connectivity issues being reported. The timestamp extraction configuration on a couple of new IDXs was wrong: these IDXs never received the right props.conf, as the necessary settings had neither been deployed manually nor put on the DS to be pushed out. Presumably the affected events were therefore indexed with incorrect timestamps and fell outside the searched time range, which made them look missing. After correcting props.conf on the involved IDXs and restarting them, the issue was resolved.
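For anyone hitting the same symptom, the missing piece was a per-sourcetype timestamp stanza in props.conf on the indexers. A minimal sketch, assuming an ISO-like timestamp at the start of each line (the sourcetype name and the format string are placeholders for whatever your source actually uses):

    # props.conf on the indexers (deployed via the DS)
    [my_app_sourcetype]
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
    MAX_TIMESTAMP_LOOKAHEAD = 23

The point of MAX_TIMESTAMP_LOOKAHEAD is to stop Splunk scanning past the real timestamp and picking up stray numbers later in the event, which is one way events end up indexed at the wrong time.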