I have periodically seen issues where log entries take longer than expected to show up on our indexers (rarely, 2-3 hours in some cases). This morning I found that some of the specific logs being monitored had had no events forwarded in over 24 hours.
I have a saved search that runs every morning, and when reviewing the results today I only saw data for 3 of our 4 Domain Controllers. Looking into why data was missing from the 4th DC, I ran the search host=hostname earliest=-24h, which returned plenty of events, but only from two sources (one of the five Windows Event Logs I have configured, plus a text-based DNS log).
I checked splunkd.log and did not find anything relevant for troubleshooting this issue. I know I can use https://127.0.0.1:8089/services/admin/inputstatus/TailingProcessor:FileStatus to check the status of monitored files, but I don't know of an equivalent to check on monitored Windows Event Logs.
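For reference, that file-status endpoint can also be queried from the command line on the forwarder itself. This is only a sketch: the admin credentials are placeholders, and -k (skip certificate validation) is assumed because the management port uses a self-signed certificate by default:

```
curl -k -u admin:yourpassword https://127.0.0.1:8089/services/admin/inputstatus/TailingProcessor:FileStatus
```

The response lists each monitored file with its read position, which helps confirm whether the tailing processor is keeping up with file-based inputs like the DNS log.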
After restarting the Splunk Universal Forwarder, I started to see events on the indexers from most (but not all) of the configured Windows Event Logs. My theory is that, since this server generates so many events across these logs, the forwarder is just getting "stuck" trying to keep up with some of the sources.
Does anyone have any helpful techniques to troubleshoot this issue, or possible ways to configure the Universal Forwarder to better keep up with all logs?
series="c:\program files\splunkuniversalforwarder\var\log\splunk\metrics.log", kbps=0.227753, eps=0.190876, kb=7.159180, ev=6, avgage=0.333333, maxage=1
series="c:\program files\splunkuniversalforwarder\var\log\splunk\splunkd.log", kbps=0.002765, eps=0.095438, kb=0.086914, ev=3, avgage=0.333333, maxage=1
series="c:\dnslogs\dnslog.txt", kbps=21.568287, eps=0.668066, kb=677.977539, ev=21, avgage=982.238095, maxage=1737
series="wineventlog:security", kbps=230.208470, eps=197.715849, kb=7236.373047, ev=6215, avgage=4854.831858, maxage=4858
So, across the Security log, the DNS log, and Splunk's own logs, the total is close to 256 KBps.
Is it normal for Splunk to focus on one or two logs at a time and ignore the others for this long (24+ hours, as was the case this morning)? It would seem better to read a little from every configured input, so that all sources fall behind by roughly the same amount.
I'm not aware of an easy way to prioritize inputs... however, your issue is easily solvable by increasing the default 256KB/s to a higher value in limits.conf:
[thruput]
maxKBps = 1024
Choose a value that makes sense for your forwarder's sources; see http://docs.splunk.com/Documentation/Splunk/latest/Admin/Limitsconf for more information.
For debugging this in the future, check the logs the forwarder sends to the _internal index; they will tell you when rate limiting was triggered.
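A search along these lines should surface those throttling messages. The exact wording of the ThruputProcessor log line can vary by version, so treat the quoted string as an assumption and loosen it if you get no results:

```
index=_internal host=<forwarder_host> source=*splunkd.log* ThruputProcessor "current data throughput"
```

If you see these events clustered around the times your sources fell behind, the maxKBps limit was the bottleneck; if not, the lag is coming from somewhere else in the pipeline.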
I set maxKBps=1024 on the Splunk Forwarder. In metrics.log I noticed throughput went up at times, but stayed nowhere near the new limit of 1024, as seen below (in the next comment, due to the character limit per comment here). I didn't see any increase in CPU usage by Splunk: Splunk's CPU usage stays between 1-2%, and overall CPU usage ranges between 25-50%. But there is still a lag in events getting indexed.
group=persourcethruput, series="c:\program files\splunkuniversalforwarder\var\log\splunk\metrics.log", kbps=0.237910
group=persourcethruput, series="c:\program files\splunkuniversalforwarder\var\log\splunk\splunkd.log", kbps=0.002765
group=persourcethruput, series="c:\dnslogs\dnslog.txt", kbps=8.165494
group=persourcethruput, series="wineventlog:security", kbps=309.745440
In metrics.log I almost always see references to splunkd.log, metrics.log, dnslog.txt, and the Windows Security log. Even so, the latest Security-log entry I see in Splunk is from almost an hour ago. Less often I see references to the Windows System and Application logs; the last System-log entry in Splunk is from over 6 hours ago.
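To quantify how far behind each source is on this host, a search along these lines can help. The index and host values are placeholders, and the assumption is that _time reflects the original event timestamp rather than the time of indexing:

```
| tstats max(_time) AS latest WHERE index=* host=hostname BY source
| eval lag_minutes = round((now() - latest) / 60, 1)
| sort - lag_minutes
```

Running this periodically would show whether the lag is concentrated in the same few sources or rotates among them.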
I have analyzed performance data on the server and verified that there is no bottleneck in CPU, memory, disk, or network.
Are there any troubleshooting steps I can take to see why the Forwarder is still falling behind on forwarding events? For example, is there some way to see why the Splunk Forwarder is not reading from certain logs (sources) at a given point in time, when there are certainly new events in them? metrics.log shows which sources are being read and the throughput for each, but not why other sources are not currently being read.
If you're certain it's the forwarder lagging behind, you can continue debugging here: http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/
If not, you could poke around on the indexer for performance issues using the SoS app: http://apps.splunk.com/app/748/