Log messages received by our central loghost can take up to 2 hours to become visible on the indexer.
Our network hardware and servers send their messages via syslog to a central loghost running syslog-ng, which filters the messages into their respective files. This amounts to about 1600 log files with current data. The log files rotate daily at midnight (their names are of the form service-20130410.log). A universal forwarder then monitors those folders/files for changes and forwards the data to a Splunk indexer. Altogether we index about 12GB of data per day.
A typical monitor stanza looks like this:
[monitor:///logs/splunk/servicetype]
host_segment = 4
index = main
ignoreOlderThan = 3d
sourcetype = servicetype
The indexer runs a certain search every hour that requires data from the past hour. The 1AM and 2AM searches come up empty every day, and there have also been instances of the 3AM search coming up empty. After that point all subsequent searches return data as they should; there is no more delay. I have checked the indexer and it should not be the bottleneck.
Disk I/O, CPU (24 cores), and RAM (32GB) should not be a problem on the loghost server, although the UF is constantly maxing out one core.
There is a delay between the time files are created and when the universal forwarder notices and forwards them.
How can I tune this to speed this up?
Are 1600 monitored files considered a high or a low number for the universal forwarder?
Kind regards, Mitja
Perhaps you are pushing the envelope a little during peak hours. By default, the UF's output is limited to 256KBps (configurable). 12GB/day averages out to about 138KBps, but that is only an average; peak-hour traffic can easily exceed the cap.
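If throttling turns out to be the cause, the limit lives in the [thruput] stanza of limits.conf on the forwarder. A minimal sketch, assuming you are comfortable with the extra bandwidth (the value 1024 here is just an illustration; 0 means unlimited):

```
# $SPLUNK_HOME/etc/system/local/limits.conf on the universal forwarder
[thruput]
# default is 256 (KB per second); raise it, or set to 0 for no limit
maxKBps = 1024
```

Restart the forwarder after changing this for it to take effect.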
Also, you could have a case of blocked queues on the indexer side.
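A quick way to check for blocked queues is to search the indexer's internal metrics; a sketch (queue names vary by version):

```
index=_internal sourcetype=splunkd group=queue blocked=true
| stats count by host, name
```

If a queue shows up here repeatedly, the back pressure propagates to the forwarders and delays delivery.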
To find out how the UF is performing when reading the files, you could also check the REST API on the UF itself.
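For example, the tailing processor's status endpoint reports which files the UF is currently reading and how far along it is. A sketch, assuming default credentials and the default management port 8089; adjust both for your environment:

```
curl -k -u admin:changeme \
  https://localhost:8089/services/admin/inputstatus/TailingProcessor:FileStatus
```

With 1600 monitored files, this output also gives a feel for how long one pass over the file set takes.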
Also, you should install the S.O.S (Splunk on Splunk) app, which is great for diagnosing problems like this.
To expand on what Kristian posted, try running this search:
index=_internal sourcetype=splunkd "current data throughput"
| rex "Current data throughput \((?<kb>\S+)"
| eval rate=case(kb < 500, "256", kb >= 500 AND kb < 520, "512", kb >= 520 AND kb < 770, "768", kb >= 770 AND kb < 1210, "1024", 1=1, ">1024")
| stats count as Count sparkline as Trend by host, rate
| where Count > 4
| rename rate as "Throughput rate(kb)"
| sort -"Throughput rate(kb)", -Count
It is one I baked into the Forwarder Health app.