The port is open and listening. The logs directory I am monitoring contains hundreds of thousands of files. But for some reason, the monitor stops working and no longer picks up new log files, or even ones that already exist. It's very frustrating and makes me think the forwarder is unreliable, which for us is a nightmare, as we need it to alert us when no new jobs have been completed.
The monitor stanza in inputs.conf is as follows:
[monitor:///data/logs]
disabled = false
host = MFT
index = support
blacklist = goanywhere*

drwxr-xr-x 18 esb esb 4096 Dec 20 09:08 .
drwxr-xr-x 21 esb esb 4096 Nov 14 23:30 ..
drwxr-xr-x 2 esb esb 2097152 Dec 20 13:33 2016-12-10
drwxr-xr-x 2 esb esb 2109440 Dec 20 18:51 2016-12-11
drwxr-xr-x 2 esb esb 2162688 Dec 12 23:57 2016-12-12
drwxr-xr-x 2 esb esb 2093056 Dec 13 23:57 2016-12-13
drwxr-xr-x 2 esb esb 2211840 Dec 15 02:08 2016-12-14
drwxr-xr-x 2 esb esb 2183168 Dec 15 23:58 2016-12-15
drwxr-xr-x 2 esb esb 2023424 Dec 16 23:58 2016-12-16
drwxr-xr-x 2 esb esb 2043904 Dec 17 23:59 2016-12-17
drwxr-xr-x 2 esb esb 2072576 Dec 18 23:58 2016-12-18
drwxr-xr-x 2 esb esb 2015232 Dec 19 23:58 2016-12-19
drwxr-xr-x 2 esb esb 1802240 Dec 20 18:51 2016-12-20
Inside the directories are files that look like this:
-rw-r--r-- 1 esb esb 10300 Dec 20 18:54 1000001209329.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209330_error_1.log
-rw-r--r-- 1 esb esb 10317 Dec 20 18:54 1000001209330.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209331_error_1.log
-rw-r--r-- 1 esb esb 10325 Dec 20 18:54 1000001209331.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:54 1000001209332_error_1.log
-rw-r--r-- 1 esb esb 10273 Dec 20 18:54 1000001209332.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:54 1000001209333_error_1.log
-rw-r--r-- 1 esb esb 8120 Dec 20 18:55 1000001209333.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209334_error_1.log
-rw-r--r-- 1 esb esb 10293 Dec 20 18:54 1000001209334.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209335_error_1.log
-rw-r--r-- 1 esb esb 10214 Dec 20 18:55 1000001209335.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209336_error_1.log
-rw-r--r-- 1 esb esb 10309 Dec 20 18:55 1000001209336.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209337_error_1.log
-rw-r--r-- 1 esb esb 8131 Dec 20 18:55 1000001209337.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:55 1000001209338_error_1.log
-rw-r--r-- 1 esb esb 9712 Dec 20 18:55 1000001209338.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209339_error_1.log
-rw-r--r-- 1 esb esb 6320 Dec 20 18:55 1000001209339.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:55 1000001209340_error_1.log
-rw-r--r-- 1 esb esb 0 Dec 20 18:55 1000001209340.log
-rw-r--r-- 1 esb esb 0 Dec 20 18:55 1000001209341.log
One directory has 200,000 files in it.
I'm stumped. As I said, the forwarder just gives up.
200,000 files is just too many. What is the hardware configuration of the forwarder, and how is the CPU usage (I bet it's very high)?
Absolutely, that is way too many files and the forwarder can't handle it. You can confirm that the forwarder itself is fine by setting ignoreOlderThan to a day or so (see inputs.conf.spec).
When too many files are present, the forwarder just hangs.
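A minimal sketch of that test, assuming the stanza posted in the question (path, host, index and blacklist are copied from there; only the last line is new):

```ini
[monitor:///data/logs]
disabled = false
host = MFT
index = support
blacklist = goanywhere*
# Skip files whose modification time is more than one day old
ignoreOlderThan = 1d
```

The forwarder needs a restart to pick up the change. Check inputs.conf.spec for your version to see how files that have aged out are treated if they are modified again later.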
What is your ulimit -n, @gymmynzl?
ulimit -Hn is 64000
It's only monitoring 5 directories, but each of those directories might have 50K log files in it.
Right, right. Please try ignoreOlderThan = 1d just to validate... I'm going through the exact same thing here...
A ulimit -Hn of 64000 is good.
OK, same issue. There are only 45,000 files and it still chokes, so what is the maximum that they recommend?
OMG, "only" 45,000 files! There is no absolute "choke" number, but I would not go above 10K active files. (Although perhaps I am conservative for the latest version of Splunk. But still...)
Realize that the Splunk forwarder is examining every one of these 45,000 files to determine if it has been updated. So for each file, Splunk has to keep a record of "what was the prior file status: hash, size, last byte forwarded, last mod time, etc" and it also must access the current file information. It uses a hash to identify the files, and so the hash must be calculated for each file it examines - Splunk can't just look at the directory/inode information. Splunk uses the normal OS mechanisms to do all of this. There isn't any magic.
And, the forwarder does this repeatedly over the set of monitored files, as quickly as it can. Of course, there is some parallelization, but...
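To get a feel for how much work each pass involves, a rough sketch like the one below counts every file under a monitored tree and how many of them were modified recently. It is only an illustration: the function name summarize_tree and the one-day default are assumptions, and it says nothing about how the forwarder itself walks the tree.

```python
import os
import time


def summarize_tree(root, max_age_seconds=86400, now=None):
    """Return (total_files, recently_modified_files) for everything under root."""
    now = time.time() if now is None else now
    total = 0
    recent = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; leave it out of the counts.
                continue
            total += 1
            if now - mtime <= max_age_seconds:
                recent += 1
    return total, recent


# Example call (the path is the one from the question):
# total, recent = summarize_tree("/data/logs")
```

If the second number is a small fraction of the first, most of the forwarder's effort is going into files that will never change again.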
The problem is that the forwarder uses incrementally more memory and CPU as the number of files increases. Eventually, it starts pegging the CPU and simply can't keep up. If your forwarder is using more than about 5-10% of the CPU, it's not working very efficiently. If it's over 50%, the forwarder is probably thrashing somehow. (I don't know the internal details.)
Setting the ulimits higher may help, but only to a point. Clearly, you can set the ulimit to 64000 files, but that doesn't mean that the forwarder can manage that many! [Note that you do want a high ulimit on an indexer!]
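Note that ulimit -n reports the soft limit and ulimit -Hn the hard limit, and it is the soft limit that a process actually runs into. A small sketch for reading both is below. It reports the limits of the Python process itself, so it only approximates the forwarder's limits if it is run as the same user in the same kind of session that starts splunkd. The function name open_file_limits is hypothetical. On Linux, /proc/<pid>/limits for the running splunkd process is the more direct check.

```python
import resource


def open_file_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


soft, hard = open_file_limits()
print(f"open files: soft={soft} hard={hard}")
```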
This problem occurs even if only a small percentage of the 45,000 files are actively being updated. Splunk still has to examine every file to determine its status. So here are ideas for mitigating the problem:
1 - Keep a separate directory tree for logs that are no longer updated. For example, if the current logs are in /var/log, rotate old log files into /var/oldlog (not /var/log/old, which would still sit inside the monitored tree). However, be sure to keep the current and most recent log in the current directory, in case Splunk is still finishing up with a log file when it rotates.
2 - Particularly if you can't do good log file management, use the ignoreOlderThan = 1d setting in inputs.conf. This tells the forwarder to simply ignore files whose last modification time is older than the setting. This will help a great deal if the number of active files is relatively small and, of course, if the system never modifies the older files.
3 - If the number of active files is really so high, consider some other options. For example: Are you creating a ton of little files and could they be combined? Or - could you run multiple forwarders and have each forwarder monitor a subset of the files? The second option might work if each forwarder instance is not overloaded and the server has enough resources to run all the forwarders in parallel.
4 - Are you using a heavy forwarder? Avoid the heavy forwarder unless its use is absolutely required. The Universal Forwarder is more efficient and has much greater throughput. It uses less CPU and memory, plus it uses fewer network resources per event. The UF has a default throughput limit of 256 KBps (kilobytes per second), but the limit is easily raised or removed with the maxKBps setting in the [thruput] stanza of limits.conf (0 means unlimited).
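For the first idea above, moving stale logs out of the monitored tree can be sketched roughly as follows. This is an illustration under stated assumptions, not a drop-in rotation tool: the function name archive_stale_logs and the two-day default are hypothetical, and the archive directory must sit outside the monitored tree (the /var/oldlog, not /var/log/old, point).

```python
import os
import shutil
import time


def archive_stale_logs(active_root, archive_root, max_age_seconds=2 * 86400, now=None):
    """Move files not modified within max_age_seconds out of active_root.

    The relative directory layout is preserved under archive_root.
    Returns the moved paths, relative to active_root.
    """
    active_abs = os.path.abspath(active_root)
    archive_abs = os.path.abspath(archive_root)
    # Refuse an archive inside the monitored tree; it would still be scanned.
    if os.path.commonpath([active_abs, archive_abs]) == active_abs:
        raise ValueError("archive_root must be outside active_root")

    now = time.time() if now is None else now
    moved = []
    for dirpath, _dirnames, filenames in os.walk(active_abs):
        rel_dir = os.path.relpath(dirpath, active_abs)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                age = now - os.stat(src).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if age <= max_age_seconds:
                continue  # recent enough; leave it for the forwarder
            dest_dir = os.path.normpath(os.path.join(archive_abs, rel_dir))
            os.makedirs(dest_dir, exist_ok=True)
            shutil.move(src, os.path.join(dest_dir, name))
            moved.append(os.path.normpath(os.path.join(rel_dir, name)))
    return moved


# Example call with hypothetical paths:
# archive_stale_logs("/data/logs", "/data/oldlogs")
```

Keeping the age threshold comfortably larger than the time it takes the forwarder to finish a file leaves the current and most recent logs in place, as recommended above.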