Why does the Splunk Forwarder for Linux randomly stop sending and monitoring files?

gymmynzl · ‎12-20-2016

The port is open and listening. The logs directory I am monitoring contains hundreds of thousands of files. But for some reason, the monitor stops and no longer picks up new log files, or even ones that already exist. It's very frustrating and makes me think that the forwarder is unreliable, which for us is a nightmare, as we need it to alert us when no new jobs have been completed.

The monitor is as follows in inputs.conf:

[monitor:///data/logs]
disabled = false
host = MFT
index = support
blacklist = goanywhere*

drwxr-xr-x 18 esb esb 4096 Dec 20 09:08 .
drwxr-xr-x 21 esb esb 4096 Nov 14 23:30 ..
drwxr-xr-x 2 esb esb 2097152 Dec 20 13:33 2016-12-10
drwxr-xr-x 2 esb esb 2109440 Dec 20 18:51 2016-12-11
drwxr-xr-x 2 esb esb 2162688 Dec 12 23:57 2016-12-12
drwxr-xr-x 2 esb esb 2093056 Dec 13 23:57 2016-12-13
drwxr-xr-x 2 esb esb 2211840 Dec 15 02:08 2016-12-14
drwxr-xr-x 2 esb esb 2183168 Dec 15 23:58 2016-12-15
drwxr-xr-x 2 esb esb 2023424 Dec 16 23:58 2016-12-16
drwxr-xr-x 2 esb esb 2043904 Dec 17 23:59 2016-12-17
drwxr-xr-x 2 esb esb 2072576 Dec 18 23:58 2016-12-18
drwxr-xr-x 2 esb esb 2015232 Dec 19 23:58 2016-12-19
drwxr-xr-x 2 esb esb 1802240 Dec 20 18:51 2016-12-20

Inside the directories are files which look like this:

-rw-r--r-- 1 esb esb 10300 Dec 20 18:54 1000001209329.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209330_error_1.log
-rw-r--r-- 1 esb esb 10317 Dec 20 18:54 1000001209330.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209331_error_1.log
-rw-r--r-- 1 esb esb 10325 Dec 20 18:54 1000001209331.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:54 1000001209332_error_1.log
-rw-r--r-- 1 esb esb 10273 Dec 20 18:54 1000001209332.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:54 1000001209333_error_1.log
-rw-r--r-- 1 esb esb 8120 Dec 20 18:55 1000001209333.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:54 1000001209334_error_1.log
-rw-r--r-- 1 esb esb 10293 Dec 20 18:54 1000001209334.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209335_error_1.log
-rw-r--r-- 1 esb esb 10214 Dec 20 18:55 1000001209335.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209336_error_1.log
-rw-r--r-- 1 esb esb 10309 Dec 20 18:55 1000001209336.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209337_error_1.log
-rw-r--r-- 1 esb esb 8131 Dec 20 18:55 1000001209337.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:55 1000001209338_error_1.log
-rw-r--r-- 1 esb esb 9712 Dec 20 18:55 1000001209338.log
-rw-r--r-- 1 esb esb 977 Dec 20 18:55 1000001209339_error_1.log
-rw-r--r-- 1 esb esb 6320 Dec 20 18:55 1000001209339.log
-rw-r--r-- 1 esb esb 971 Dec 20 18:55 1000001209340_error_1.log
-rw-r--r-- 1 esb esb 0 Dec 20 18:55 1000001209340.log
-rw-r--r-- 1 esb esb 0 Dec 20 18:55 1000001209341.log

One directory has 200,000 files in it.

I'm stumped. As I said, the forwarder just gives up.

ridwanahmed · ‎11-30-2020

What was your solution in this case, @gymmynzl: just a better logging setup, or something else?

lguinn2 · ‎01-03-2017

OMG, "only" 45,000 files! There is no absolute "choke" number, but I would not go above 10K active files. (Although perhaps I am conservative for the latest version of Splunk. But still...)

Realize that the Splunk forwarder is examining every one of these 45,000 files to determine if it has been updated. So for each file, Splunk has to keep a record of "what was the prior file status: hash, size, last byte forwarded, last mod time, etc" and it also must access the current file information. It uses a hash to identify the files, and so the hash must be calculated for each file it examines - Splunk can't just look at the directory/inode information. Splunk uses the normal OS mechanisms to do all of this. There isn't any magic.
And, the forwarder does this repeatedly over the set of monitored files, as quickly as it can. Of course, there is some parallelization, but...

The problem becomes that the forwarder uses incrementally more memory and CPU as the number of files increases. Eventually, it starts peaking the CPU and simply can't keep up. If your forwarder is using more than about 5-10% of the CPU, it's not working very efficiently. If it's over 50%, the forwarder is probably thrashing somehow. (I don't know the internal details.)

Setting the ulimits higher may help, but only to a point. Clearly, you can set the ulimit to 64000 files, but that doesn't mean that the forwarder can manage that many! [Note that you do want a high ulimit on an indexer!]

This problem occurs even if only a small percentage of the 45,000 files are actively being updated. Splunk still has to examine every file to determine its status. So here are ideas for mitigating the problem:

1 - Keep a separate directory tree for logs that are no longer updated. For example, if the current logs are in /var/log, rotate old log files into /var/oldlog (not /var/log/old, because a monitor on /var/log also covers its subdirectories). However, be sure to keep the current and most recent log in the current directory, in case Splunk is finishing up with a log file when it rotates.

2 - Particularly if you can't do good log file management, use the ignoreOlderThan = 1d setting in inputs.conf.
This tells the forwarder to simply ignore files whose last modification time is older than the setting (here, more than one day ago). This will help a great deal if the number of active files is relatively small and, of course, if the system never modifies the older files.

3 - If the number of active files is really so high, consider some other options. For example: Are you creating a ton of little files and could they be combined? Or - could you run multiple forwarders and have each forwarder monitor a subset of the files? The second option might work if each forwarder instance is not overloaded and the server has enough resources to run all the forwarders in parallel.

4 - Are you using a heavy forwarder? Avoid the heavy forwarder unless its use is absolutely required. The Universal Forwarder is more efficient and has much greater throughput. It uses less CPU and memory, plus it uses fewer network resources per event. By default the UF caps throughput at 256 KBps (kilobytes per second, the maxKBps setting), but the limit is easily removed using limits.conf.
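As a rough sketch of ideas 2 and 4, the two settings might look like the fragments below. The monitor stanza values are copied from the question. Putting ignoreOlderThan on that stanza, and choosing 0 (unlimited) for maxKBps, are assumptions for illustration rather than tested recommendations for this system.

```ini
# inputs.conf (sketch; stanza values taken from the question)
[monitor:///data/logs]
disabled = false
host = MFT
index = support
blacklist = goanywhere*
# Skip files whose last modification time is more than one day old.
ignoreOlderThan = 1d

# limits.conf (sketch)
[thruput]
# 0 removes the throughput cap; the Universal Forwarder default is 256.
maxKBps = 0
```

One caveat worth keeping in mind: a file that has been skipped because of ignoreOlderThan is generally not re-checked until the forwarder restarts, even if it is written to again later, so the value should be longer than any gap after which a file might still be updated. The forwarder needs a restart to pick up these changes.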

HTH

somesoni2 · ‎12-20-2016

200,000 files is just too many. What is the H/W configuration of the forwarder, and how is the CPU usage (bet it's very high)?

gymmynzl · ‎12-20-2016

No, it's running on a high-end AWS deployment.

gymmynzl · ‎12-20-2016

OK, same issue: there are only 45,000 files and it still chokes, so what is the max that they recommend?

ddrillic · ‎12-20-2016

Absolutely, way too many files, and the forwarder can't handle it. You can validate that the forwarder itself is fine by setting ignoreOlderThan to a day or so (see inputs.conf.spec).

When too many files are present, the forwarder just hangs.

What is your ulimit -n, @gymmynzl?

gymmynzl · ‎12-20-2016

ulimit -Hn is 64000.

It's only monitoring 5 directories, but each of those directories might have 50K log files in it.

ddrillic · ‎12-20-2016

Right, right. Please try ignoreOlderThan = 1d just to validate... I'm going through the exact same thing here...

ulimit -Hn as 64000 is good.

gymmynzl · ‎12-20-2016

Just trying this now.
