I ran into a major problem for which I can't find a real fix.
I have tried all versions of the Forwarder (Linux x64), from 7.x.x to 8.x.x, and the problem is always the same.
I have a directory with millions of small files; about 150,000 new ones are generated per day, continuously. To manage this path, the Forwarder ends up using several GB of RAM instead of the usual few MB.
I tested
ignoreOlderThan = 1h (also 5m)
in inputs.conf, but the result does not change: usage stays at several GB (about 20% of the system RAM).
Is there a method to avoid this excessive consumption?
There are some parameters in limits.conf that you could try tweaking, but they can limit your functionality.
In the end it may turn out that you've simply bitten off much more than you can chew.
After all, if you want to monitor many files, you need memory for file descriptors and all the required OS structures, so it might not be possible to monitor that many files with reasonable memory usage.
I already tried
maxKBps = <integer>
to cap the thruput, but that's not the problem, and
max_fd = <integer>
to cap the fd cache, but even with low values (for example 5 instead of the default 100) consumption is still high.
I know, the problem is in the OS and its file management. I thought Splunk could smooth this out.
No way. I'll have to come to terms with it 👍 👊
Well, I have a forwarder monitoring Exchange logs from several servers (around a dozen or so; I'd estimate some 2, maybe 3 thousand files altogether). It can easily eat several gigabytes of RAM. If you have much more than that, it's obvious you'll need huge amounts of memory. I'd think about splitting the workload between different UFs (for example, sharing the whole directory over NFS/CIFS and using whitelists so that each UF monitors only a partial set of the files).
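A rough way to preview such a split before touching inputs.conf is to count how many files each candidate whitelist regex would select. This is only a sketch: count_matching is a made-up helper, it looks at the top level of the directory only, and grep -E syntax is not identical to the PCRE that Splunk uses, so treat the numbers as an estimate.

```shell
# Hypothetical helper: count the regular files directly under a directory
# whose full path matches an extended regular expression.
# Usage: count_matching <dir> <regex>
count_matching() {
    # The regex is tested against the full path, similar to how a monitor
    # whitelist is applied. grep -c exits non-zero when the count is 0,
    # so its status is ignored and only the printed number matters.
    find "$1" -maxdepth 1 -type f | grep -Ec -- "$2" || true
}
```

For example, count_matching /mnt/shared_logs '/srvA[^/]*\.log$' would print how many files one UF would take; the path and the pattern here are invented for illustration.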
But IMHO it's not that obvious, since I'm skipping unneeded files older than 5 minutes (ignoreOlderThan = 5m), so I'm telling the Forwarder NOT to analyze files with a timestamp older than 5m... Obviously, to get the timestamp, I think it needs to "read" it from the OS and use a file descriptor call, but I really can't understand why it takes so much RAM to skip older files 🙄
Unfortunately I can't manage the logs, since there is an Application Team that does that. And, as said before, there is already a whitelist to take only two types of files. It isn't really needed, since the whole path is full of needed small files, split between two prefixes (e.g. "Job.NODE1.<sessionname>.txt" and "Logging.NODE1.<sessionname>.txt"), and there are 100,000+ of them per day, each very, very small... but that's how they developed the application 🤕, when a single file would be much more useful. I think installing more forwarder instances is useless too, since the RAM consumption of forwarder1 + forwarder2 would add up to roughly that of a single forwarder instance 🙄
At the end of the story, anyway, the Forwarder does its job, the system has 16GB of RAM, the applications work... so... who cares? 😀
I don't know the exact code behind monitor inputs 🙂 but I suspect the directory has to be polled every now and then, which means reading filesystem structures for those thousands of files. That on its own can be a performance hit. Apart from that, the UF keeps the fishbucket database. That also uses some memory...
You can see which files are tracked by the forwarder using
splunk list monitor
It will give you a list of tracked files along with their status. But you probably already know that.
Thanks, I often use "splunk list monitor" (and btool, which is magical 😊 ).
But in this case it's irrelevant, since I KNOW it monitors those files in that path 😎 ... I was thinking that the fishbucket itself could keep the forwarder from "analyzing" those old, irrelevant, already ingested files and so level out the resource usage... but it doesn't, as far as I can see 😤
I'm considering a daily rotation of older files, either removing them from the monitored path or moving them into a subdir, leaving only the last day of files in the monitored path... I'll see.
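The daily rotation idea could look roughly like this. It is a sketch under assumptions: archive_old_files is a hypothetical helper, -maxdepth and -mmin assume GNU or BSD find, and the archive directory is assumed to sit outside the monitored path, since a subdir may still be picked up if the monitor stanza recurses into it.

```shell
# Hypothetical helper: move regular files older than N minutes from the
# top level of a monitored directory into an archive directory.
# Usage: archive_old_files <monitored_dir> <archive_dir> <age_minutes>
archive_old_files() {
    src=$1
    dest=$2
    age_min=$3
    mkdir -p "$dest" || return 1
    # -maxdepth 1: do not descend into subdirectories
    # -type f:     regular files only
    # -mmin +N:    last modified more than N minutes ago
    # One mv per file keeps this portable; with GNU tools,
    # -exec mv -t "$dest" -- {} + batches the moves and is much faster.
    find "$src" -maxdepth 1 -type f -mmin +"$age_min" -exec mv -- {} "$dest"/ \;
}
```

Run once a day from cron, for example as archive_old_files /tmp/FILES /tmp/FILES_archive 1440 (the archive path is only an example).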
Anyway, I think future Forwarder versions could implement something to manage this behaviour, maybe with a proper limit parameter in limits.conf... 3GB of RAM for an agent is, IMO, too much 🙄 But maybe we are up against an OS limit, so the forwarder can't do anything about it! I don't know.
If someone wants to reproduce the issue, here's some data.
THE PACK WITH MULTIPLE FILES,
just an example, but very close to what I ingest in the production environment; I put them in /tmp/FILES/,
THE TA, a simple, basic inputs.conf,
Start Splunk with the TA enabled and look at
ps -eo args,pmem|grep "^[s]plunkd"
splunkd -p 8089 restart 79.6
79.6% of system RAM used 😓😓😓 (the average is 70%)
NOW, let's see with the TA disabled,
ps -eo args,pmem|grep "^[s]plunkd"
splunkd -p 8089 restart 4.8
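pmem only gives a percentage; to compare the two runs in absolute terms, the resident set size can be summed instead. A small sketch, assuming a ps that supports the rss and args output columns (reported in KiB on Linux); sum_rss_mb is a hypothetical helper name.

```shell
# Hypothetical helper: read "rss command-line" rows on stdin and print the
# total resident memory, in MiB, of the rows whose command matches a regex.
# Usage: ps -e -o rss= -o args= | sum_rss_mb '^splunkd'
sum_rss_mb() {
    LC_ALL=C awk -v pat="$1" '
        {
            rss = $1            # first column: resident set size in KiB
            $1 = ""             # drop it so $0 holds only the command line
            sub(/^ +/, "")      # trim the separator left behind
            if ($0 ~ pat) total += rss
        }
        END { printf "%.1f\n", total / 1024 }
    '
}
```

Running it once with the TA enabled and once with it disabled gives two figures in MiB that can be compared directly.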
Heeeeee, man... too simple 😉
The directory has ONLY needed files, and there's also, to be safe, a whitelist to pick up just the right files, and there are, as I said, 150,000 of them per day 🙄
The input already has its proper file monitor... no other files are present in the path, nor would they be touched, since they wouldn't match the input whitelist 😞
Files are correctly ingested, and so are the forwarder metrics... all is fine! But the forwarder takes 3GB of system RAM... 😲