Universal forwarder high CPU and memory usage

Explorer

I've just installed Splunk Universal Forwarder 4.2.1 on a Linux server. I've pointed it at the whole of /var/log, which amounts to 3220 files, 24 directories and 467MiB of data.

Both CPU and memory usage of the forwarder seem to be way too high. One splunkd process almost continuously uses 100% of one CPU, and this same process is using 525MiB(!) of memory.

I don't see anything pertinent in the splunkd logs. strace of the splunkd process shows it calling futex() and epoll_wait() a lot and not much else...

John.

I ran into a similar problem with the latest version of the Universal Forwarder (5.0.4) eating up pretty much all my 3 hosts' 8 GB of RAM, then eating all the swap, and ultimately causing it to fall over.

What fixed it for me was disabling a rogue input in the Unix app that I'd installed, which was generating large numbers of errors in the splunkd.log along the following lines...

08-02-2013 11:12:17.610 +0100 WARN  FilesystemChangeWatcher - error reading directory "/home/[blahblah]": Permission denied
08-02-2013 11:12:17.699 +0100 ERROR TailingProcessor - Unable to resolve path for symlink: /home/[blahblah].

Basically, it looks as though the Splunk agent wasn't able to read the contents of users' home directories, which seemed to be causing it to leak memory.

The input I ended up disabling was:

[monitor:///home/.../.bash_history]

Since then, no more memory leakage... so far!
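If you suspect a similar noisy input, one way to spot it is to tally WARN and ERROR lines in splunkd.log by component. The following is only a rough sketch: the line layout is assumed from the two sample lines quoted above, and `tally_components` is a made-up helper name, not anything shipped with Splunk.

```python
import re
from collections import Counter

# Assumed layout, based on the two sample lines quoted in this post:
# <date> <time> <tz> <LEVEL> <Component> - <message>
LINE_RE = re.compile(r"^\S+ \S+ \S+ (WARN|ERROR)\s+(\S+) - ")


def tally_components(lines):
    """Count WARN/ERROR lines per (level, component) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The two lines from the post, used as sample input.
sample = [
    '08-02-2013 11:12:17.610 +0100 WARN  FilesystemChangeWatcher - '
    'error reading directory "/home/[blahblah]": Permission denied',
    '08-02-2013 11:12:17.699 +0100 ERROR TailingProcessor - '
    'Unable to resolve path for symlink: /home/[blahblah].',
]

for (level, component), n in tally_components(sample).most_common():
    print(level, component, n)
```

To run it against a real log, pass an open file object instead of `sample`. A component that dominates the tally is a reasonable place to start looking for a misbehaving input.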

SplunkTrust

I too had issues with the Splunk Universal Forwarder taking too much CPU. Here are several items I found that cause extra load.

  1. Directories with large numbers of files:

    a. Fix with whitelists/blacklists.

    b. Broad regex matches (.*log, etc.) cause high CPU when there are many files.

    c. Find hidden directories containing large numbers of files and blacklist them.

    d. Presumably, permission problems on directories with large numbers of files would also increase CPU load.

    e. Presumably, so would large numbers of files with improper event recognition, transformations, etc.

  2. Remove unnecessary configs:

    a. recursive=true and followTail=0 are both defaults and take up additional resources if set explicitly.

    b. Misspellings, or input stanzas that are inappropriate for forwarders, can cause exponentially worse CPU usage as file counts increase.
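To find the directories worth blacklisting, it helps to count files per directory under the monitored path. Here is a minimal sketch using only the Python standard library. The function name `files_per_directory` and the default root of /var/log are my own choices for illustration, and hidden directories are included because `os.walk` does not skip them.

```python
import os
import sys
from collections import Counter


def files_per_directory(root):
    """Return a Counter mapping each directory under root to its file count."""
    counts = Counter()
    # Ignore directories that cannot be read instead of aborting the walk.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        counts[dirpath] = len(filenames)
    return counts


if __name__ == "__main__":
    # Default root is an assumption; pass the monitored path as an argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/log"
    for path, n in files_per_directory(root).most_common(10):
        print(f"{n:8d}  {path}")
```

Directories at the top of the output are the first candidates for a blacklist or a narrower monitor stanza.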

Explorer

"I am wondering what is the maximum number of files one can track before CPU utilization jumps above an average of 8%?"

I have been wondering this as well. In my case the only difference I can find between servers is the number of files in the monitored directories. The high-CPU machines have between 8 and 9 thousand files in total, 6,500 of them in one directory alone. All the other instances have fewer than 2 thousand files to track, and their CPU utilization is very low (averaging less than 1%).

Explorer

I would love to know the total number of files you have in the directories listed in your inputs.conf file.

I have several AIX Universal Forwarder instances that are very similar data-wise and identical physically. Two instances have very high CPU usage by splunkd (averaging 12 to 14%).

The only difference I can find between them is the number of files in the monitored directories. The high-CPU machines have between 8 and 9 thousand files in total, 6,500 of them in one directory alone. All the other instances have fewer than 2 thousand files to track, and their CPU utilization is very low (averaging less than 1%).

I am wondering what is the maximum number of files one can track before CPU utilization jumps above an average of 8%?

Explorer

I have several Splunk universal forwarders with similarly extreme CPU utilization, which eventually crashes the database cluster. learned/props.conf and sourcetypes.conf have only a few lines each. Below is the inputs.conf of one of the UFs that occasionally shows high CPU utilization.

[fschange:/var/lib/mysql-cluster/config.ini]
pollPeriod = 3600
fullEvent = true

[fschange:/var/lib/mysql-cluster/config.ini.bak]
pollPeriod = 3600
fullEvent = true

[fschange:/etc/my.cnf]
pollPeriod = 3600
fullEvent = true

Is it possible to see what splunk is doing when it is using like 95% CPU?

Path Finder

Hi John,

Does the data appear in search via the web GUI?

Which Linux do you use?

The Universal Forwarder seems to grab all available logs and send them to the indexer. If you want to avoid processing old entries, just let the forwarder pick up only what comes in after it starts:

[monitor:///var/log]
followTail = 1
# Determines whether to start monitoring at the beginning of a file or at the end (and then index all events that come in after that)
# If set to 1, monitoring begins at the end of the file (like tail -f).

Explorer

Well, I managed to fix the problem by adding a blacklist for /var/log/nagios/archives (397MiB in 3085 files - 1 new file per day) and /var/log/nagios/spool (constant creation/change of files).

Somehow seems unsatisfactory that the former needs to be excluded...
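Before putting a blacklist into inputs.conf, it can be useful to check which paths a candidate regex would actually exclude. The sketch below does that in Python. Several things here are assumptions on my part: the regex itself (the post does not show the exact blacklist used), the sample file names, and the helper name `split_by_blacklist`. As I understand it, Splunk matches the blacklist regex against the full file path with a PCRE engine, so Python's `re` is only an approximation for simple patterns like this one.

```python
import re

# Hypothetical candidate pattern covering the two directories named in the post.
CANDIDATE = re.compile(r"^/var/log/nagios/(archives|spool)(/|$)")


def split_by_blacklist(paths, pattern=CANDIDATE):
    """Split paths into (kept, excluded) according to the blacklist pattern."""
    kept = [p for p in paths if not pattern.search(p)]
    excluded = [p for p in paths if pattern.search(p)]
    return kept, excluded


# Made-up file names, for illustration only.
paths = [
    "/var/log/messages",
    "/var/log/nagios/nagios.log",
    "/var/log/nagios/archives/nagios-01-01-2011-00.log",
    "/var/log/nagios/spool/checkresults/cAbC123",
]

kept, excluded = split_by_blacklist(paths)
print("kept:", kept)
print("excluded:", excluded)
```

Feeding it a real file listing from the monitored directory shows how many files the forwarder would still have to track after the blacklist is applied.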

Path Finder

From a quick look: etc/apps/learned/local/props.conf (7MB) and sourcetypes.conf (2.7MB) are full of self-learned sourcetypes. I don't know why; maybe it happened during testing (permissions on var/log/nagios/spool/checkresults). I can imagine this causes high CPU. I have no idea why automatic sourcetype learning is enabled on a Universal Forwarder; it doesn't make sense to me. Quick fix: delete etc/apps/learned/local/* and add this to etc/system/local/props.conf: LEARN_SOURCETYPE = false

Explorer

OK, here's the diag file - I delayed a bit as I was a bit unsure what data was included in the diag file. http://www.mediafire.com/?2wkm2tvzwmkvn47

Path Finder

OK, now I'm caught here ;-). I hope it's not a big thing...

I'd like to see the output of:
$SPLUNK_HOME/bin/splunk diag
You might put it on http://www.ge.tt/

Explorer

Hmm, followTail doesn't appear to have helped - still using an obscene amount of CPU and memory for a "small" log monitoring app. 😞

Explorer

Data is entering the index, yes.

CentOS 5.6 x86_64.

I can certainly see if followTail helps, thanks.
