Using RHEL6 on a 12-core, 32 GB RAM, relatively idle server (it runs backups at night), running Splunk 4.3.3. Splunk currently has ONE input (/var/log/), forwards everything, and keeps no local copy. We have an enterprise license; this instance acts as a license slave.
I'm sitting here with a shell running 'top' and a browser window on the 'Data inputs' page. I've had to disable all the inputs to get splunkd to STOP consuming 100% CPU. When I toggle the /var/log data input to 'enable', CPU goes to 100%. When I disable it, CPU drops to minimal (0 or 0.3%). I've gone back and forth like this to confirm the cause and effect.
There's nothing special in /var/log/ -- in fact there's no new activity going on at all. The logs under /opt/splunk/var/log/splunk/ are quiet except for the occasional INFO entry from metrics.log. Even when a directory input is enabled (and CPU goes to 100%+), the worst thing logged was an occasional WARN to the effect that a file in the directory was invalid because it was binary.
I've seen this on other systems, but there I attributed it to optimizations or just busy machines; neither explanation applies here.
Ideas?
Thanks,
Solved it by the old Windows trick: uninstalling and re-installing. Something corrupt, somewhere?
I opened a case, and after a condescending reply from their tech support telling me I was ingesting .gz files and such, they pointed me to the online documentation on how to edit the inputs.conf file. Admittedly, I did have it misconfigured initially, but I corrected that days ago. They overlooked the fact that I had disabled all inputs during testing, and could enable/disable the /var/log input on the local machine to reproduce the problem every time -- standard stuff in /var/log, no .gz files. I also confirmed that the inputs.conf file (and sent them a copy) only had this and other /var/log sources in it (again, all disabled during testing).
Go figure...anywho, fixed now.
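For anyone who wants to double-check the "no .gz files, nothing unusual" claim on their own box before opening a case, here is a minimal Python 3 sketch that walks a directory and lists compressed or binary-looking files. The suffix list and the NUL-byte heuristic are my own assumptions for illustration; this is not how splunkd itself classifies files.

```python
import os

# Assumption: an illustrative list of suffixes, not Splunk's own rules.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".zip", ".tgz")


def looks_binary(path, block_size=4096):
    """Heuristic: call a file binary if its first block contains a NUL byte."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(block_size)


def suspicious_files(root):
    """Yield (path, reason) for compressed, binary-looking, or unreadable files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, and broken symlinks
            if name.endswith(COMPRESSED_SUFFIXES):
                yield path, "compressed"
                continue
            try:
                if looks_binary(path):
                    yield path, "binary"
            except OSError:
                yield path, "unreadable"


if __name__ == "__main__":
    for path, reason in suspicious_files("/var/log"):
        print(f"{reason}\t{path}")
```

Run it as a user that can read /var/log. An empty result supports the "standard stuff only" claim; any hits are candidates to exclude from the monitored input.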
It would be best to file a case for this one and upload a diag so we may look at your logs, among other things.
It was just updated and rebooted this morning...
(that was a good answer though!)
What's the uptime on the box? If it's not been rebooted since the leap second addition and you use ntp, that'll cause very high splunkd CPU usage. Google for "leap second linux kernel" - there's a simple fix: stop ntp and manually set the date.
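The uptime check can be sketched in a few lines of Python 3. This assumes the leap second in question is the one inserted at the end of 2012-06-30 UTC (the post does not name a date), and the function names are made up for illustration. It reads /proc/uptime, works out the boot time, and reports whether the box has been up since before that leap second.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the leap second meant here is 2012-06-30 23:59:60 UTC,
# so the first instant after it is 2012-07-01 00:00:00 UTC.
LEAP_SECOND_END = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)


def read_uptime_seconds(path="/proc/uptime"):
    """Return system uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def boot_time(uptime_seconds, now=None):
    """Compute the boot time by subtracting uptime from the current UTC time."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(seconds=uptime_seconds)


def booted_before_leap_second(uptime_seconds, now=None):
    """True if the box was already running when the leap second was inserted."""
    return boot_time(uptime_seconds, now) < LEAP_SECOND_END


if __name__ == "__main__":
    uptime = read_uptime_seconds()
    print("booted:", boot_time(uptime).isoformat())
    print("up since before the leap second:", booted_before_leap_second(uptime))
```

If this prints True, the workaround that was widely circulated at the time was to stop ntpd, reset the clock with `date -s "$(date)"`, and start ntpd again; treat that as something to verify against your distribution's advisory rather than a guaranteed fix. A reboot after the leap second makes the check print False, in which case this cause can be ruled out.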