We are using the Splunk forwarder (splunkforwarder-6.6.2-4b804538c686-linux-2.6-x86_64-manifest) to gather logs from Java applications configured to use logback.xml with a rolling policy. Sometimes, when a log file is rolled and compressed, the Splunk process goes into a loop and consumes a lot of CPU:
 PID USER    PR NI VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
9064 splunk  20  0 253m 157m  12m S 81.2  4.1 129:18.10 splunkd
9072 splunk  20  0 70316 11m 1456 S  3.7  0.3   5:32.13 splunkd
/opt/splunkforwarder/var/log/splunk/splunkd.log is full with messages like this:
03-19-2018 16:00:13.237 +0100 INFO ArchiveProcessor - Handling file=/export/logs/app/app-2018-03-19.5.log.gz
03-19-2018 16:00:13.237 +0100 INFO ArchiveProcessor - reading path=/export/logs/app/app-2018-03-19.5.log.gz (seek=0 len=6818141)
03-19-2018 16:00:13.283 +0100 WARN ArchiveProcessor - Could not access archive=/export/logs/app/app-2018-03-19.5.log.gz, because:retrieve: key=0x963d68b47a9905d1 is already locked with state=0x7f2436c97200
03-19-2018 16:00:13.283 +0100 INFO ArchiveProcessor - Handling file=/export/logs/app/app-2018-03-19.5.log.gz
03-19-2018 16:00:13.284 +0100 INFO ArchiveProcessor - reading path=/export/logs/app/app-2018-03-19.5.log.gz (seek=0 len=6818141)
03-19-2018 16:00:13.327 +0100 WARN ArchiveProcessor - Could not access archive=/export/logs/app/app-2018-03-19.5.log.gz, because:retrieve: key=0x963d68b47a9905d1 is already locked with state=0x7f2436c97200
It also spawns a lot of child processes and updates /var/run/splunk/splunkd.pid.
The creation time of the app-2018-03-19.5.log.gz file correlates with the start of the CPU spike. The source code and documentation of Logback [https://logback.qos.ch/documentation.html] indicate that it does not use temporary files when rolling and compressing logs.
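For context, a rolling policy that produces file names like app-2018-03-19.5.log.gz (date plus index, .gz suffix) is typically a SizeAndTimeBasedRollingPolicy where the .gz extension in fileNamePattern tells Logback to compress each rolled file. A minimal sketch follows; the appender name, paths, sizes, and pattern are assumptions for illustration, not our actual config:

```xml
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/export/logs/app/app.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <!-- The .gz suffix makes Logback compress each rolled file -->
    <fileNamePattern>/export/logs/app/app-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
    <maxFileSize>10MB</maxFileSize>
    <maxHistory>7</maxHistory>
  </rollingPolicy>
  <encoder>
    <pattern>%d %-5level %logger - %msg%n</pattern>
  </encoder>
</appender>
```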
The likely culprit is followTail. The Admin Manual says: "Do not use followTail for rolling log files (log files that get renamed as they age)."
Try one of these settings. Notice that I changed ignoreOlderThan to allow for outages:
[monitor:///export/logs/app/*.log]
ignoreOlderThan = 3d

[monitor:///export/logs/app/]
ignoreOlderThan = 3d
whitelist = ^/export/logs/app/app.*.log$
followTail = 0 did not solve the issue; the archive still gets locked with a state. Restarting the splunk service resolves the issue temporarily, as the lock is lifted and the same archive is then processed with "was already indexed as a non-archive, skipping".
The next step would be to exclude *.gz files from processing, which might mean that in some scenarios we are losing log entries.
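If we do go that route, one way to exclude the compressed archives is a blacklist in the monitor stanza of inputs.conf. A sketch, reusing the monitor path from above; the regex is illustrative and would need to match our actual archive naming:

```
[monitor:///export/logs/app/]
ignoreOlderThan = 3d
# Skip compressed archives so ArchiveProcessor never touches them
blacklist = \.gz$
```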
Hi, thank you for your answer. Could you be more specific about which setting might be causing the issue? The only one I can think of is followTail = 1. The issue doesn't happen consistently, only in some cases when the logs are rolled over and compressed.