All,
I am trying to figure out if there is a setting I may have missed somewhere or if this is just a Splunk problem. We have an application whose log the Splunk forwarder is monitoring. We needed to reclaim some disk space, so we restarted the application and deleted the current log, and the application rolled out a new log. However, according to lsof, Splunk did not release the deleted file until we restarted the forwarder. I saw one other question about this from a few years back, but it was never answered. Does anyone have any insight into this? Any help would be greatly appreciated.
The base behavior here is fundamental Unix: when any process calls the unlink() system call to remove a file, the blocks allocated to the file remain allocated until every process that has the file open has closed it.
Typically this is not a big problem for Splunk, as Splunk tends to keep a file open only for a short window (a few seconds) and closes it once a brief period of idleness is detected on the file. This idleness window is controlled by the time_before_close option in inputs.conf.
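For illustration, a minimal monitor stanza on the forwarder might look like the sketch below; the path, index, and sourcetype are placeholders for this example, and the time_before_close value just spells out the few-second default mentioned above:

    [monitor:///var/log/myapp/app.log]
    # hypothetical index and sourcetype for this example
    index = main
    sourcetype = myapp
    # close the file after 3 seconds with no new data
    time_before_close = 3

Keep in mind that time_before_close only controls how quickly an idle file is closed; it does not help if Splunk has not yet read to the end of the file.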
Splunk can keep a file open for a very long time in a few edge cases, for example when a large file is handed off to the BatchReader thread. Usually, I would say that the BatchReader case is the most likely problem here. There are a couple of things you can tune to help with this, like maxKBps in limits.conf (to allow the forwarder to send more data to the indexers at once) and parallelIngestionPipelines in server.conf (to allow more pipelines to process data in parallel).
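As a rough sketch of those two knobs on the forwarder (the values are illustrative, not recommendations for your environment):

    # limits.conf
    [thruput]
    # raise the forwarder's output throughput cap (KB per second); 0 means unlimited
    maxKBps = 0

    # server.conf
    [general]
    # run two ingestion pipeline sets instead of the default one
    parallelIngestionPipelines = 2

Both changes increase CPU, memory, and network usage on the forwarder, so test them before rolling them out broadly.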
I would expect that in most cases Splunk would fairly quickly (within minutes) read to the end of the deleted file and then close it, at which point the kernel would release all of the filesystem blocks allocated to the file. By stopping Splunk, you forced it to close the file early, which probably caused some events to be lost (which may be something you don't care about).
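If you want to confirm the forwarder has caught up on a file before deleting it, I believe you can ask the tailing processor directly from the forwarder's CLI (worth double-checking against your version's docs):

    $SPLUNK_HOME/bin/splunk list inputstatus

The output lists each monitored file with its current read position, so you can check that Splunk has reached the end of the log before removing it.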
Hello there,
When you say "... did not release the deleted file ...", do you mean you could not delete the file because it was open by another program, here the UF? Can you elaborate, and can you share the link to the other question?
Also, what insight are you looking for? I assume you are using [monitor://....] for the application log file; is that the case?
@adonio
Yes, we are using the [monitor:///] stanza to declare the input, and everything works great. After we delete the file with rm it is gone from the directory, but if you look in lsof you can still see that the file is open and the disk space is still being used. If we restart the forwarder, it then releases that file.
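For example, deleted-but-still-open files show up with lsof's +L1 option (files whose link count has dropped to zero); filtering on splunkd is just for illustration:

    # show unlinked files that splunkd still has open
    lsof +L1 | grep splunkd

The disk space only comes back once splunkd closes those file descriptors (or the process is restarted).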
We had a similar issue, but it turned out to be a problem with antivirus scanning the same file at the time.
Please ensure AV or other programs are not locking the file.