Getting Data In

Splunk forwarder not releasing files

jrwebst
Explorer

All,

I am trying to figure out whether I missed a setting somewhere or whether this is just a Splunk problem. We have an application whose log the Splunk forwarder is monitoring. We needed to reclaim some disk space, so we restarted the application and deleted the current log, and the application started a new log. However, according to lsof, Splunk did not release the deleted file until we restarted the forwarder. I saw one other question about this from a few years back, but it was never answered. Does anyone have any insight into this? Any help would be greatly appreciated.

1 Solution

dwaddle
SplunkTrust

The base behavior here is fundamental Unix. When any process calls the unlink() system call to remove a file, the blocks allocated to the file remain allocated until all processes that have the file open have closed it.
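To make the unlink() behavior concrete, here is a minimal Python sketch (not Splunk code; the function name `unlink_while_open` is made up for illustration). It assumes a Unix-like system and shows that after the name is removed, the data is still readable through the open descriptor until that descriptor is closed.

```python
import os
import tempfile

def unlink_while_open(payload: bytes = b"x" * 4096):
    """Create a file, unlink it while a descriptor is still open,
    and report what is left (hypothetical helper, Unix only)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                       # removes the directory entry only
        name_gone = not os.path.exists(path)  # the path no longer resolves
        st = os.fstat(fd)                     # the inode is still reachable via fd
        os.lseek(fd, 0, os.SEEK_SET)
        still_readable = os.read(fd, len(payload)) == payload
        return name_gone, st.st_nlink, st.st_size, still_readable
    finally:
        os.close(fd)  # last reference dropped; the kernel can now free the blocks

if __name__ == "__main__":
    print(unlink_while_open())
```

The forwarder is in the same position as the descriptor in this sketch: rm removes the name, but the space comes back only once the process holding the file closes it.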

Typically this is not a big problem for Splunk, because Splunk tends to hold files open for only a very short time (a few seconds) and closes them once it detects a brief period of idleness on the file. This idleness window is controlled by the time_before_close setting in inputs.conf. Splunk can keep a file open for a very long time in a few edge cases:

  • If the time_before_close setting is set incredibly high for the file in question
  • If the file is active enough that it never has any "quiet time"
  • If the file has a large amount of data that needs to be processed, and it has been kicked over to a BatchReader thread in response

I would say the BatchReader issue is the most likely cause in this case. There are a couple of things to tune to help with this, such as maxKBps in the [thruput] stanza of limits.conf (to raise the cap on how much data the forwarder sends to the indexers per second) and parallelIngestionPipelines in server.conf (to allow more pipelines to process data in parallel).

I would expect that in most cases Splunk would read to the end of the deleted file fairly quickly (within minutes) and then close it, at which point the kernel would release all of the filesystem blocks allocated to the file. By stopping Splunk, you forced it to close the file, perhaps early, and probably caused some events to be lost (which may be something you don't care about).

adonio
SplunkTrust

Hello there,
When you say "... did not release the deleted file...", do you mean you could not delete the file because it was open in another program, here the UF? Can you elaborate? Can you share the link to the other question?
Can you also elaborate on the insight you are looking for? I assume you are using [monitor://....] for the application log file; is that the case?


jrwebst
Explorer

@adonio

Yes, we are using the [monitor:///] stanza to declare the input, and everything works great. After we delete the file with rm, it is gone from the directory. However, lsof still shows the file as open by the forwarder, and its disk space is still in use. If we restart the forwarder, it then releases the file.
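For anyone who wants to check this without lsof, here is a rough Python sketch of the same check. It assumes Linux, where /proc/<pid>/fd holds a symlink per open descriptor and the kernel appends " (deleted)" to the target of an unlinked file. The function name `deleted_open_files` is made up for illustration, and reading another process's descriptors requires running as the same user or as root.

```python
import os

def deleted_open_files(pid="self"):
    """Return (fd, target) pairs for descriptors of `pid` that point at
    an unlinked file. Hypothetical helper; Linux /proc only."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found

if __name__ == "__main__":
    # Pass the forwarder's PID instead of "self" to inspect splunkd.
    for fd, target in deleted_open_files():
        print(fd, target)
```

If the deleted log keeps showing up for the forwarder's PID long after the rm, the forwarder is still holding it open.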

koshyk
Super Champion

We had a similar issue, but it turned out to be caused by antivirus software scanning the same file at the time.
Please make sure that AV or other programs are not locking the file.
