Getting Data In

Does Splunk re-index a file that was ignored due to ignoreOlderThan value, but modified recently?

SplunkTrust

Hi,

I have a monitored folder with ignoreOlderThan set to 4 days. Since the environment is not used frequently, data is not written to the logs daily.

I have an application that generates logs, say transactions.log. The application last wrote to that log 10 days ago, so with ignoreOlderThan set to 4 days the file was no longer being monitored. Today I ran the application again. The application writes to the current transactions.log file until noon, then rolls the file over to transactions.log.<>, creates a new transactions.log, and writes to that until noon the next day.

Some data was written to the older transactions.log file, which was then saved as transactions.log.20140815. After noon, all data was written to the newly created transactions.log.

When I checked Splunk, I don't see the events written before noon today (to the transactions.log that was created 10 days ago and was ignored). I can clearly see that the file's modtime has changed, but Splunk is still not indexing the new data from before noon today.

Does Splunk index new data from a file that was ignored earlier but was modified recently (so it's definitely within the ignoreOlderThan limit of 4 days)?

Thanks in advance.

Motivator

ignoreOlderThan is based on the timestamp of the file. If you have ignoreOlderThan set to, say, 1d, then an old log file last updated 10 days ago but then updated with new log data becomes a candidate for being forwarded/indexed because the file's timestamp changes to the current day.

Communicator

Any update on this? We have set ignoreOlderThan to one day. Will that cause problems in the future, or is one day fine?

SplunkTrust

I disagree with the accepted answer here. I opened a support case on this, and the answer was that once a file is ignored, it remains on the ignore list.

This is what Splunk Support said about "ignoreOlderThan" in that case:

Using that attribute will cause files to be ignored even after new data has been added to them. Once the file is ignored it will always be ignored.

This is not a bug. This is expected behavior for that feature.

The only way to stop this is to remove that feature and restart Splunk.

Thank you, Splunk Support

I would avoid the use of that attribute where possible, as restarting a Splunk forwarder is the only way I know of to reset the ignore property...

Explorer

You are correct. Once a file becomes ignored, it never comes back to being monitored even if its timestamp is updated.

Communicator

Thank you so very much dear

Legend

Moral of the story: Avoid using ignoreOlderThan

Its behavior depends on too many factors, as you can see from all the comments. Instead, follow this practice:

After a log file has been closed and not updated for some time period (you decide how long), move it to another directory.
If your normal log directory is /var/log - move it somewhere not in the tree, such as /old/var/log
But don't move a log file just as soon as it closes - give Splunk some time to finish indexing the file.
Roll log files regularly - I personally would not let a log file stay open for days. Roll every 24 hours or when it gets to some relatively small size (like 10MB).
Read all of @jrodman's remarks.

Write a script or use a log file management tool that is appropriate for your OS. Managing your log file directories can save you disk space, make recovery easier, and make Splunk run faster and cleaner.
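For what it's worth, the housekeeping described above could be sketched roughly like this. It is only a sketch: the function name archive_stale_logs, the example paths, and the 7-day threshold are all made up for illustration, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
#!/bin/sh
# Move log files that have not been modified for a while out of the
# monitored tree. Names, paths and thresholds here are illustrative only.

archive_stale_logs() {
    src_dir=$1        # monitored directory, e.g. /var/log/myapp (hypothetical)
    dest_dir=$2       # archive directory outside the monitored tree
    min_age_days=$3   # leave files alone until they are older than this

    mkdir -p "$dest_dir" || return 1

    # -maxdepth 1 : only look at the top level of src_dir
    # -mtime +N   : age in whole days is strictly greater than N, so the
    #               indexer has had at least N days to finish with the file
    find "$src_dir" -maxdepth 1 -type f -mtime +"$min_age_days" \
        -exec mv -- {} "$dest_dir"/ \;
}
```

Something like `archive_stale_logs /var/log/myapp /old/var/log/myapp 7` run from cron would then keep the monitored directory small. Pick the threshold so that Splunk has comfortably finished indexing a file before it is moved.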

Splunk Employee

I do not agree with this flat advice.

If you have to monitor a location where you have a very large corpus of files that will not change, ignoreOlderThan can be critical to achieving a workable result. The cost of repeatedly calling stat() on a fileset in the hundreds of thousands or millions of files will preclude effective data acquisition on any type of rotating storage or even hybrid storage. Even on solid-state storage, the CPU overhead may become unworkable.

Yes, it's vastly better in many ways to simply not retain your archived data in the location that Splunk monitors. ignoreOlderThan is a workaround for cases where you cannot dictate the policy in the log-storage location. This will lower unavoidable operating-system overhead caused by providing metadata on these files to Splunk, as well as Splunk costs related to content-tracking to determine these files are already handled. There are also disk-space costs with all the content tracking, and in-memory costs in file input to retain some information about all of these files.

But sometimes the choice is simply not available.
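To get a feel for whether a monitored tree is in this territory, a rough count of total files versus files older than a candidate cutoff can help. The sketch below is an assumption-laden illustration, not anything Splunk ships: the function names are invented, find's -mtime works in whole days so the numbers only approximate what ignoreOlderThan would see, and file names containing newlines would be miscounted.

```shell
#!/bin/sh
# Rough sizing helpers for a monitored directory tree (illustrative names).

# Count every regular file under a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Count regular files whose age in whole days is greater than the cutoff.
count_older_than() {
    find "$1" -type f -mtime +"$2" | wc -l | tr -d ' '
}
```

For example, `count_files /var/log` next to `count_older_than /var/log 14` shows how much of the tree is stale data that the monitor would otherwise keep checking.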

Legend

I agree; your comments are right on. That's why I said "avoid" - I didn't mean for the advice to be absolute. Sorry if it came off that way. But I do think that the best way to deal with this problem is to manage the files and the directory tree directly.

If you can't manage the files and directories to remove and archive old files, then ignoreOlderThan does provide a way to minimize the overhead caused by many old/stale files. And in some cases, it may be the best alternative. This whole thread, though, points out some of the gotchas and limitations of ignoreOlderThan. I think it is wise to be aware of these if you do use ignoreOlderThan.

In my experience, huge directory trees are often the cause of indexing delays - I think you describe the problem nicely in your comments. One way or another, admins need to be able to recognize this problem and cope with it if it occurs.

Splunk Employee

I agree and disagree.

If Splunk customers have legit needs to monitor locations where millions of files live, we need to handle it.

However, there are unavoidable runtime costs for doing so, no matter how much we optimize Splunk for it. So it's best to have some ability to reasonably differentiate between "data we want to use" and "data we generated 8 years ago".

That said, the feature is definitely not intended for "skip old data". It doesn't really achieve that goal cleanly, which is where I think most people go wrong.

Splunk Employee

Indeed, this is precisely why the spec file states:

  * As a result, do not select a cutoff that could ever occur for a file
    you wish to index.  Take downtime into account!
    Suggested value: 14d , which means 2 weeks

Splunk Employee

Additionally, you are likely seeing an interaction with modtimes on modern Windows (post-Vista). Windows, for its inscrutable reasons, does not update the modification time on files when the files are modified. Thus a file that is open for many days will become many days old, with the modtime only updated when the file is closed.

Independently, ignoreOlderThan must be very aggressive, because it is designed to improve cases where extremely large numbers of files are present, and even checking the modtime on the files becomes too large an expense to keep up with the data. That said, there is room to document this more explicitly, and we are doing so.

Path Finder

I'm seeing the exact same behaviour, and I believe your description fits my issue. The comments saying that it should re-index the file if the timestamp is modified appear to be incorrect, or else this is a bug. It would be good to get an official comment on this behaviour.

Explorer

We have the same behavior on our 6.2 Windows forwarders that have ignoreOlderThan. New events are not indexed once a file has fallen outside of the monitoring range.

Motivator

ignoreOlderThan is based on the timestamp of the file. If you have ignoreOlderThan set to, say, 1d, then an old log file last updated 10 days ago but then updated with new log data becomes a candidate for being forwarded/indexed because the file's timestamp changes to the current day.

Splunk Employee

I downvoted this post because the spec file says otherwise:

ignoreOlderThan = [s|m|h|d]
* The monitor input will compare the modification time on files it encounters
with the current time. If the time elapsed since the modification time
is greater than this setting, it will be placed on the ignore list.
* Files placed on the ignore list will not be checked again for any
reason until the Splunk software restarts, or the file monitoring subsystem
is reconfigured. This is true even if the file becomes newer again at a
later time.
* Reconfigurations occur when changes are made to monitor or batch
inputs via the UI or command line.

Refer to the link below:

https://docs.splunk.com/Documentation/Splunk/6.5.2/Admin/Inputsconf#Valid_input_types_follow.2C_alon...

Motivator

Quoting from the Splunk documentation, "The setting applies to the modification time of a log file, not the timestamp of the individual events in a file." This is from the section, "Modify default limit on older files" in http://docs.splunk.com/Documentation/Storm/Storm/User/Editinputsconf

Path Finder

Yes, the docs say that, but it doesn't appear to happen in practice. I just did a very simple test with ignoreOlderThan set to 2d.

First I echo'd the date to a file a couple of days ago:

-rw-r--r-- 1 root root      29 Oct 11 04:49 testing-leaving-file-for-2-days.log

I verified that the entry in the log was indexed, then waited 2 days before appending the date again:

# date
Mon Oct 13 10:32:53 UTC 2014

# date && echo `date` >> testing-leaving-file-for-2-days.log
Mon Oct 13 10:33:42 UTC 2014

-rw-r--r-- 1 root root      58 Oct 13 10:33 testing-leaving-file-for-2-days.log

The file was not re-read. I restarted the forwarder and hey presto, it picked it up. It does not behave how the documentation says it should.

Motivator

Interesting that it did not index the new entry until you restarted it. If you are running 6.x code, I'd submit a bug report. If it is an older version they probably would ignore it.

SplunkTrust

My observation: once a file is ignored because its last modification time was older than the ignoreOlderThan value, Splunk will not read that file even if its modification time is later updated. If the forwarder restarts, however, it recalculates the files to be monitored, and this time the updated file gets indexed (and is monitored for changes until it is ignored again). It seems that a file ignore list is maintained within a splunkd session, and a file that has been ignored stays ignored for that session.
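To make that behaviour concrete, here is a toy model in shell. It is not Splunk code and makes no claim about how splunkd is implemented. It only mimics the "once ignored, ignored until restart" pattern reported in this thread. The names IGNORE_LIST, scan_file and restart_session are invented for the illustration, and the age check uses find's whole-day -mtime granularity.

```shell
#!/bin/sh
# Toy model of a per-session ignore list. NOT Splunk code.

IGNORE_LIST=$(mktemp)   # stands in for the in-memory ignore list of a session

# Forget every ignored file, as a forwarder restart is reported to do.
restart_session() {
    : > "$IGNORE_LIST"
}

# Print "read" or "ignored" for one file, given a cutoff in days.
scan_file() {
    file=$1
    cutoff_days=$2

    # Already on the ignore list: the modtime is never looked at again.
    if grep -Fxq -- "$file" "$IGNORE_LIST"; then
        echo ignored
        return 0
    fi

    # Not yet ignored: compare the modtime with the cutoff.
    if [ -n "$(find "$file" -mtime +"$cutoff_days")" ]; then
        printf '%s\n' "$file" >> "$IGNORE_LIST"
        echo ignored
    else
        echo read
    fi
}
```

In this model, a file that is 10 days old is ignored on the first scan with a 4-day cutoff, stays ignored after a `touch` brings its modtime up to date, and is only read again once restart_session has been called.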

Influencer

This was golden insight. Thanks. Led to a fix for a problem I've been chasing for weeks now (related to ignoreOlderThan).