I want to monitor a folder that has 24,000 files. I only want to collect data from the files that have a change date in 2015, and then only monitor changes after that.
I guess I will have to do an initial collection from the files with a change date of this year, then change the collection to only monitor files that change.
Any help would be appreciated.
The location of the files is c:\log\data.
You can use ignoreOlderThan,
but if you do, beware that it does not work the way most people think it does: once Splunk ignores a file the first time, that file goes onto a blacklist and will never be examined again, even if new data is later written to it!
http://answers.splunk.com/answers/242194/missing-events-from-monitored-logs.html
Also read this one:
http://answers.splunk.com/answers/57819/when-is-it-appropriate-to-set-followtail-to-true.html
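If you do go this route, the inputs.conf stanza would look roughly like this (a minimal sketch, pointed at the c:\log\data location from the question; the 259d value is only a placeholder for "days back to 1 Jan 2015 as of the day you deploy", so recompute it for your own date):
[monitor://c:\log\data]
ignoreOlderThan = 259d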
I have used the following hack to solve this problem:
Create a new directory somewhere else (/destination/path/) and point the Splunk forwarder there. Then set up a cron job
that creates selective soft links in it to any file in the real directory (/source/path/) that has been touched in the last 5 minutes (or whatever your threshold is), like this:
*/5 * * * * cd /source/path/ && /bin/find . -maxdepth 1 -type f -mmin -5 | /bin/sed "s/^..//" | /usr/bin/xargs -I {} /bin/ln -fs /source/path/{} /destination/path/{}
The nice thing about this hack is that you can create a similar cron job to remove links to files that have not changed in a while (if your forwarder has too many files to sort through, even ones with no new data, it will slow WAY down), and if a file ever does get touched again, the first cron will add it back.
Don't forget to set up that second cron to delete the soft links, with whatever logic lets you be sure a file will never be written to again, or you will end up with tens of thousands of files in this directory, too.
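Something along these lines could serve as the cleanup job (a sketch only; the 30-minute schedule and the one-day threshold are assumptions to adjust). With -L, find follows each link to its target, so this lists links whose targets have not been modified in the last 24 hours, and the rm removes the link paths, not the real files:
*/30 * * * * /bin/find -L /destination/path/ -maxdepth 1 -type f -mmin +1440 -print0 | /usr/bin/xargs -0 -r /bin/rm -f
You may also want a separate pass for links whose targets have been deleted outright.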
That's what I needed.
I think I can put the ignoreOlderThan attribute in inputs.conf.
Then, after collecting all the historical data from 2015, I can change the monitor to only tail files that are created, added, or changed. If someone drops in old files, I might get some old info.
Does this sound right?
When I change to monitoring with the tail function, I should not get any of the older files, since they have not changed in years. If the files do somehow change, I should be able to capture them.
Oh, I think these files are overwritten, not appended to. Will tail still work with overwrite?
I think you don't need tail; simple monitoring will do just fine. Also, you can keep the ignoreOlderThan setting on, since any new file or any change will bring the modified date within your ignoreOlderThan limit, so those files will get ingested.
I've not used the followTail setting, but given the caveat in the documentation about its usage, I wouldn't suggest using it on an ongoing basis.
I'm not sure there is any straightforward way to do this. You might have to use the ignoreOlderThan attribute in inputs.conf and give it a value based on the current date so that it includes only the files modified in 2015 (e.g. 259 days as of today). If you don't expect any files with a 2015 modified date to be dropped in later, this should do it.
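For reference, here is one quick way to work out the number of days to plug in (a sketch, assuming GNU date is available on whatever box you use for the arithmetic; it just counts whole days from 1 Jan 2015 to today):
echo $(( ( $(date +%s) - $(date -d 2015-01-01 +%s) ) / 86400 ))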