Set this flag when you don't want to ingest all of the historical data in the file you are monitoring. It is similar to running
tail -f filename in Unix: Splunk will only send the latest data.
Folks, let's not vote tgow's post down. It's not wrong, and he or she answered correctly. Yes, we posted this as a softball to talk about the problems with followTail, but tgow's answer was first, and correct. We can vote mine up if you like.
The short version is: do not use followTail.
(At the risk of diluting this post, as amrit points out, ignoreOlderThan may do what you need more clearly and understandably. Documenting that setting here seems like overkill though.)
The long answer is: followTail can be used to make Splunk permanently ignore the data already present in old log files. This can be useful when there is a very large amount of historical data, and the work required to index it and make it searchable is not a good tradeoff against the delay before you can begin using the current data.
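For illustration, a minimal inputs.conf sketch of that use case might look like the following. The monitored path is hypothetical; only the followTail setting itself comes from the discussion here.

```ini
# inputs.conf (sketch). The monitored path is hypothetical.
[monitor:///var/log/myapp/app.log]
# Start reading at the end of the file the first time it is seen,
# skipping whatever historical data is already in it.
followTail = 1
```

This is intended only as a temporary setting, for the reasons given in the rest of this answer.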
To use followTail safely, the following steps can be used:
DO NOT leave followTail enabled.
In short, leaving followTail enabled on an ongoing basis will not behave as you might imagine. There will likely be edge cases where some of the data you expect to be indexed is not.
The following problems exist at later dates if you leave followTail enabled in an ongoing fashion:
Thanks for the great post, Josh. I think this should be added as part of the official documentation.
Before today, my understanding was that it would just grab any new records at the end of the log and would be just as reliable as starting at the beginning of the log.
This is my new understanding for the follow tail event loss potential:
followtail=1 + (log file roll or new log file) + (high volume log or high load on host or forwarder restart) = possible event loss near head of the log
I hope Splunk will update the followTail code or options to correct scenarios where events could be lost. Possibly an option to auto-disable once all inputs have been scanned once, or somehow compare splunkd startup time to event time in the log being indexed?
I've added some information along these lines to inputs.conf.spec, which is the real doc for followTail. I've talked with the docs team about where this might live in the web documentation. Probably this becomes a 'best practice' description around selectively ingesting from a historical log archive, which would probably suggest ignoreOlderThan (see amrit) with a fallback to the approach listed above.
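As a rough sketch of the ignoreOlderThan alternative, an inputs.conf stanza could look like this. The path and the 7-day window are hypothetical values chosen for illustration, and inputs.conf.spec remains the authoritative description of the setting.

```ini
# inputs.conf (sketch). The monitored path and the 7-day window are hypothetical.
[monitor:///var/log/myapp]
# Skip files whose modification time is more than 7 days in the past.
ignoreOlderThan = 7d
```

As I understand it, the check is made against each file's modification time rather than the timestamps of individual events, so a file that is still being written to is read in full even if it contains old events.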
In Splunk 6, the screen "Add new Data inputs » Files & directories » Add new" says of Follow Tail that "This only applies to the file the first time Splunk sees it. After that, Splunk's internal file position records keep track of it."
Does this mean that the problems described above no longer exist?
@BenAveling: No, this is essentially unchanged for Splunk 6. One of the problems with followTail is that it's impossible to describe succinctly. For example, in that copy, what does "sees it" mean? What file? What is the first time? Consider the case where you are monitoring a directory in an ongoing fashion that you set up 6 months ago, and a new filename comes to exist. That's "the first time Splunk sees it", so we'll just not index some of the beginning of the file. continued..
That's the relatively easy-to-understand problem, and in some environments you can assure yourself that you're never going to be adding new filenames, or, if you are, that you don't care about missing the beginning of the file the first time the filename appears. However, there are additional edge cases that are impossible to handle with the intersection of the concept of followTail and the overall design of Splunk Tailing. One of the two has to be changed pretty fundamentally to make it reliable. continued...