What is the appropriate use of 'followTail' for file monitor inputs in inputs.conf? In which cases is it useful to set this to 'true'? What are the implications of doing so?
The short version is: do not use followTail.
(At the risk of diluting this post, as amrit points out, ignoreOlderThan may do what you need more clearly and understandably. Documenting that setting here seems like overkill though.)
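For reference, here is a minimal sketch of what an ignoreOlderThan-based input might look like. The monitored path, index, and sourcetype are hypothetical placeholders, and the 7d window is only an example value. Treat it as an illustration and check inputs.conf.spec for your version before relying on it.

```ini
# Hypothetical stanza: path, index and sourcetype are placeholders
[monitor:///var/log/myapp]
index = main
sourcetype = myapp_logs
disabled = 0
# Skip files whose modification time is more than 7 days in the past
ignoreOlderThan = 7d
# followTail is deliberately left unset (it defaults to 0)
```

Note that ignoreOlderThan filters whole files by modification time, not individual events, so a large file that is still being written to would still be read from its beginning.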
The long answer is: followTail can be used to make Splunk permanently ignore the data already present in old log files. This can be useful when there is a very large amount of historical data, and the work required to index that data and make it searchable is not considered a good tradeoff against being able to start using the current data promptly.
To use followTail safely, the following steps can be used:
DO NOT leave followTail enabled.
In short, leaving followTail enabled on an ongoing basis is not going to behave as you might imagine. There will likely be edge cases where some of the data you expect to be indexed is not.
The following problems exist at later dates if you leave followTail enabled in an ongoing fashion:
If it is more important for me to not re-index every line of a file with current time stamps than have the possibility of missing some lines of a file on the off chance that the splunkforwarder service isn't running, is there a reason not to use followtail?
edonze, I changed the HIST settings to add in an epoch time:
readonly HISTSIZE=5000
readonly HISTTIMEFORMAT="%y/%m/%d %T "
readonly HISTCONTROL=ignoreboth
I also centralised all bash history into one location using HISTFILE, so that all history files could be monitored easily.
What jrodman said - this is probably not related to followTail. It might be another setting that is causing it: http://answers.splunk.com/answers/8059/splunk-duplicating-events-every-time-file-changes
Since I commented out followTail = 1 in the stanza below, it consumes the entire file each time it is updated. Instead of a few entries with a current timestamp, I get hundreds of lines with a current timestamp, even though the previous lines have been consumed before.
[monitor:///root/.bash_history]
index = os
sourcetype = cmdhistory
source = root
disabled = 0
#followTail = 1
ignoreOlderThan = 1d
Splunk does not reindex anything, at least by design or intent. Avoiding reindexing should never require followTail. If you have a scenario where it does, that's a separate question/answer.
One scenario where I wanted to avoid re-indexing data was with our DR or backup intermediate forwarder. The backup intermediate forwarder has Splunk disabled until needed in a disaster recovery scenario. All the inputs (syslog archives, and NFS-mounted filesystems with aggregated logs) on the backup intermediate forwarder are the same as the inputs on the primary intermediate forwarder. When the backup intermediate forwarder is brought up, it would read and forward all the historical data accumulated since it was last run, which needs to be avoided. In this case, doesn't using followTail = 1 make sense? If not, what would be a better approach for a backup intermediate forwarder?
I can see why writing a reliable follow tail is inherently hard.
But isn't it a problem that always exists, regardless of whether or not the follow tail flag is ticked?
i.e. If you don't have a way to know if you've read a record before then your options are to discard it, and possibly lose data, or to index it, and possibly duplicate data.
If so, then would it be fair to say that follow-tail=true means that records that are potential duplicates are discarded and that follow-tail=false means that records that are potential duplicates are indexed?
Right now, since this is an edge-case feature in the first place, there's not a lot of pressure to build a totally different replacement. If you think we're dumb, or that reliable filtering of old data is important to you, then please do tell sales, support, etc. The normal channel is to file an enhancement request with Splunk Support.
That's the relatively easy-to-understand problem, and in some environments you can assure yourself that you're not ever going to be adding new filenames, or if you are that you don't care about missing the beginning of the first time the filename appears. However there are additional edge cases that are impossible to handle with the intersection of the concept of followTail and the overall design of Splunk Tailing. One of the two has to be changed pretty fundamentally to make it reliable. continued...
@BenAveling: No, this is essentially unchanged for Splunk 6. One of the problems with followTail is that it's impossible to describe succinctly. For example, in that copy, what does "sees it" mean? What file? What is the first time? Consider the case where you are monitoring a directory in an ongoing fashion that you set up 6 months ago, and a new filename comes to exist. That's "the first time Splunk sees it", so we'll just not index some of the beginning of the file. continued...
In splunk 6, the screen "Add new Data inputs » Files & directories » Add new" says of Follow Tail that "This only applies to the file the first time Splunk sees it. After that, Splunk's internal file position records keep track of it."
Does this mean that the problems described above no longer exist?
Rob Jordan:
I've added some information along these lines to inputs.conf.spec, which is the real doc for followTail. I've talked with the docs team about where this might live in the web documentation. Probably this becomes a 'best practice' description around selectively ingesting from a historical log archive, which would probably suggest ignoreOlderThan (see amrit) with a fallback to the approach listed above.
what about ignoreOlderThan?
Thanks for the great post, Josh. I think this should be added as part of the official documentation.
Before today, my understanding was that it would just grab any new records at the end of the log and would be just as reliable as starting at the beginning of the log.
This is my new understanding for the follow tail event loss potential:
followtail=1 + (log file roll or new log file) + (high volume log or high load on host or forwarder restart) = possible event loss near head of the log
I hope Splunk will update the followTail code or options to correct scenarios where events could be lost. Possibly an option to auto-disable once all inputs have been scanned once, or some way to compare splunkd startup time to the event time in the log being indexed?
Thanks,
Rob
Set this flag when you don't want to ingest all of the historical data in the file you are monitoring. Similar to doing a tail -f filename in Unix. Splunk will only send the latest data.
Folks, let's not vote tgow's post down. It's not wrong, and he or she answered correctly. Yes, we posted this as a softball to talk about the problems with followTail, but tgow's answer was first, and correct. We can vote mine up if you like.