Solved: Re: When is it appropriate to set followTail to 't...

hexx · ‎09-05-2012

What is the appropriate use of 'followTail' for file monitor inputs in inputs.conf? In which cases is it useful to set this to 'true'? What are the implications of doing so?

jrodman · ‎09-05-2012

The short version is: do not use followTail.

(At the risk of diluting this post, as amrit points out, ignoreOlderThan may do what you need more clearly and understandably. Documenting that setting here seems like overkill though.)

The long answer is: followTail can be used to make Splunk permanently ignore the data already present in old log files. This can be useful when there is a very large amount of historical data, and the work required to index it and make it searchable is not considered a good tradeoff against getting started on the current data promptly.

To use followTail safely, the following steps can be used:

1. Set up a monitor stanza to match the files you will want to handle, for which you only want data arriving after the initial glance.
2. For this stanza, enable followTail = true
3. Restart, or start Splunk, to make this setting active
4. Wait enough time to mark all these files as read (say, a minute for every few thousand files). This can be slower if tailing is held back for any normal reasons.
5. Remove the followTail setting from your stanza
6. Restart Splunk without this setting.
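
As a rough illustration of the first two steps above, a temporary stanza might look like the sketch below. The monitored path, index and sourcetype are made up for the example; only the followTail setting itself comes from the steps.

```
# inputs.conf (temporary; the path, index and sourcetype are hypothetical)
[monitor:///var/log/myapp]
index = main
sourcetype = myapp_logs
# Skip what is already in these files the first time they are seen.
# Remove this line and restart once the existing files are marked as read.
followTail = true
```

After the wait, the same stanza with the followTail line deleted is what stays in place.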

DO NOT leave followTail enabled.

In short, leaving followTail enabled in an ongoing fashion is not going to behave as you might imagine. Likely there will end up being edge cases where not all data you expect to be indexed will be indexed.

The following problems exist at later dates if you leave followTail enabled in an ongoing fashion:

- followTail cannot reliably handle files created in the monitored location while Splunk is down or in the process of restarting.
- followTail cannot reliably handle files created in the monitored location while Splunk is still doing its initial scan for existing files (a second or so to many minutes, depending upon whether tens or hundreds of thousands of files are being monitored).
- For files where the beginning of the file is correctly skipped as desired, atomic replacement of the file contents (which may happen in log rotation cases) can cause the beginning of the new contents to be skipped as well.

edonze · ‎04-14-2014

If it is more important for me not to re-index every line of a file with current timestamps than to avoid possibly missing some lines on the off chance that the splunkforwarder service isn't running, is there a reason not to use followTail?

mario_traf · ‎06-16-2014

edonze, I changed the HIST settings to add an epoch time:
readonly HISTSIZE=5000
readonly HISTTIMEFORMAT="%y/%m/%d %T "
readonly HISTCONTROL=ignoreboth
I also centralised all bash history into one location using HISTFILE, so that all history files could be monitored easily.

the_wolverine · ‎04-15-2014

What jrodman said - this is probably not related to followTail. It might be another setting that is causing it: http://answers.splunk.com/answers/8059/splunk-duplicating-events-every-time-file-changes

edonze · ‎04-15-2014

Since I commented out followTail = 1 in the stanza below, Splunk consumes the entire file each time it is updated. Instead of a few entries with a current timestamp, I get hundreds of lines with a current timestamp, even though the previous lines have been consumed before.

[monitor:///root/.bash_history]
index = os
sourcetype = cmdhistory
source = root
disabled = 0
#followTail = 1
ignoreOlderThan = 1d

jrodman · ‎04-14-2014

Splunk does not reindex anything, at least by design or intent. Avoiding reindexing should never require followTail. If you have a scenario where it does, that's a separate question/answer.

pj_elia · ‎01-13-2016

One scenario where I wanted to avoid re-indexing data was with our DR or backup intermediate forwarder. The backup intermediate forwarder has Splunk disabled until needed in a disaster recovery scenario. All the inputs (syslog archives, and NFS-mounted filesystems with aggregated logs) on the backup intermediate forwarder are the same as the inputs on the primary intermediate forwarder. When the backup intermediate forwarder is brought up, it would read and forward all the historical data written since it was last run, which needs to be avoided. In this case, doesn't using followTail = 1 make sense? If not, what would be a better approach for a backup intermediate forwarder?

BenAveling · ‎10-02-2013

I can see why writing a reliable follow tail is inherently hard.

But isn't it a problem that always exists, regardless of whether or not the follow tail flag is ticked?

i.e. If you don't have a way to know if you've read a record before then your options are to discard it, and possibly lose data, or to index it, and possibly duplicate data.

If so, then would it be fair to say that follow-tail=true means that records that are potential duplicates are discarded and that follow-tail=false means that records that are potential duplicates are indexed?

jrodman · ‎10-02-2013

Right now, since this is an edge-case feature in the first place, there's not a lot of pressure to build a totally different replacement. If you think we're dumb or that reliable filtering of old data is important to you, then please do tell sales, support, etc. The normal channel is to file an enhancement request with Splunk Support.

jrodman · ‎10-02-2013

That's the relatively easy-to-understand problem, and in some environments you can assure yourself that you're not ever going to be adding new filenames, or if you are that you don't care about missing the beginning of the first time the filename appears. However there are additional edge cases that are impossible to handle with the intersection of the concept of followTail and the overall design of Splunk Tailing. One of the two has to be changed pretty fundamentally to make it reliable. continued...

jrodman · ‎10-02-2013

@BenAveling: No, this is essentially unchanged for Splunk 6. One of the problems with followTail is that it's impossible to describe succinctly. For example, in that copy, what does "sees it" mean? What file? What is the first time? Consider the case where you are monitoring a directory in an ongoing fashion that you set up 6 months ago, and a new filename comes to exist. That's "the first time Splunk sees it", so we'll just not index some of the beginning of the file. continued...

BenAveling · ‎10-02-2013

In splunk 6, the screen "Add new Data inputs » Files & directories » Add new" says of Follow Tail that "This only applies to the file the first time Splunk sees it. After that, Splunk's internal file position records keep track of it."

Does this mean that the problems described above no longer exist?

jrodman · ‎09-06-2012

Rob Jordan:
I've added some information along these lines to inputs.conf.spec, which is the real doc for followTail. I've talked with the docs team about where this might live in the web documentation. Probably this becomes a 'best practice' description around selectively ingesting from a historical log archive, which would probably suggest ignoreOlderThan (see amrit) with a fallback to the approach listed above.
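
For comparison, here is a minimal sketch of the ignoreOlderThan approach mentioned above. The path, index, sourcetype and the 7d threshold are assumptions chosen for illustration. The setting works on each file's modification time rather than on positions within a file, so it is worth checking inputs.conf.spec for the exact behavior in your version before relying on it.

```
# inputs.conf (the path, index, sourcetype and threshold are hypothetical)
[monitor:///var/log/archive]
index = main
sourcetype = myapp_logs
# Ignore files whose modification time is more than 7 days in the past.
ignoreOlderThan = 7d
```

Unlike followTail, this leaves recently modified files to be read from the beginning.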

amrit · ‎09-05-2012

what about ignoreOlderThan?

Rob_Jordan · ‎09-05-2012

Thanks for the great post, Josh. I think this should be added as part of the official documentation.

Before today, my understanding was that it would just grab any new records at the end of the log and would be just as reliable as starting at the beginning of the log.

This is my new understanding for the follow tail event loss potential:
followtail=1 + (log file roll or new log file) + (high volume log or high load on host or forwarder restart) = possible event loss near head of the log

I hope Splunk will update the followTail code or options to correct scenarios where events could be lost. Possibly an option to auto-disable once all inputs have been scanned once, or some way to compare splunkd startup time to event time in the log being indexed?

Thanks,

Rob

tgow · ‎09-05-2012

Set this flag when you don't want to ingest all of the historical data in the file you are monitoring. Similar to doing a tail -f filename in Unix. Splunk will only send the latest data.

jrodman · ‎09-05-2012

Folks, let's not vote tgow's post down. It's not wrong, and he or she answered correctly. Yes, we posted this as a softball to talk about the problems with followTail, but tgow's answer was first, and correct. We can vote mine up if you like.
