When is it appropriate to set followTail to 'true'?

hexx
Splunk Employee

What is the appropriate use of 'followTail' for file monitor inputs in inputs.conf? In which cases is it useful to set this to 'true'? What are the implications of doing so?

1 Solution

jrodman
Splunk Employee

The short version is: do not use followTail.

(At the risk of diluting this post, as amrit points out, ignoreOlderThan may do what you need more clearly and understandably. Documenting that setting here seems like overkill though.)
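For reference, a minimal sketch of what the ignoreOlderThan alternative might look like. The monitored path, index, sourcetype, and the 7d window are illustrative assumptions, not values from this thread.

```ini
# inputs.conf: skip files whose modification time is more than 7 days old.
# Path, index, sourcetype, and the 7d window are placeholders.
[monitor:///var/log/myapp]
index = main
sourcetype = myapp_logs
ignoreOlderThan = 7d
```

Note that ignoreOlderThan looks at the file's modification time, not at the timestamps of the events inside it, so a file that is still being written to is read from the beginning, old events included.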

The long answer is: followTail tells Splunk to permanently ignore the data already present in existing log files. This can be useful when there is a very large amount of historical data, and the work required to index it and make it searchable is not a good tradeoff against getting started with the current data quickly.

To use followTail safely, the following steps can be used:

  1. Set up a monitor stanza matching the files you want to handle, that is, the files for which you only want data that arrives after the initial glance.
  2. In this stanza, set followTail = true
  3. Start or restart Splunk to make the setting active
  4. Wait long enough for all these files to be marked as read (say, a minute for every few thousand files). This can take longer if tailing is held back for any of the normal reasons.
  5. Remove the followTail setting from your stanza
  6. Restart Splunk without the setting.
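The steps above can be sketched roughly as follows. The path, index, and sourcetype are hypothetical placeholders; only the followTail line comes from the steps themselves.

```ini
# inputs.conf: first start only (steps 1 to 3).
# Path, index, and sourcetype are placeholders, not from this thread.
[monitor:///var/log/myapp]
index = main
sourcetype = myapp_logs
followTail = true

# After the wait in step 4, delete the followTail line (step 5)
# and restart (step 6), leaving only:
#
# [monitor:///var/log/myapp]
# index = main
# sourcetype = myapp_logs
```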

DO NOT leave followTail enabled.

In short, leaving followTail enabled on an ongoing basis will not behave as you might imagine. There will likely be edge cases where some of the data you expect to be indexed is not.

The following problems arise later if you leave followTail enabled:

  • followTail cannot reliably handle files created in the monitored location while Splunk is down or in the process of restarting.
  • followTail cannot reliably handle files created in the monitored location while Splunk is still doing its initial scan for existing files (anywhere from a second or so to many minutes, depending on whether tens or hundreds of thousands of files are being monitored).
  • For files where the beginning was correctly skipped as desired, an atomic replacement of the file contents (which may happen in log rotation) can cause the beginning of the new contents to be skipped as well.

edonze
Path Finder

If it is more important to me not to re-index every line of a file with current timestamps than to avoid possibly missing some lines on the off chance that the splunkforwarder service isn't running, is there a reason not to use followTail?

mario_traf
New Member

edonze, I changed the HIST settings to add an epoch timestamp:
readonly HISTSIZE=5000
readonly HISTTIMEFORMAT="%y/%m/%d %T "
readonly HISTCONTROL=ignoreboth
I also centralised all bash history into one location using HISTFILE, so that all the history files can be monitored easily.
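A minimal sketch of how such a central location might be monitored. The directory /var/log/bash_history is a hypothetical path (the post does not say where HISTFILE points), and the index and sourcetype are simply reused from edonze's stanza elsewhere in this thread.

```ini
# inputs.conf: monitor a central directory of per-user history files.
# /var/log/bash_history is a hypothetical location; use whatever
# directory HISTFILE actually writes into.
[monitor:///var/log/bash_history]
index = os
sourcetype = cmdhistory
disabled = 0
```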

the_wolverine
Champion

What jrodman said - this is probably not related to followTail. It might be another setting that is causing it: http://answers.splunk.com/answers/8059/splunk-duplicating-events-every-time-file-changes

edonze
Path Finder

Since I commented out followTail = 1 in the stanza below, Splunk consumes the entire file each time it is updated. So instead of a few new entries with a current timestamp, I get hundreds of lines with a current timestamp, even though the earlier lines have already been consumed.

[monitor:///root/.bash_history]
index = os
sourcetype = cmdhistory
source = root
disabled = 0
#followTail = 1
ignoreOlderThan = 1d

jrodman
Splunk Employee

Splunk does not reindex anything, at least by design or intent. Avoiding reindexing should never require followTail. If you have a scenario where it does, that's a separate question/answer.

pj_elia
Engager

One scenario where I wanted to avoid re-indexing data was our DR (backup) intermediate forwarder. Splunk is disabled on the backup intermediate forwarder until it is needed in a disaster recovery scenario. All of its inputs (syslog archives and NFS-mounted filesystems with aggregated logs) are the same as the inputs on the primary intermediate forwarder. When the backup intermediate forwarder is brought up, it reads and forwards all the historical data since it was last run, which needs to be avoided. In this case, doesn't using followTail = 1 make sense? If not, what would be a better approach for a backup intermediate forwarder?

BenAveling
Path Finder

I can see why writing a reliable follow tail is inherently hard.

But isn't it a problem that always exists, regardless of whether or not the follow tail flag is ticked?

i.e. If you don't have a way to know if you've read a record before then your options are to discard it, and possibly lose data, or to index it, and possibly duplicate data.

If so, then would it be fair to say that follow-tail=true means that records that are potential duplicates are discarded and that follow-tail=false means that records that are potential duplicates are indexed?

jrodman
Splunk Employee

Since this is an edge-case feature in the first place, there isn't much pressure right now to build a totally different replacement. If you think we're dumb, or if reliable filtering of old data is important to you, then please do tell sales, support, etc. The normal channel is to file an enhancement request with Splunk Support.

jrodman
Splunk Employee

That's the relatively easy-to-understand problem, and in some environments you can assure yourself that you're not ever going to be adding new filenames, or if you are that you don't care about missing the beginning of the first time the filename appears. However there are additional edge cases that are impossible to handle with the intersection of the concept of followTail and the overall design of Splunk Tailing. One of the two has to be changed pretty fundamentally to make it reliable. continued...

jrodman
Splunk Employee

@BenAveling: No, this is essentially unchanged for Splunk 6. One of the problems with followTail is that it's impossible to describe succinctly. For example, in that copy, what does "sees it" mean? What file? What is the first time? Consider the case where you have been monitoring a directory on an ongoing basis since you set it up 6 months ago, and a new filename comes to exist. That's "the first time Splunk sees it", so we'll just not index some of the beginning of the file. continued..

BenAveling
Path Finder

In splunk 6, the screen "Add new Data inputs » Files & directories » Add new" says of Follow Tail that "This only applies to the file the first time Splunk sees it. After that, Splunk's internal file position records keep track of it."

Does this mean that the problems described above no longer exist?

jrodman
Splunk Employee

Rob Jordan:
I've added some information along these lines to inputs.conf.spec, which is the real doc for followTail. I've talked with the docs team about where this might live in the web documentation. Probably this becomes a 'best practice' description around selectively ingesting from a historical log archive, which would probably suggest ignoreOlderThan (see amrit) with a fallback to the approach listed above.

amrit
Splunk Employee

what about ignoreOlderThan?

Rob_Jordan
Explorer

Thanks for the great post, Josh. I think this should be added as part of the official documentation.

Before today, my understanding was that it would just grab any new records at the end of the log and would be just as reliable as starting at the beginning of the log.

This is my new understanding for the follow tail event loss potential:
followtail=1 + (log file roll or new log file) + (high volume log or high load on host or forwarder restart) = possible event loss near head of the log

I hope Splunk will update the followTail code or options to correct the scenarios where events could be lost. Possibly an option to auto-disable once all inputs have been scanned once, or to somehow compare splunkd startup time to the event time in the log being indexed?

Thanks,

Rob

tgow
Splunk Employee

Set this flag when you don't want to ingest all of the historical data in the file you are monitoring. Similar to doing a tail -f filename in Unix. Splunk will only send the latest data.

jrodman
Splunk Employee

Folks, let's not vote tgow's post down. It's not wrong, and he or she answered correctly. Yes, we posted this as a softball to talk about the problems with followTail, but tgow's answer was first, and correct. We can vote mine up if you like.
