I would caveat that there are design limitations with the current tailing system that will probably prevent monitoring a pool of file effectively in the millions range.
Typically, it's true that at very large numbers of files, the aggregate data rate becomes a problem first (how much data can the forwarder digest per second (cpu limited), how fast can it transmit over the network (sometimes cpu limited, sometimes network limited), or most commonly how fast can the indexing tier push to disk per second in aggregate (tricky, can involve contention with searches). This means that the ability of tailing to read from large NUMBERS of files typically isn't relevant, as with very large installations the above problems are hit first.
This means that when data is "spotty", almost always this means that the system as a whole is not able to keep up with the aggregate data rate, not that there is a problem with tailing (file monitoring) itself. Of course we need to look at the system diagnostics to confirm this, but file count would not be the first guess.
Aside: As of 6.3 Splunk Enterprise (and other downstream splunkd products) Splunk has implemented parallel pipelines, allowing use of more cpu to process data, which should lower some of those classic cpu bottlenecks, at the price of higher total core count in use for incoming data.
Regardless of those truths, tailing can still get into trouble in two sorts of situations:
Situation 1, uncached metadata in the hundreds of thousands:
In brief, if the files are stored somewhere that the current size and time can remain "warm", like in caches, then 10k to 100k files can work okay, but if files are somewhere this cannot be done, like NFS, the metadata requests are likely too demanding at file numbers like 100k.
Situation 2, millions of files:
Into the millions of monitored files, either per-file memory cost of tracking the files, or the I/O cost of retreiving the curent size or time of the files will become too large.
Situation 1 in detail:
Tailling checks frequently for whether files have changed, in order to queue those files to be read promptly. Because the goal is to achieve near-realtime in well-maintained conditions, the max intended wait time between checks is around a second. (There are other timers for error conditions, files currently open, etc). This means a minimum intended 10,000 to 100,000 (relative to file count) stat() calls for unix or GetFileInfo calls for windows every second. If the backing data (the file size and time) information is in memory, this isn't a big problem. However, if the files are stored on a storage system that cannot cache the information locally (some types of networked or clustered filesystems) that may result in a unworkably high number of I/O operations per second.
The file monitoring code will gracefully degrade if it cannot achieve this intended schedule of file checks. The files will still be checked, and changed files will still be read from. However, not meeting that intended rate of file checks will typically mean that the storage system is being significantly taxed by the random IOPS of metadata retreival. This will lessen its capability to provide the practical file data, and on a shared-function storage system could introduce contention with other applications as well.
Situation 2 in details:
It may be surprising, but it's unavoidably necessary for the file monitoring component to keep some amount of memory in use for every discovered file (even files you rule out with controls like IgnoreOlderThan). This means that as the number of files grows to infinity, the ram needed by file monitoring will too. In practice, 100k files will use a significant amount of ram (perhaps hundreds of megabytes), but millions of files will easily reach multiple gigabytes. That can be accomodated (with grumbling perhaps), but at some point, e.g. 100million files, it will not work.
More likely to be a problem is the same issue from Situation 1. With millions of files, it becomes more likely that the operating system's cacheing logic will not keep all the file status information in ram, or simply that the system calls to request that information will become too much of a cpu bottleneck for the system to perform smoothly.
If there is a true need (especially if growing), to monitor a set of multi-millions of files from one installation, please present that case very actively through both support and sales channels.
... View more