Getting Data In

Is there a limit on the number of files Splunk can monitor?

Builder

Is there a limit on the number of files Splunk can monitor? Say, for example, I have a directory with 100k+ files. Is it reasonable for me to expect indexing latency or missing-event issues?

1 Solution

Builder

Technically, there is no limit on the number of files that one Splunk instance can monitor. We will read from disk as fast as the underlying storage will let us (assuming maxKBps=0 and a healthy, unsaturated indexing tier). That being said, the real-world limiting factor is how fast we can read from disk and/or forward data to the indexing tier. If these files never grow past 20MB, we will always use the tailing processor to read them. Once they grow beyond 20MB, they get deferred to the batch reader. In this case, if all 100k files are being constantly updated, tailingproc/batchreader can get stuck reading a file. This is because, by default, both batch and tailingproc will read until EOF and then wait 3 seconds (time before close) before switching to the next file in the queue. So you can very easily end up stuck reading one file for a long time and fail to read other files before they are rotated or deleted. You can use the tailing processor REST endpoint to see which files are currently being read and which are in the queue.
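To put a number on that dwell time, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which files are visited one after another and each visit costs at least the 3-second time-before-close wait; the function name and the `readers` parameter are hypothetical, not Splunk internals, and actual read time comes on top of this lower bound.

```python
def min_pass_seconds(active_files: int,
                     time_before_close: float = 3.0,
                     readers: int = 1) -> float:
    """Lower bound on the time for one pass over all active files.

    Simplified model (an assumption): every file costs at least
    `time_before_close` seconds of waiting at EOF, and the work is
    split evenly across `readers` independent readers.
    """
    if readers < 1:
        raise ValueError("readers must be at least 1")
    return active_files * time_before_close / readers


if __name__ == "__main__":
    hours = min_pass_seconds(100_000) / 3600
    print(f"{hours:.1f} hours")  # prints "83.3 hours"
```

Under this model, 100k constantly updated files need at least 300,000 seconds per pass, which is far longer than most log rotation intervals.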

(pre 6.3 - I will add more details about new features in 6.3 shortly)
Based on experience, 100k files being monitored by a single instance will always lead to these kinds of issues and/or high indexing latency. I would highly recommend that the customer split the monitoring workload between at least 2 instances.

Also, you can check maxKBps in limits.conf and make sure it is tuned to an acceptable value. We will log warn messages about hitting maxKBps, as this cap indirectly throttles how fast we are reading files.
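As a rough illustration of how much a throughput cap matters, the Python sketch below estimates how long a backlog takes to forward under a given maxKBps. The helper name, the 1 MiB-per-file backlog, and the 256 KBps cap are illustrative assumptions, and it treats 1 KB as 1024 bytes.

```python
def drain_seconds(backlog_bytes: int, max_kbps: int) -> float:
    """Seconds needed to forward `backlog_bytes` under a maxKBps cap.

    A cap of 0 means "no throttle", so the throttle itself adds no delay
    in this model (disk and network limits are ignored).
    """
    if max_kbps < 0:
        raise ValueError("max_kbps must be >= 0")
    if max_kbps == 0:
        return 0.0
    return backlog_bytes / (max_kbps * 1024)


if __name__ == "__main__":
    # Assumed example: 100k files with 1 MiB of unread data each.
    backlog = 100_000 * 1024 * 1024
    print(f"{drain_seconds(backlog, 256) / 3600:.1f} hours")  # prints "111.1 hours"
```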

Additional comments from jrodman

Typically, it's true that at very large numbers of files, the aggregate data rate becomes a problem first: how much data the forwarder can digest per second (CPU limited), how fast it can transmit over the network (sometimes CPU limited, sometimes network limited), or, most commonly, how fast the indexing tier can push to disk per second in aggregate (tricky, can involve contention with searches). This means that the ability of tailing to read from large NUMBERS of files typically isn't relevant, as with very large installations the above problems are hit first.

So when data is "spotty", it almost always means that the system as a whole is not able to keep up with the aggregate data rate, not that there is a problem with tailing (file monitoring) itself. Of course we need to look at the system diagnostics to confirm this, but file count would not be the first guess.

Aside: As of 6.3, Splunk Enterprise (and other downstream splunkd products) has implemented parallel pipelines, allowing use of more CPU to process data, which should lower some of those classic CPU bottlenecks, at the price of a higher total core count in use for incoming data.

Regardless of those truths, tailing can still get into trouble in two sorts of situations:

Situation 1, uncached metadata in the hundreds of thousands:

In brief, if the files are stored somewhere that the current size and time can remain "warm", like in caches, then 10k to 100k files can work okay, but if files are somewhere this cannot be done, like NFS, the metadata requests are likely too demanding at file numbers like 100k.

Situation 2, millions of files:

Into the millions of monitored files, either the per-file memory cost of tracking the files or the I/O cost of retrieving the current size or time of the files will become too large.

Situation 1 in detail:

Tailing checks frequently for whether files have changed, in order to queue those files to be read promptly. Because the goal is to achieve near-realtime in well-maintained conditions, the max intended wait time between checks is around a second. (There are other timers for error conditions, files currently open, etc.) This means a minimum intended 10,000 to 100,000 (relative to file count) stat() calls for Unix or GetFileInfo calls for Windows every second. If the backing data (the file size and time information) is in memory, this isn't a big problem. However, if the files are stored on a storage system that cannot cache the information locally (some types of networked or clustered filesystems), that may result in an unworkably high number of I/O operations per second.

The file monitoring code will gracefully degrade if it cannot achieve this intended schedule of file checks. The files will still be checked, and changed files will still be read from. However, not meeting that intended rate of file checks will typically mean that the storage system is being significantly taxed by the random IOPS of metadata retrieval. This will lessen its capability to provide the practical file data, and on a shared-function storage system could introduce contention with other applications as well.
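One way to get a feel for the metadata cost on a given storage system is to time a single stat() pass over the files yourself. The Python sketch below is only a rough probe, not how splunkd measures anything: the function name is hypothetical, it looks only at files directly inside one directory, and a second run will usually be faster because the metadata is then cached.

```python
import os
import tempfile
import time
from typing import Tuple


def stat_pass_seconds(directory: str) -> Tuple[int, float]:
    """Stat every regular file directly under `directory` once.

    Returns (file count, elapsed seconds for the stat() calls only).
    """
    paths = [entry.path for entry in os.scandir(directory)
             if entry.is_file(follow_symlinks=False)]
    start = time.perf_counter()
    for path in paths:
        os.stat(path)
    return len(paths), time.perf_counter() - start


if __name__ == "__main__":
    # Demo on a throwaway directory; point it at the real location to probe it.
    with tempfile.TemporaryDirectory() as demo_dir:
        for i in range(1000):
            open(os.path.join(demo_dir, f"f{i}.log"), "w").close()
        count, elapsed = stat_pass_seconds(demo_dir)
        print(f"{count} files checked in {elapsed:.4f}s")
```

If one pass over the monitored files takes much longer than about a second, the intended once-a-second check schedule described above cannot be met on that storage.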

Situation 2 in detail:

It may be surprising, but it's unavoidably necessary for the file monitoring component to keep some amount of memory in use for every discovered file (even files you rule out with controls like ignoreOlderThan). This means that as the number of files grows to infinity, the RAM needed by file monitoring will too. In practice, 100k files will use a significant amount of RAM (perhaps hundreds of megabytes), but millions of files will easily reach multiple gigabytes. That can be accommodated (with grumbling perhaps), but at some point, e.g. 100 million files, it will not work.
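The scaling is simple multiplication, sketched below in Python. The 2 KiB per-file figure is purely an assumption picked so that 100k files land in the "hundreds of megabytes" range mentioned above; the real per-file cost is not stated here and will vary.

```python
def tracking_memory_bytes(file_count: int, bytes_per_file: int = 2048) -> int:
    """Estimated RAM for tracking `file_count` discovered files.

    `bytes_per_file` is a hypothetical per-file cost, not a measured value.
    """
    return file_count * bytes_per_file


if __name__ == "__main__":
    for n in (100_000, 1_000_000, 100_000_000):
        gib = tracking_memory_bytes(n) / 2**30
        print(f"{n:>11,} files -> {gib:.2f} GiB")
```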

More likely to be a problem is the same issue from Situation 1. With millions of files, it becomes more likely that the operating system's caching logic will not keep all the file status information in RAM, or simply that the system calls to request that information will become too much of a CPU bottleneck for the system to perform smoothly.

If there is a true need (especially a growing one) to monitor a set of multi-millions of files from one installation, please present that case very actively through both support and sales channels.

Splunk Employee

If you actually have a single directory containing 100k files in it directly, there are many filesystems which will themselves degrade horribly. The process of asking the operating system for the list of the 100k contained files can be exhaustively expensive on some systems. I hope that's not the proposed situation.

(Hashed directories are the typical solution for making this not break, and some systems can handle it.)
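For readers who haven't met hashed directories, here is a minimal Python sketch of the idea: the application writing the files derives a couple of subdirectory levels from a hash of the file name, so no single directory holds more than a small share of the files. The function name, the use of SHA-1, and the two-level layout are illustrative assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root: str, filename: str, levels: int = 2) -> PurePosixPath:
    """Map `filename` to root/<xx>/<yy>/filename using hex pairs of its hash.

    Two levels of two hex characters give 256 * 256 = 65,536 buckets.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    buckets = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return PurePosixPath(root, *buckets, filename)


if __name__ == "__main__":
    # Hypothetical root and file name, for illustration only.
    print(hashed_path("/var/log/app", "host42.log"))
```

The same file name always maps to the same subdirectory, so writers and readers agree on the location without any lookup table.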

Splunk Employee

I would caveat that there are design limitations with the current tailing system that will probably prevent monitoring a pool of files effectively in the millions range.

New Member

There is likely to be an OS limit on how many files Splunk (or the Splunk Forwarder) can have open at any one time. Don't forget to refer to the OS documentation in order to raise the limit for the user Splunk is running under. In Linux land this would mean editing /etc/security/limits.conf for both root and the user, in addition to restarting the host.
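As a quick check on Unix-like systems, the open-file limit can be read with Python's standard `resource` module. Note the assumption here: this sketch reports the limit of the process running the script, so it only reflects what splunkd would get if it is run as the same user in a comparable session. On Linux, /proc/<pid>/limits shows the limits of an already running process.

```python
import resource


def open_file_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # resource.RLIM_INFINITY means the limit is unlimited.
    print(f"soft={soft} hard={hard}")
```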

Splunk Employee

That's true, but I'm not speaking of the number of open files, but of the number of files that Splunk has discovered in the configured monitored location.

Splunk Employee

Are you sure batchreader implements time-before-close itself? It should simply hand the file back to tailing, but I don't know for sure what it does. The batchreader/tailing design last I looked was under review so I don't have the complete authority for the current 6.3 state.

Builder

Last I checked with amrit, yes. Batch and tail, I was told, are identical besides the fact that we implement a size limit for batch.

Splunk Employee

[batch://] inputs and [monitor://] inputs are both handled by tailing & batchreader. batchreader handles the over-20MB-to-read files for both batch:// and monitor://. However, batchreader is a distinct component, which is what you and I both wrote about, and it does not implement the full set of logic.

This is all going to be soon out of date anyway, as we're moving to a model where we have the finder-thread and multiple reader threads (one per pipeline), so the concept of "batch reader" should more or less go away.

Builder

CarlosDanger approves of this message.
