There is a bug in 4.1.3 and earlier builds that is triggered by slow NFS stat() calls - that much we know for sure. If your Splunk instance just stops indexing data, or stops picking up new files, then it's highly likely you are hitting this bug. Splunk should never stop indexing data from monitored files and discovering new files as long as new data is available.
Increasing the max_fd setting increases the number of FDs we can keep open for files we've finished reading. Those files remain open until they have been idle for time_before_close seconds. This doesn't mean that Splunk will run faster if you set it to a huge number; it just means that files stay open longer, so we may pick up new data from them more quickly. If your files aren't updated once they've been read, this is of little use to you.
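For illustration only, here is roughly where those two settings live - a minimal sketch, assuming a Unix-style install under $SPLUNK_HOME; the monitored path and the numeric values are made-up examples, not recommendations:

    # $SPLUNK_HOME/etc/system/local/limits.conf
    [inputproc]
    # allow more already-read files to stay open at once (example value)
    max_fd = 256

    # $SPLUNK_HOME/etc/system/local/inputs.conf
    [monitor:///var/log/myapp]
    # wait 30 seconds after the last data is seen before closing the file (example value)
    time_before_close = 30

As noted above, raising these only pays off if the files you've already read are likely to receive new data before they are closed.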
Theoretically, there should be no limit to the number of files that Splunk can monitor. We depend on the OS to tell us which files have changed, so if the OS knows, Splunk will know too. An important distinction here is a 'live' file vs. a 'static' file. Live files are currently being updated/written to, and Splunk will keep picking up new data from them as long as data keeps being added. Static files should be indexed once and then ignored indefinitely.
The concept of 'real-time' can also play a major role here. How quickly do you want Splunk to pick up data once it's written to a file? If your requirement is that Splunk should display the data as quickly as possible, then you want to limit the number of live files that Splunk is monitoring. If you have 20,000 live files, then Splunk will not be able to keep every single file up to date at the same time. However, if you have 20,000 files and only 50 of them are live, then Splunk should easily be able to keep up - are we starting to make sense yet?
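If you do need to shrink the set of files a monitor stanza watches, one common approach (sketched here with a made-up path and patterns - adapt them to your own layout) is to filter it with whitelist/blacklist regexes in inputs.conf so rotated or archived copies never count as live files:

    # $SPLUNK_HOME/etc/system/local/inputs.conf
    [monitor:///var/log/myapp]
    # only tail files ending in .log
    whitelist = \.log$
    # skip compressed or rotated copies, which are static once written
    blacklist = \.(gz|zip|bz2)$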
Our testing has shown that when tailing 10,000 live files, you can expect somewhere between 30 seconds and 1 minute of lag. That lag should increase roughly linearly with the number of files, so at 20,000 files you can expect a 1 - 2 minute delay.
There's no hard and fast rule here; the speed of Splunk depends on the hardware resources available and how the instance is tuned - number of CPU cores, number of indexing threads, number of FDs, speed of the disk Splunk is writing to, data segmentation, etc. The faster Splunk can write to the index, the faster it can pull new data from files. Testing is the only true way to gauge the expected performance of your hardware, with your data.