Mark, an exact recommendation depends on a few more details about your setup. We recommend against monitoring many hundreds of thousands of files in a single Splunk instance, as the current implementation can be heavy on memory usage - although if you're on a very beefy server, you may not care much about that. Scaling up to two million actively monitored files in a single instance is an untested scenario, so hopefully your data is arranged in such a way that we can ingest it in batches.
Before we get to that: you're correct that consolidating your syslog-ng files is a smart move, since monitoring fewer files is generally better. However, it's important not to lump different types of log data into a single file. Keep apache, sendmail, etc. in separate files, but you can certainly combine the streams coming from different hosts running the same application type. This will allow you to continue using sourcetypes effectively.
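To make that concrete, here's a rough sketch of what the inputs.conf monitor stanzas might look like once the files are consolidated per application type (the paths are hypothetical placeholders for wherever your syslog-ng writes):

```ini
# Hypothetical paths - adjust to your actual syslog-ng output layout.
# One consolidated file per application type, each with its own sourcetype,
# even though each file contains the merged stream from many hosts:

[monitor:///var/log/consolidated/apache.log]
sourcetype = access_combined

[monitor:///var/log/consolidated/sendmail.log]
sourcetype = sendmail_syslog
```

The key point is one sourcetype per stanza, so search-time extraction still works cleanly regardless of how many hosts feed each file.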
For the main issue with millions of log files, some questions come to mind:
Growth is expected to be upwards of 2 million files per month - are we only going to be monitoring new files, or do you also want to index the 2M files from April, the 2M from March, and so on?
What is the topology here? What can we expect in terms of data transport from the original log source to the Splunk indexer? Are these 2M files distributed amongst various server instances, where you expect to run the Universal Forwarder on each server? Or are they instead being collected centrally, and you're looking to index the logs over NFS? If it's the latter, we may want to use the [DESTRUCTIVE] "sinkhole" input method and copy the logs over in batches.
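If the sinkhole route ends up making sense, the stanza would look roughly like this (directory name is a made-up example). The critical caveat: sinkhole mode deletes each file after indexing it, so you copy batches of logs into the drop directory rather than pointing it at the originals:

```ini
# DESTRUCTIVE: with move_policy = sinkhole, Splunk indexes each file
# in this directory and then DELETES it. Only ever copy data here;
# never point this at the original log files.

[batch:///splunk_batch_drop]
move_policy = sinkhole
```

This keeps the file-tracking memory footprint bounded, since Splunk never has to remember millions of already-read files - they're gone once ingested.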
Is there a directory hierarchy for the 2M files that will allow us to efficiently blacklist known old data? "ignoreOlderThan" certainly helps speed up file tracking, but there is still the startup-time cost of gathering each file's metadata, and each blacklisted file is still tracked, so the memory is still being used. Blacklisting an entire subdirectory is much more efficient, as we simply avoid recursing into the directory at all.
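For example, if the files happen to be organized by year/month (a hypothetical layout - adjust the blacklist regex to whatever your hierarchy actually looks like), the stanza could combine both techniques:

```ini
# Hypothetical layout: /logs/<year>/<month>/...
[monitor:///logs]
# Skip individual files whose modification time is older than 30 days:
ignoreOlderThan = 30d
# Skip entire known-old subtrees - blacklisted directories are never
# recursed into, so their files cost neither startup time nor memory.
# blacklist is a regex matched against the full path:
blacklist = ^/logs/(2009|2010)/
```

The directory-level blacklist is what saves you at this scale; ignoreOlderThan alone still has to stat and track every file it skips.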
The above details are important for any large-ish deployment - once we have a better picture of your scenario, we can provide a more concrete list of steps to get your data flowing. And if you're experimenting on your own, you may want to have a look at this script to get an idea of what the Tailing Processor is doing at any given moment: http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/
Amrit