We collect logs via syslog and then point Splunk at them with Files & Directories inputs. All inputs were originally located on the indexer (a single-node deployment).
We then deployed another node as a Heavy Forwarder, with the goal of moving the inputs there.
Each folder contains logs from a particular asset; within each folder, data is separated by date (a deep directory structure).
Previously we moved about 30 inputs, and that worked quickly and reliably. Now we've moved around 700 inputs there.
To avoid a license violation (Splunk might otherwise re-index all the old logs), we added ignoreOlderThan = 1d to each input stanza.
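For context, a single monitor stanza on the HF might look roughly like this (the path, sourcetype, and index are placeholders, not the actual values from our deployment):

```
# inputs.conf -- hypothetical monitor stanza for one asset folder
# (path, sourcetype, and index are illustrative placeholders)
[monitor:///var/log/assets/asset01]
recursive = true
ignoreOlderThan = 1d
sourcetype = syslog
index = main
disabled = false
```

Note that ignoreOlderThan only skips files whose modification time is too old; the tailing processor still has to walk the directory tree and stat every file at startup, which is likely what makes the restart slow with ~700 deep-structured inputs.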
After restarting Splunk on the HF node, it takes a long time before events start forwarding to the indexer.
As I understand it, Splunk re-reads the entire file structure to enforce the ignoreOlderThan policy.
Question: how can we improve this process? What can we change in the configuration to speed up processing and forwarding of data after Splunk restarts on the HF?
Your setup sounds quite similar to ours. We collect syslog with rsyslog and put the logs into a structured folder system like /var/log/rsyslog-splunk/uc////logfile.log
We also have several hundred files on our servers. First of all, make sure you configure the ulimits as described by @lakshman239 (https://docs.splunk.com/Documentation/Splunk/7.2.4/Installation/Systemrequirements)
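On Linux this is typically done in /etc/security/limits.conf for the user running Splunk. A sketch, assuming the service runs as a "splunk" user (the exact values to use should be taken from the Splunk system requirements page linked above):

```
# /etc/security/limits.conf -- example raising limits for the splunk user
# (64000 open files is a commonly cited baseline; verify the current
#  recommended values against the docs page above)
splunk  soft  nofile  64000
splunk  hard  nofile  64000
splunk  soft  nproc   16000
splunk  hard  nproc   16000
```

With hundreds of monitored files, a low open-file limit is one of the first things that stalls the tailing processor.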
We also raised the throughput limit in limits.conf:

[thruput]
maxKBps = 0

From the limits.conf spec: "maxKBps = <integer> * If specified and not zero, this limits the speed through the thruput processor to the specified rate in kilobytes per second. * To control the CPU load while indexing, use this to throttle the number of events this indexer processes to the rate (in KBps) you specify."
The second change we made was raising the number of ingestion pipelines to 4 in server.conf:
parallelIngestionPipelines = 4
From the server.conf spec:

parallelIngestionPipelines = <integer>
* The number of discrete data ingestion pipeline sets to create for this instance.
* A pipeline set handles the processing of data, from receiving streams of events through event processing and writing the events to disk.
* An indexer that operates multiple pipeline sets can achieve improved performance with data parsing and disk writing, at the cost of additional CPU cores.
* For most installations, the default setting of "1" is optimal.
* Use caution when changing this setting. Increasing the CPU usage for data ingestion reduces available CPU cores for other tasks like searching.
* NOTE: Enabling multiple ingestion pipelines can change the behavior of some settings in other configuration files. Each ingestion pipeline enforces the limits of the following settings independently:
  1. maxKBps (in the limits.conf file)
  2. max_fd (in the limits.conf file)
  3. maxHotBuckets (in the indexes.conf file)
  4. maxHotSpanSecs (in the indexes.conf file)
* Default: 1
I assume you have set up ulimits per Splunk's recommendations on the indexer and the heavy forwarders.
Also, once the old files have been read and indexed, you can delete them, or archive them to a different folder or name, so they are not scanned again.
Additionally, you can monitor only a subset of the directories or files at first (using a name pattern/regex) and, after those are indexed, add the remaining files/folders in stages. You may need more than one monitor stanza for this.
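A staged rollout like this could be sketched with a whitelist on the monitor stanza (the base path and the asset naming pattern below are hypothetical):

```
# inputs.conf -- stage 1: only pick up assets whose folder name
# starts with a-f (path and pattern are illustrative placeholders)
[monitor:///var/log/assets]
recursive = true
whitelist = /asset_[a-f][^/]+/
ignoreOlderThan = 1d

# In later stages, widen the whitelist regex (e.g. to [a-m], then to
# all assets) and restart; each restart then scans a smaller backlog.
```

Since whitelist is matched against the full path, widening the regex in steps lets each restart deal with only the newly admitted folders instead of all 700 at once.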