Getting Data In

Moving files and folders inputs to heavy forwarder

Contributor

Hi Splunkers,

We collect logs with syslog and then point Splunk at them using Files & Directories inputs. All inputs were located on the indexer (single-node deployment).
We then deployed another node as a heavy forwarder (HF), partly so that we could move the inputs there.
Each folder holds logs from a particular asset, with data separated by date (a deep directory structure).
Previously we moved about 30 inputs, and that worked quickly and without problems. Now we have moved around 700 inputs there.
To avoid a license violation (Splunk could potentially re-index all the old logs), we added the setting ignoreOlderThan = 1d to each input stanza.
After restarting Splunk on the HF node, it takes a long time to start forwarding events to the indexer.
As I understand it, Splunk re-scans the whole file structure in order to apply the ignoreOlderThan policy.
Question: how can we improve this? What can we change in the configuration to speed up processing and forwarding of data when Splunk restarts on the HF?

Communicator

Hi @evelenke

Your setup sounds quite similar to ours. We collect syslog with rsyslog and write it into a structured folder system like /var/log/rsyslog-splunk/uc////logfile.log

We also have several hundred files on our servers. First of all, make sure that you configure the ulimits as described by @lakshman239 (https://docs.splunk.com/Documentation/Splunk/7.2.4/Installation/Systemrequirements)

We also removed the thruput limit in limits.conf by setting maxKBps = 0 (no limit):

[thruput]
maxKBps = <integer>
* If specified and not zero, this limits the speed through the thruput processor
  to the specified rate in kilobytes per second.
* To control the CPU load while indexing, use this to throttle the number of
  events this indexer processes to the rate (in KBps) you specify.
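
As a minimal sketch, the override described here could look like the following in a local limits.conf on the heavy forwarder. The file location (for example $SPLUNK_HOME/etc/system/local/ or an app's local directory) is an assumption, since the post does not say where the setting was placed.

```ini
# limits.conf on the heavy forwarder (location is an assumption)
[thruput]
# 0 means no throughput cap; the limit only applies when the value is not zero
maxKBps = 0
```

With no cap, the forwarder sends as fast as it can read, so it is worth watching CPU and network load on the HF and the indexer after the change.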

The second change was raising the number of ingestion pipelines to 4 in server.conf (parallelIngestionPipelines = 4):

parallelIngestionPipelines = <integer>
* The number of discrete data ingestion pipeline sets to create for this
  instance.
* A pipeline set handles the processing of data, from receiving streams
  of events through event processing and writing the events to disk.
* An indexer that operates multiple pipeline sets can achieve improved
  performance with data parsing and disk writing, at the cost of additional 
  CPU cores. 
* For most installations, the default setting of "1" is optimal. 
* Use caution when changing this setting. Increasing the CPU usage for data 
  ingestion reduces available CPU cores for other tasks like searching.
* NOTE: Enabling multiple ingestion pipelines can change the behavior of some
  settings in other configuration files. Each ingestion pipeline enforces 
  the limits of the following settings independently:
    1. maxKBps (in the limits.conf file)
    2. max_fd (in the limits.conf file)
    3. maxHotBuckets (in the indexes.conf file)
    4. maxHotSpanSecs (in the indexes.conf file)
* Default: 1
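
For illustration, a sketch of how this setting might be written in server.conf on the heavy forwarder. The value 4 is the one used in this post, not a general recommendation, and it assumes the HF has enough spare CPU cores for the extra pipeline sets.

```ini
# server.conf on the heavy forwarder
[general]
# Number of ingestion pipeline sets; each one uses additional CPU cores
parallelIngestionPipelines = 4
```

Per the note in the spec above, maxKBps and max_fd are then enforced per pipeline set rather than once for the whole instance.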

Contributor

Thank you, we will try this during the next change window.


SplunkTrust

I assume you have set up ulimits as per Splunk's recommendations on the indexer and the heavy forwarders.
https://docs.splunk.com/Documentation/Splunk/7.2.4/Installation/Systemrequirements

Also, once the old files have been read and indexed, you can either delete them or archive them to a different folder (or under a different name), so they are not read again.

Additionally, you can start by reading only a certain number of directories or files (using a name pattern or regex) and, after they are indexed, add further files and folders in stages. You may need more than one monitor stanza for this.
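
A rough sketch of such a staged setup in inputs.conf is shown here. The directory names, the split into two stanzas, and the file-name pattern are all hypothetical; adjust them to your own folder layout.

```ini
# inputs.conf on the heavy forwarder (paths and patterns are hypothetical)

# Stage 1: enabled first
[monitor:///var/log/remote/firewalls]
# whitelist is a regex matched against the full file path
whitelist = \.log$
ignoreOlderThan = 1d

# Stage 2: set disabled = 0 once stage 1 has caught up
[monitor:///var/log/remote/switches]
whitelist = \.log$
ignoreOlderThan = 1d
disabled = 1
```

Enabling the stanzas one group at a time keeps the number of files that Splunk has to scan and catch up on after each restart smaller.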


Contributor

Thank you, we will try this during the next change window.
