Getting Data In

How can I improve the performance of Splunk monitoring hundreds of active files?

hulahoop
Splunk Employee

When Splunk monitors hundreds or thousands of files, there seems to be a long lag between the time an event is generated and the time Splunk indexes the event and makes it searchable. In the worst cases, this lag can run to 15 minutes or more. What can I do to increase indexing throughput in this scenario?

1 Solution

hulahoop
Splunk Employee

When installing Splunk, the default settings may not account for usage outside the norm. Monitoring hundreds or thousands of active files falls into this category.

There are two settings you can adjust in limits.conf to increase indexing throughput when a large number of active files is involved:

[inputproc]

max_fd = <integer>
* Maximum number of file descriptors that Splunk can use in the Select Processor.
* The maximum value honored is half the current number of allowed file descriptors per process. (ulimit -n /setrlimit NOFILES)
* If a value chosen is higher than the maximum allowed value, the maximum value is used instead.
* Defaults to 32.

time_before_close = <integer>
* Modtime delta required before Splunk can close a file on EOF.
* Tells the system not to close files that have been updated in past <integer> seconds.
* Defaults to 5.

For example, these settings can increase the number of files Splunk actively monitors while reducing the rate at which Splunk recycles file descriptors:

[inputproc]
max_fd = 256
time_before_close = 2
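
As a point of reference, the maximum value honored for max_fd is half the per-process file descriptor limit; assuming the common Unix default of 1024 (ulimit -n), anything above 512 would be silently capped, so 256 sits comfortably under that ceiling. Note also that limits.conf changes generally take effect only after the Splunk instance is restarted.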

A more in-depth discussion of Splunk’s file monitoring system follows.

To understand Splunk file monitoring, it is useful to know the following:

  • Each Splunk instance has a single monitoring thread
  • One file descriptor is used per source
  • File descriptors are recycled once EOF is reached
  • The default number of file descriptors used by Splunk is 32 (in limits.conf: max_fd = 32)
  • On most Unix systems, the maximum number of fds allocated to a single process is 1024

Splunk monitors files using a sliding window. At startup, Splunk creates the configured number of file descriptors in order to save the overhead of repeatedly opening and closing fds. From this pool of fds, Splunk begins monitoring the configured data inputs. When an fd reaches EOF, it is returned to the pool and immediately begins monitoring the next source in the queue.
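
To make the effect concrete (the file count here is illustrative): with the default max_fd = 32 and roughly 1,600 actively written files, at most 32 files are being read at any moment. Each descriptor reads its file to EOF, is held until the file has gone time_before_close seconds without being modified, and is only then recycled to the next pending source. A given file may therefore wait behind most of the other ~1,570 sources before it is revisited, which is what surfaces as the multi-minute indexing lag described in the question.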

In past versions, Splunk created one thread per source. The overhead of managing the threads and context switching defeated the performance gains of monitoring files in parallel. Ultimately, Splunk is still constrained by I/O. By using a single thread, the context switching can be avoided and Splunk can better maximize the I/O throughput.

The number of file descriptors and per-descriptor throughput are inversely related: the more fds in use, the lower the throughput per file descriptor. Increasing max_fd beyond a certain point therefore yields diminishing returns. We believe that point to be about 256.

Please note: File monitoring improvements in Splunk 4.1 will deliver a significant performance increase. It is not yet clear whether this tuning will still be required in 4.1.

Also note: This tuning does not affect indexing of gzip files; Splunk handles gzip files sequentially. If you have many gzip files, consider uncompressing them first so they can take advantage of the parallel file monitoring described above.


gfriedmann
Communicator

In my case, we have about 1600 actively written files for a syslog archive. About 30GB / day to disk.

I think it may be considered a "bad" practice, but I avoid the extra disk read I/O and CPU overhead by sending that data to Splunk in a single TCP syslog pipe. I use a transform to extract the host from the standard message itself, and other transforms to assign sourcetypes as needed.
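
For anyone who wants to copy this approach, here is a minimal sketch of the host-override transform. The stanza names and the regex are hypothetical and must be adapted to your own TCP input and syslog message format:

# props.conf on the instance receiving the TCP syslog stream
[source::tcp:514]
TRANSFORMS-set_host = syslog_host_override

# transforms.conf
[syslog_host_override]
# Illustrative regex: captures the hostname field of a
# "MMM dd HH:MM:SS hostname ..." style syslog line
REGEX = ^\w{3}\s+\d+\s+\d+:\d+:\d+\s+(\S+)
DEST_KEY = MetaData:Host
FORMAT = host::$1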

I think there are two main caveats with this approach.

  1. You lose the ability to auto-sourcetype by individual file source (for syslog sources)
  2. If Splunk goes down or restarts, you only have as much buffer as your syslog forwarder can provide. This may be no buffer at all, an in-RAM queue, or some other auto-handled spooling mechanism.

I don't mind assigning sourcetypes as needed because I got tired of the cryptic and inconsistent auto-sourcetype names for sources that had low log volumes. We still collect non-syslog files too.

And also in my environment, nobody cries if we miss a few events here or there.

I hope this answer also helps someone.

0 Karma

saranya_fmr
Communicator

Thank you @hulahoop, yes, it did answer my query 🙂

0 Karma

saranya_fmr
Communicator

Hi @hulahoop,

1) Is this update of limits.conf done on the forwarder or the indexer?

[inputproc]
max_fd = 256
time_before_close = 2

2) If it is the forwarder, can this update of limits.conf be done via a deployment app, and will that override the value in /etc/system/default/limits.conf?
OR
Should I update it in $SPLUNK_HOME/etc/system/local/limits.conf?

0 Karma

sloshburch
Splunk Employee

Remember that the finer points of tuning are best addressed with Splunk Support, who can explore the larger context of what you are trying to achieve and provide the most targeted recommendation.

0 Karma

hulahoop
Splunk Employee

Hello, the config should be applied on the instance that is collecting the data. This is usually the forwarder.

Second, as a best practice, no config should be updated or edited in the default folder. You can use the Deployment Server to propagate the settings to the local folder or an app folder.
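
For example (the app name here is hypothetical), the settings could be shipped from the deployment server as

$SPLUNK_HOME/etc/deployment-apps/fd_tuning/local/limits.conf

which lands on the forwarder as

$SPLUNK_HOME/etc/apps/fd_tuning/local/limits.conf

Settings in an app's local folder take precedence over those in $SPLUNK_HOME/etc/system/default, so the default limits.conf never needs to be edited.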

Does this answer your questions?

hexx
Splunk Employee

The "ignoreOlderThan" inputs.conf parameter introduced in 4.2 deserves a mention :

ignoreOlderThan = <non-negative integer>[s|m|h|d]

See inputs.conf.spec for more.
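
A minimal sketch of how it might be used (the path and time window are illustrative): files in the monitored directory whose modification time is older than the window are skipped entirely, which keeps descriptors free for the files that are actually active.

[monitor:///var/log/syslog-archive/*]
ignoreOlderThan = 7d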

0 Karma