Getting Data In

Why is ingestion from Splunk UFs to Splunk Heavy Forwarders suddenly and significantly slow?

Builder

Hi,
Looking for advice on troubleshooting the cause of an issue we are experiencing, and how to solve it.

We have a few Splunk UFs monitoring a large number of big files, forwarding to our 4 load-balanced Heavy Forwarders.
This setup was working until last week, when files started ingesting with a big delay, 3-6 hours depending on size. Previously it took minutes.

To the best of our knowledge, there were no network, OS, or Splunk-related changes on the day we started experiencing the issue.

We tried:
1. Restart Splunk process on Splunk UF servers
2. Reboot the servers with Splunk UF
3. Per Splunk support, we changed server.conf on the Splunk UF servers by adding parallelIngestionPipelines and queue sizes:

[general]
parallelIngestionPipelines = 2
[queue]
maxSize = 1GB
[queue=aq]
maxSize = 20MB
[queue=aeq]
maxSize = 20MB
4. Per Splunk support, we modified limits.conf by adding max_fd (thruput was already set to unlimited):

[thruput]
maxKBps = 0
[inputproc]
max_fd = 200

None of the above fixed the issue.
Maybe you have experienced a similar issue; it would be great to know how it was solved.
Any advice will be appreciated!

0 Karma
1 Solution

Builder

@codebuilder, @woodcock, @lakshman239
Just an update on how the issue was solved in our case.
After ruling out any Splunk issue or configuration as the cause, our network engineer made configuration changes to the replication between our data centers, in addition to changing the WAN routing preference.

View solution in original post

0 Karma


Esteemed Legend

Ah, so slow network. Yes, that will kill things. Please do click Accept on your answer here to close the question.

0 Karma

Esteemed Legend

We see this problem all the time, and it is usually due to way too many files being co-resident with the files you are monitoring. This typically happens when there is no housekeeping, or a very lax policy for deleting files as they rotate. Yes, even if you are not monitoring the rotated files, they will eventually slow the forwarder to a crawl. It usually starts when you have hundreds of files, and you are crippled by the time you reach thousands. If you cannot delete the files that are old and done, you can instead create soft links to the fresh files in another directory and monitor that. Let me know if you need details on how to do that.
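The soft-link workaround described above can be sketched roughly like this; the directory paths and the one-day freshness threshold are assumptions, not part of the original answer:

```shell
#!/bin/sh
# Sketch: keep a separate directory containing links only to files still
# being written, so the path the UF monitors stays small even when the
# source directory accumulates thousands of rotated files.
SRC=/var/log/app            # crowded directory with many rotated files
MON=/var/log/app-monitored  # small directory the UF actually monitors

mkdir -p "$MON"

# Link files modified within the last day into the monitored directory.
find "$SRC" -maxdepth 1 -type f -mtime -1 -exec ln -sf {} "$MON"/ \;

# Remove links whose targets have since been deleted or rotated away.
find "$MON" -maxdepth 1 -type l ! -exec test -e {} \; -delete
```

Run it from cron so the monitored directory tracks only the fresh files.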

0 Karma

Motivator

When you added parallelIngestionPipelines to server.conf on the forwarders, did you also update the indexers? The default value is 1, so increasing the value on the forwarders without increasing it on the indexers will gain you no performance increase.

Also, have you checked the ulimit settings for the Splunk user and/or daemon? If not, you may want to check those, especially the open files limit. The OS default is generally 1024, which is way too low for Splunk.

0 Karma

Builder

@codebuilder,

1) All our indexers are in Splunk Cloud, so we don't have access to them. We'll have to check the parallelIngestionPipelines value with Cloud Support.

2) Regarding the ulimit value for open files: during the first few days of the issue, "ulimit -n" showed 64000 on the Splunk UF server.
At some point we rebooted it, and for some reason it dropped to 1024 after the reboot.
Per our Unix sysadmin, it is set to 64000 at the system level in /etc/security/limits.conf for our splunk user.

0 Karma

Motivator

I suspected that might be the case. Your ulimit configuration was not honored and reverted to the OS defaults on reboot (expected behavior).

Depending on your OS flavor and version, there are a number of methods to resolve this.
You can create a Splunk-specific config by creating a file under /etc/security/limits.d/ and naming it with a number higher than any that exist there now, e.g. 90-splunk.conf.
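A minimal sketch of such a file, assuming the service account is named "splunk" (adjust the user name and values to your environment):

```
# /etc/security/limits.d/90-splunk.conf
splunk  soft  nofile  64000
splunk  hard  nofile  100000
splunk  soft  nproc   8192
splunk  hard  nproc   16000
```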

Or you can add the limits directly to the start function in the init.d script, like so:

cat /etc/init.d/splunk

splunk_start() {
  ulimit -Sn 64000
  ulimit -Hn 100000
  ulimit -Su 8192
  ulimit -Hu 16000
  echo "Starting Splunk..."
  "/opt/splunk/bin/splunk" start --no-prompt --answer-yes
  RETVAL=$?
  [ $RETVAL -eq 0 ] && touch /var/lock/subsys/splunk
}

In either case, you'll need to restart Splunk to pick up the "new" limits.
You can also set them via systemd, but depending on your version of Splunk this can be a pain. I prefer to just drop them in the init.d script; it has proven the most reliable.
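For reference, the systemd route would use a drop-in override instead of ulimit calls. This is a sketch assuming the unit is named Splunkd.service (the name Splunk uses when boot-start is enabled as systemd-managed); verify your actual unit name with systemctl first:

```
# /etc/systemd/system/Splunkd.service.d/limits.conf
[Service]
LimitNOFILE=64000
LimitNPROC=16000
```

Then run systemctl daemon-reload and restart the service for the limits to apply.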

0 Karma

Motivator

There are a couple of methods to verify the ulimits took effect.

Check ulimits as splunk user

su - splunk
ulimit -a

Check via PID:

ps -ef | grep -i splunk    # copy any of the PIDs in the output
cat /proc/<splunk_pid>/limits

Worth noting for you or your admin: setting ulimits via /etc/security/limits.conf is generally considered deprecated on RHEL/CentOS 7.x (or any systemd-based OS).

The preferred method is via conf files located at /etc/security/limits.d/
When the OS boots, /etc/security/limits.conf is read first, then each file under /etc/security/limits.d/ is read sequentially, and later files can and will override earlier ones (with 99 being the highest).

Meaning, any limits set in /etc/security/limits.d/99-mylimits.conf will override all previous settings. I suspect something similar happened in your case.

0 Karma

Builder

@codebuilder, thank you for the detailed reply!

ulimit -a from the command line as the splunk user shows the correct 64000 value, but splunkd.log shows that on reboot Splunk detected ulimit -n as the default 1024.
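That splunkd.log entry can be checked directly; a sketch, assuming a default UF install path and the "ulimit - Limit: open files: N files" line splunkd writes at startup:

```shell
# Pull the last "open files" limit splunkd recorded at startup.
# The log path and log-line format are assumptions; adjust for your install.
LOG=${LOG:-/opt/splunkforwarder/var/log/splunk/splunkd.log}
grep -o 'open files: [0-9]*' "$LOG" | tail -1
```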

We did use the /etc/init.d/splunk method previously.

What bothers me is that the server was rebooted before as well, yet ulimit -n was still 64000 according to splunkd.log.
So why the sudden switch to the default ulimit -n of 1024 this time?

0 Karma

Motivator

Glad to help. I fought the same issue myself previously, and only on reboots.

The underlying problem is that Splunk is running under init.d on a systemd system, and limits are applied differently there than under the older init.

The sequence in which limits are read and applied by the kernel and the process is out of sync on reboot, so it falls back to the OS defaults.

You can solve it by creating a systemd unit file for Splunk, as it should technically be configured; but placing the limits in the startup script solved the issue for me.

Also, I would consider the cat /proc/<splunk_pid>/limits method the definitive source of truth for the limits actually applied to the process. Hope this helps.

0 Karma

SplunkTrust

I think since you have a support case with Splunk, it would be good to take their advice, as they can review your config and server setup.

Having said that, large flat files go through the batch process/pipeline, and it does take a while to see them at the indexer/search head. Any chance of creating smaller files, maybe at more frequent intervals, as opposed to one or two very large files a day? Smaller files get parsed/processed quickly, and you should still be able to achieve the same expected results.
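If the producing application can't be changed, splitting the large file before it lands in the monitored directory is one hypothetical way to get smaller files; the file name, prefix, and chunk size below are placeholders:

```shell
# Break a large flat file into ~100 MB numbered pieces so the UF sees
# many small files instead of one huge one. Names are illustrative.
split -b 100m -d big_input.log chunk_
```

This produces chunk_00, chunk_01, and so on, which the forwarder can pick up independently.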

0 Karma

Builder

@lakshman239, to the best of my knowledge we cannot create smaller files, but I will verify that.

0 Karma