Getting Data In

Why is ingestion from Splunk UFs to Splunk Heavy Forwarders suddenly and significantly slow?

Builder

Hi,
Looking for advice on troubleshooting the cause of an issue we are experiencing, and how to solve it.

We have a few Splunk UFs monitoring a large number of big files, forwarding to our 4 load-balanced Heavy Forwarders.
This setup was working until last week, when files started ingesting with a big delay, 3-6 hours depending on size. Previously it took minutes.

To the best of our knowledge, there were no network, OS, or Splunk-related changes on the day we started experiencing the issue.

We tried:
1. Restart Splunk process on Splunk UF servers
2. Reboot the servers with Splunk UF
3. Per Splunk support, we changed server.conf on the Splunk UF servers by adding parallelIngestionPipelines and queue sizes:

[general]
parallelIngestionPipelines = 2
[queue]
maxSize = 1GB
[queue=aq]
maxSize = 20MB
[queue=aeq]
maxSize = 20MB
4. Per Splunk support, we modified limits.conf by adding max_fd (thruput was already set to unlimited):

[thruput]
maxKBps = 0
[inputproc]
max_fd = 200

None of the above fixed the issue.
Maybe you have experienced a similar issue; it would be great to know how it was solved.
Any advice will be appreciated!

0 Karma
1 Solution

Builder

@codebuilder, @woodcock, @lakshman239
Just an update on how the issue was solved in our case.
After ruling out any Splunk issue or configuration as the cause, our network engineer made configuration changes to the replication between our data centers, in addition to changing the WAN routing preference.

View solution in original post

0 Karma


Esteemed Legend

Ah, so slow network. Yes, that will kill things. Please do click Accept on your answer here to close the question.

0 Karma

Esteemed Legend

We see this problem all the time, and it is usually due to way too many files being co-resident with the files you are monitoring. This typically happens when there is no housekeeping, or a very lax policy for deleting files as they rotate. Yes, even if you are not monitoring the rotated files, they will eventually slow the forwarder to a crawl. It usually starts when you have hundreds of files, and you are crippled by the time you reach thousands. If you cannot delete the files that are old and done, you can instead create soft links to the fresh files in another directory and monitor that. Let me know if you need details on how to do that.
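The soft-link workaround described above can be sketched roughly like this; the directory paths and the one-day freshness threshold are assumptions, not part of the original answer:

```shell
#!/bin/sh
# Sketch: keep a separate directory containing links only to files still
# being written, so the path the UF monitors stays small even when the
# source directory accumulates thousands of rotated files.
SRC=/var/log/app            # crowded directory with many rotated files
MON=/var/log/app-monitored  # small directory the UF actually monitors

mkdir -p "$MON"

# Link files modified within the last day into the monitored directory.
find "$SRC" -maxdepth 1 -type f -mtime -1 -exec ln -sf {} "$MON"/ \;

# Remove links whose targets have since been deleted or rotated away.
find "$MON" -maxdepth 1 -type l ! -exec test -e {} \; -delete
```

Run it from cron so the monitored directory tracks only the fresh files.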

0 Karma

Motivator

When you added parallelIngestionPipelines to server.conf on the forwarders, did you also update the indexers? The default value is 1, so increasing the value on the forwarders without increasing it on the indexers will gain you no performance increase.

Also, have you checked the ulimit settings for the Splunk user and/or daemon? If not, you may want to check those, especially the open files limit. The OS default is generally 1024, which is way too low for Splunk.

0 Karma

Builder

@codebuilder,

1) All our indexers are in Splunk Cloud, so we don't have access to them. We'll have to check the parallelIngestionPipelines value with Cloud Support.

2) Regarding the ulimit value for open files: during the first few days of the issue, "ulimit -n" showed 64000 on the Splunk UF server.
At some point we rebooted it, and for some reason it dropped to 1024 after the reboot.
Per our Unix sysadmin, it is set to 64000 at the system level in /etc/security/limits.conf for our splunk user.

0 Karma

Motivator

I suspected that might be the case. Your ulimit configuration was not honored and reverted to the OS defaults on reboot (expected behavior).

Depending on your OS flavor and version, there are a number of methods to resolve this.
You can create a Splunk-specific config by creating a file under /etc/security/limits.d/ and naming it with a number higher than any that exist there now, e.g. 90-splunk.conf.
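A minimal sketch of such a file, assuming the service account is named "splunk" (adjust the user name and values to your environment):

```
# /etc/security/limits.d/90-splunk.conf
splunk  soft  nofile  64000
splunk  hard  nofile  100000
splunk  soft  nproc   8192
splunk  hard  nproc   16000
```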

Or you can add the limits directly to the start function in the init.d script, like so:

cat /etc/init.d/splunk

splunk_start() {
  ulimit -Sn 64000
  ulimit -Hn 100000
  ulimit -Su 8192
  ulimit -Hu 16000
  echo "Starting Splunk..."
  "/opt/splunk/bin/splunk" start --no-prompt --answer-yes
  RETVAL=$?
  [ $RETVAL -eq 0 ] && touch /var/lock/subsys/splunk
}

In either case, you'll need to restart Splunk to pick up the "new" limits.
You can also set them via systemd, but depending on your version of Splunk this can be a pain. I prefer to just drop them in the init.d script; it has proven the most reliable.
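For reference, the systemd route would use a drop-in override instead of ulimit calls. This is a sketch assuming the unit is named Splunkd.service (the name Splunk uses when boot-start is enabled as systemd-managed); verify your actual unit name with systemctl first:

```
# /etc/systemd/system/Splunkd.service.d/limits.conf
[Service]
LimitNOFILE=64000
LimitNPROC=16000
```

Then run systemctl daemon-reload and restart the service for the limits to apply.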

0 Karma

Motivator

There are a couple of methods to verify the ulimits took effect.

Check ulimits as splunk user

su - splunk
ulimit -a

Check via PID:

ps -ef | grep -i splunk    # copy any of the PIDs in the output
cat /proc/<splunk_pid>/limits

Worth noting for you or your admin: setting ulimits via /etc/security/limits.conf is generally considered deprecated on RHEL/CentOS 7.x (or any systemd-based OS).

The preferred method is via conf files located at /etc/security/limits.d/
When the OS boots, /etc/security/limits.conf is read first, then each file under /etc/security/limits.d/ is read sequentially, and later files can and will override earlier ones (with 99 being the highest).

Meaning, any limits set in /etc/security/limits.d/99-mylimits.conf will override all previous settings. I suspect something similar happened in your case.

0 Karma

Builder

@codebuilder, thank you for the detailed reply!

ulimit -a from the command line as the splunk user shows the correct 64000 value, but splunkd.log shows that on reboot Splunk detected ulimit -n as the default 1024.
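That splunkd.log entry can be checked directly; a sketch, assuming a default UF install path and the "ulimit - Limit: open files: N files" line splunkd writes at startup:

```shell
# Pull the last "open files" limit splunkd recorded at startup.
# The log path and log-line format are assumptions; adjust for your install.
LOG=${LOG:-/opt/splunkforwarder/var/log/splunk/splunkd.log}
grep -o 'open files: [0-9]*' "$LOG" | tail -1
```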

We did use the /etc/init.d/splunk method previously.

What bothers me is that the server was rebooted before as well, yet ulimit -n was still 64000 according to splunkd.log.
So why the sudden switch to the default ulimit -n of 1024 this time?

0 Karma

Motivator

Glad to help. I fought the same issue myself previously, and only on reboots.

The underlying problem is that Splunk is running under init.d on a systemd system, and limits are applied differently there than under the older init.

The sequence in which limits are read and applied by the kernel and the process is out of sync on reboot, so it falls back to the OS defaults.

You can solve it by creating a systemd unit file for Splunk, as it should technically be configured; but placing the limits in the startup script solved the issue for me.

Also, I would consider the cat /proc/<splunk_pid>/limits method the definitive source of truth for the limits actually applied to the process. Hope this helps.

0 Karma

SplunkTrust

I think since you have a support case with Splunk, it would be good to take their advice, as they can review your config and server setup.

Having said that, large flat files go through the batch process/pipeline, and it does take a while to see them at the indexer/search head. Any chance of creating smaller files, maybe at more frequent intervals, as opposed to one or two very large files a day? Smaller files get parsed/processed quickly, and you should still be able to achieve the same expected results.
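If the producing application can't be changed, splitting the large file before it lands in the monitored directory is one hypothetical way to get smaller files; the file name, prefix, and chunk size below are placeholders:

```shell
# Break a large flat file into ~100 MB numbered pieces so the UF sees
# many small files instead of one huge one. Names are illustrative.
split -b 100m -d big_input.log chunk_
```

This produces chunk_00, chunk_01, and so on, which the forwarder can pick up independently.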

0 Karma

Builder

@lakshman239, to the best of my knowledge we cannot create smaller files, but I will verify that.

0 Karma