How does Splunk check for ulimit

mlevsh · ‎06-30-2017

Hopefully somebody can point us in the right direction:

We have a multisite indexer cluster: two sites, 4 indexers per site (Splunk v. 6.5.3).
A few months ago, following Splunk's recommendations, we increased ulimit -n to a higher value, 16384, on all indexers
for the root and splunk users.
To make these changes persistent across reboots, our Unix SAs added this to the bottom of the /etc/security/limits.conf file:

splunkuser soft nofile 16384
splunkuser hard nofile 16384
root soft nofile 16384
root hard nofile 16384

Running "ulimit -n" directly on all 8 servers returns an expected value: 16384.

Still, when we run the health check via the Splunk Monitoring Console, it finds that all 4 servers on Site 1 have ulimits.open_files set to 4096, while ulimits.open_files is set to 16384 on all servers on Site 2.

How does Splunk check for ulimit, and what might be the cause of this discrepancy?

guilmxm · ‎06-30-2017

Have a look at this post; it also describes a variant of the first option (the init.d modification).

https://answers.splunk.com/answers/223838/why-are-my-ulimits-settings-not-being-respected-on.html

Once you have made your system modification, restart Splunk as a service and confirm in splunkd.log that you see the values you have set.

guilmxm · ‎06-30-2017

Hello,

This is indeed an annoying issue with recent Linux distributions using systemd, and one that I hope Splunk will fix someday.

To have Splunk use the correct ulimits when restarted as a service, you have a few simple options:

You can:

Modify the /etc/init.d/splunk script on your servers, and replace the existing start command inside the start function with:

su - splunk -c "/opt/splunk/bin/splunk start --no-prompt --answer-yes"

This starts Splunk with the same environment as when it is started from the CLI.

Note: adapt the user name if yours is different, and the path to Splunk if you are not using the default.

After the modification, you usually need to run:

systemctl daemon-reload

If your system uses systemd, you can instead create a filelimit.conf drop-in file for the service:

/etc/systemd/system/splunk.service.d/filelimit.conf (adapt your values to whatever you want to set)

      [Service]
      LimitNOFILE=20240
      LimitNPROC=100000

And run:

systemctl daemon-reload

The first option basically redoes what the enable boot-start option did in old versions of Splunk.
In recent Splunk versions, a start as a service runs as root, and Splunk itself spawns its processes under the user name defined in $SPLUNK_HOME/etc/splunk-launch.conf.

The second option is perfectly fine as well if you have an OS using systemd.
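
As an illustration of the second option, here is a minimal sketch that writes the drop-in file shown above. The helper name write_filelimit_dropin is made up for this example, and the values are simply the ones from this post; adapt them to your own.

```shell
#!/bin/sh
# write_filelimit_dropin is a hypothetical helper, not part of Splunk or systemd.
# $1 = drop-in directory, $2 = LimitNOFILE value, $3 = LimitNPROC value.
write_filelimit_dropin() {
    mkdir -p "$1" || return 1
    printf '[Service]\nLimitNOFILE=%s\nLimitNPROC=%s\n' "$2" "$3" > "$1/filelimit.conf"
}

# Intended use as root (the unit name "splunk" comes from the path above):
#   write_filelimit_dropin /etc/systemd/system/splunk.service.d 20240 100000
#   systemctl daemon-reload
#   systemctl show splunk -p LimitNOFILE   # value systemd will apply to the unit
# The new limits only reach splunkd once the service has been restarted.
```

Keeping the directory as a parameter makes it possible to try the helper against a scratch directory before touching /etc/systemd/system.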

I recently had another example of a similar issue with a customer using the pamd-tmp module in Ubuntu, where the first option was required to stop Splunk from inheriting the $TMP value from root. But that is another story.

Cheers,

Guilhem

mlevsh · ‎06-30-2017

@guilmxm , I'm so grateful for your detailed answer! Thank you.
I will have to contact our Unix SAs to see if our Linux distribution is systemd compatible before we select the right option for us. I will update my post with results.

mlevsh · ‎07-01-2017

@guilmxm, just one more question. Do you have any idea why the servers on one site pick up the correct ulimits and all servers on the other site do not?

guilmxm · ‎07-01-2017

@mlevsh

Well, when you start Splunk from a terminal under the Unix service account, the splunkd process gets the ulimits you have set, which you can see by running "ulimit -a" in the terminal.
Once Splunk has been started from the CLI, a rolling restart, for instance, keeps the same ulimits.
The issue appears when Splunk is restarted as a system service, at boot time or via init.d or the service command.
So the two sites might not have been started (or restarted) the same way.

Other possibilities are that this site does not have the same init.d script version (an older one from a previous Splunk release), does not run the same operating system version, or for some other reason is not configured in exactly the same way and so is not affected.

To be sure of your settings on both sites, restart an instance from a terminal and check the splunkd logs (grep ulimit /opt/splunk/var/log/splunk/splunkd.log in a terminal, or search index=_internal sourcetype=splunkd ulimit). After that, restart the instance as root and as a service (service splunk restart) and verify the ulimit in the logs again.
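
As a complementary check, on Linux the limits actually in force for a running process can be read from /proc/<pid>/limits, whatever "ulimit -n" reports in your own shell. The sketch below is only an illustration: soft_nofile is a made-up helper name, and it assumes the main process is named splunkd and that pgrep is installed.

```shell
#!/bin/sh
# soft_nofile is a hypothetical helper: it prints the soft "Max open files"
# limit found in a /proc/<pid>/limits style file given as $1.
soft_nofile() {
    # Fields on that row: "Max" "open" "files" <soft> <hard> <units>
    awk '/^Max open files/ { print $4 }' "$1"
}

# Compare the current shell with the running splunkd process (Linux only).
if [ -r "/proc/$$/limits" ]; then
    echo "this shell: $(soft_nofile "/proc/$$/limits")"
fi
# Assumption: the oldest process named exactly "splunkd" is the main daemon.
pid=$(pgrep -o -x splunkd 2>/dev/null || true)
if [ -n "$pid" ] && [ -r "/proc/$pid/limits" ]; then
    echo "splunkd ($pid): $(soft_nofile "/proc/$pid/limits")"
fi
```

If the two numbers differ, the shell and the daemon did not get their limits from the same place, which is the situation described above. Reading another user's /proc/<pid>/limits normally requires running as that user or as root.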

mlevsh · ‎07-07-2017

@guilmxm, thank you so much for this detailed explanation!
We haven't yet made the changes you suggested in the original answer, as it is a production system and it takes time to test and then go through the normal process of modifying production servers.

One thing I wanted to bring up:

As you wrote, and as I've noticed myself, if the splunk process is running as "splunkd -p 8089 start" then we observe the ulimit discrepancy.
But if it's running as "splunkd -p 8089 restart" then there is no discrepancy.

But the majority of our Splunk servers are running as "splunkd -p 8089 start" because Splunk starts via enable boot-start without a user session.
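
To check this correlation on each server, something like the following sketch could list every splunkd process with how it was invoked and its open-files limit. The helper name start_mode is made up, and the script assumes Linux /proc, pgrep, and processes named exactly splunkd.

```shell
#!/bin/sh
# start_mode is a hypothetical helper: it prints the first "start" or
# "restart" word found in a command line such as "splunkd -p 8089 start".
start_mode() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "start" || $i == "restart") { print $i; exit }
    }'
}

# For each process named exactly "splunkd" (an assumption), print its PID,
# its start mode, and the soft "Max open files" limit from /proc.
for pid in $(pgrep -x splunkd 2>/dev/null || true); do
    args=$(ps -o args= -p "$pid") || continue
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null || true)
    echo "pid=$pid mode=$(start_mode "$args") nofile=$limit"
done
```

Run as root or as the Splunk user so that /proc/<pid>/limits is readable; otherwise the nofile column stays empty.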

diogofgm · ‎06-30-2017

Have you checked which user is running Splunk on the site 1 instances?

------------
Hope I was able to help you. If so, some karma would be appreciated.

mlevsh · ‎06-30-2017

@diogofgm, it's the same user that is running Splunk on both sites
