Hopefully somebody can point us in the right direction:
We have a multisite indexer cluster: two sites, 4 indexers per site (Splunk v6.5.3).
A few months ago, following Splunk's recommendations, we increased ulimit -n to a higher value (16384) on all indexers
for the root and splunk users.
To make these changes persistent across reboots, our Unix SAs added this to the bottom of the /etc/security/limits.conf file:
splunkuser soft nofile 16384
splunkuser hard nofile 16384
root soft nofile 16384
root hard nofile 16384
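A quick way to confirm entries like these actually took effect is to check both the soft and the hard limit in a fresh login shell, since limits.conf sets them separately; a minimal sketch:

```shell
# Log in as the service account first (e.g. "su - splunkuser"), then
# check both limits -- limits.conf sets soft and hard independently,
# and a mismatch between them can cause confusing readings.
ulimit -Sn   # soft open-files limit for this shell
ulimit -Hn   # hard open-files limit (the ceiling)
```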
Running "ulimit -n" directly on all 8 servers returns the expected value (16384).
Still, when we run a health check via the Splunk Monitoring Console, it finds that all 4 servers on Site 1 have ulimits.open_files set to 4096, while ulimits.open_files is set to 16384 on all servers on Site 2.
How does Splunk check for ulimit, and what might be the cause of this discrepancy?
Have a look at this post; you will see a variant of the first option (init.d modification) there as well.
Once you have made your system modification, restart Splunk as a service and confirm by looking at splunkd.log that the values you have set are applied.
This is indeed an annoying issue with recent Linux distributions using systemd, and one that I hope to see fixed by Splunk someday.
To have Splunk use the correct ulimits when restarted as a service, you have a few simple options:
Modify the /etc/init.d/splunk script on your servers, replacing the existing start command within the start function with:
su - splunk -c "/opt/splunk/bin/splunk start --no-prompt --answer-yes"
This will start Splunk with the same environment conditions as when using the CLI.
Notes: You need to adapt the user name if different, and the path to Splunk as well if you are not using the default.
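For illustration, the relevant part of the modified init script might look like this (a sketch only; the surrounding boilerplate of your /etc/init.d/splunk will differ by Splunk version):

```shell
# Sketch of the start function in /etc/init.d/splunk after the change.
# The "su -" login shell gives splunkd the same environment (and thus
# the same ulimits) as a manual CLI start under the service account.
start() {
  echo "Starting Splunk..."
  su - splunk -c "/opt/splunk/bin/splunk start --no-prompt --answer-yes"
}
```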
After the modification, you usually need to run systemctl daemon-reload so systemd picks up the changed init script.

The second option is to create a systemd drop-in file at /etc/systemd/system/splunk.service.d/filelimit.conf (adapt the values to whatever you want to set):

[Service]
LimitNOFILE=20240
LimitNPROC=100000
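A sketch of the full sequence for the drop-in approach (run as root; the unit name "splunk" is an assumption and may differ on your distribution):

```
mkdir -p /etc/systemd/system/splunk.service.d
# (create filelimit.conf there with the [Service] settings shown above)
systemctl daemon-reload                        # re-read unit files
systemctl restart splunk                       # apply the new limits
systemctl show splunk --property=LimitNOFILE   # should print LimitNOFILE=20240
```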
The first option basically re-does what Splunk did with the enable boot-start option in older versions of Splunk.
In recent Splunk versions, a start as a service runs as root, and Splunk itself spawns the processes under the correct username defined in $SPLUNK_HOME/etc/splunk-launch.conf.
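For reference, that user is controlled by the SPLUNK_OS_USER setting in $SPLUNK_HOME/etc/splunk-launch.conf; a typical entry looks like this (the user name "splunk" here is an assumption, adjust to yours):

```
SPLUNK_OS_USER=splunk
```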
The second option is perfectly fine as well if your OS uses systemd.
I had another example of a similar issue recently with a customer using the pamd-tmp module in Ubuntu, where the first option was required to prevent Splunk from inheriting the $TMP value from root. But that is another story.
@guilmxm , I'm so grateful for your detailed answer! Thank you.
I will have to contact our Unix SAs to see if our Linux distribution is systemd compatible before we select the right option for us. I will update my post with results.
Well, when you start Splunk from a terminal under the Unix service account, the splunkd process will have the proper ulimits you have set, as you can observe by running "ulimit -a" in the terminal.
Once started from the CLI, a rolling restart, for instance, will keep the same ulimits as well.
The issue appears when Splunk is restarted as a system service at boot time, or using init.d or the service command.
So the sites might not have been started (or restarted) the same way.
Other possibilities are that this site does not have the same init.d script version (an older one from a previous Splunk release), does not run the same operating system version, or for some other reason is not configured exactly the same way and so is not affected.
To be sure of your settings on both sites, restart an instance from the terminal and check the splunkd logs (grep ulimit /opt/splunk/var/log/splunkd.log in a terminal, or search index=_internal sourcetype=splunkd ulimit). After that, restart the instance as root and as a service (service splunk restart), then verify the ulimit in the logs again.
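Another way to see the limits a running splunkd actually has, independent of what your shell reports, is to read them straight from /proc (a sketch; it falls back to the current shell's PID if no splunkd is running, so the command still produces output):

```shell
# Read the effective open-files limit of the running splunkd process.
# pgrep -o returns the oldest matching PID; fall back to the current
# shell ($$) when splunkd is not running.
pid=$(pgrep -o splunkd || echo $$)
grep "Max open files" "/proc/$pid/limits"
```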
@guilmxm, thank you so much for this detailed explanation!
We haven't yet made the changes you suggested in your original answer, as it is a production system and it takes time to test and then go through the normal process of modifying production servers.
One thing I wanted to bring up:
As you wrote, and as I've noticed myself, if the splunk process is running as "splunkd -p 8089 start" then we observe the ulimit discrepancy.
But if it's running as "splunkd -p 8089 restart" then there is no discrepancy.
The majority of our Splunk servers are running as "splunkd -p 8089 start" because they start via enable boot-start without a user session.
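A quick way to tell, per server, which way splunkd was launched is to look at its command line in ps; the bracketed pattern keeps grep from matching its own process:

```shell
# Show how each running splunkd was invoked: a "start" argv means it
# came up via boot-start/service, "restart" means a CLI (rolling) restart.
ps -eo pid,user,args | grep '[s]plunkd' || echo "no splunkd running"
```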