Too many open files, despite ulimit of 65536

Isaac_Hailperin
Explorer

My Splunk indexer stopped with lots of ERROR messages ending in "Too many open files". ulimit -n shows 65536 for the splunk user.
I was able to simply start Splunk again, and it has been running fine for two days now. Also, when I count the currently open files with

lsof -u splunk | awk 'BEGIN { total = 0; } $4 ~ /^[0-9]/ { total += 1 } END { print total }'

it shows about 2600 open files for user splunk, so nothing serious. Obviously, simply increasing the nofile limit would not help prevent this in the future. Any ideas on what could be done to analyse this further?
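
A per-process breakdown straight from /proc could also show which splunkd process holds the most descriptors (a rough sketch, assuming a standard Linux /proc layout):

# per-process count of open file descriptors, highest first
# (run as root or as the splunk user so /proc/<pid>/fd is readable)
for pid in $(pgrep -u splunk); do
    printf '%s\t%s\t%s\n' "$(ls /proc/$pid/fd 2>/dev/null | wc -l)" "$pid" "$(tr '\0' ' ' < /proc/$pid/cmdline)"
done | sort -rn | head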

I am currently monitoring the number of open files over time. Maybe around the time of the crash (0:03 am) the number of open files was different from what I see now in the afternoon.
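
Roughly, such monitoring can be as simple as a periodic count appended to a file (a minimal sketch; script name and log path are arbitrary):

#!/bin/bash
# count_open_files.sh - append a timestamped count of files open by user splunk
# (run from cron, e.g. every minute)
count=$(lsof -u splunk | awk '$4 ~ /^[0-9]/' | wc -l)
echo "$(date '+%Y-%m-%d %H:%M:%S') $count" >> /var/log/splunk_open_files.log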

Edit after comments:
The OS is Red Hat 7.3.

It is a cluster of two indexers; only one went down.

The number of forwarders, as estimated by

| metadata type=hosts | where now()-recentTime < (7*24*60*60) | stats dc(host)

is 1381.

The ulimit at startup, as determined by

index=_internal host=myhost source="/opt/splunk/var/log/splunk/splunkd.log" Limit open files

is 65536, as stated before.

My nofile count over time shows a distinct increase around the time the server crashed, although still far from the 65536 limit. However, this increase might already be part of the explanation, because the crash happened on a Monday morning, a time when many more scheduled searches might run than on a Thursday afternoon.
Here is the graph:

[Graph: number of open files for user splunk over time, with an increase around the time of the crash]

codebuilder
Influencer

ulimit settings can be configured and overridden in many places and by multiple processes: it depends on whether you are running under init.d or systemd, whether you set ulimits via /etc/security/limits.conf or /etc/security/limits.d/*.conf, whether they are set directly in the systemd unit file, and so on...

The ONLY source of 100% truth on what limits are currently applied to a running process is the kernel itself.
If those values do not match what you expect, then you will need to work backwards to determine where they are being set/modified.

Example method to query the kernel for the limits being applied:

First, find the PID of your Splunk process (your output will vary, and I've removed some lines from this example):

# ps -ef |grep -i splunk
root      34078      1  4 Jun18 ?        15:42:22 splunkd -p 8089 restart
root      34081  34078  0 Jun18 ?        00:06:55 [splunkd pid=34078] splunkd -p 8089 restart [process-runner]
root      70375  68047  0 13:19 pts/0    00:00:00 grep --color=auto -i splunk

In this example, 34081 is the PID of my running Splunk process.

Now, use that PID in order to find the limits that are currently applied to that process:

# cat /proc/34081/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             128616               128616               processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       128616               128616               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Compare these values to what you are expecting to see based on configuration files.
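
For example, if Splunk runs under systemd and the kernel reports lower limits than you intended, a drop-in override for the unit is one common place to raise them (the unit name Splunkd.service below is an assumption; adjust it to your installation):

# systemctl edit Splunkd.service

Then add the desired limits to the override file that opens:

[Service]
LimitNOFILE=65536
LimitNPROC=65536

Restart the service and re-check /proc/<pid>/limits as shown above:

# systemctl restart Splunkd.service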

----
An upvote would be appreciated and Accept Solution if it helps!

iamarkaprabha
Contributor

Hi Isaac.Hailperin@lcsystems.ch,

Did the issue get resolved?
I am facing the same kind of issue.

Isaac_Hailperin
Explorer

@iamarkaprabha Not really. A migration to a new indexer cluster was due anyway for other reasons, and the problem has not resurfaced since. A newer Splunk version was also used; I cannot tell you which version we were using before.

harish_l
New Member

I have set the ulimit to 64000. Can anyone please let me know what the maximum ulimit setting for Splunk is?

mattymo
Splunk Employee

What is your OS?

ulimit may not reflect the ACTUAL values in effect when Splunk was started, depending on how you set it. Just to be sure, try running this search and look at your last startup of splunkd:

index=_internal source=*splunkd.log ulimit

- MattyMo

s2_splunk
Splunk Employee

How many forwarders are connecting to your indexer? Is there only one indexer?
