My Splunk indexer stopped with lots of ERROR messages ending in "Too many open files". ulimit -n shows 65536 for the splunk user.
I was able to just start Splunk again, and it has been running fine for two days now. Also, when I count the currently open files with
lsof -u splunk | awk 'BEGIN { total = 0; } $4 ~ /^[0-9]/ { total += 1 } END { print total }'
it shows about 2600 open files for the splunk user, so nothing serious. Obviously, simply increasing the nofile limit would not help to prevent this in the future. Any ideas on what could be done to further analyse this?
I am currently monitoring the number of open files over time. Maybe at the time of the crash (0:03 am) the number of open files was different from what it is now in the afternoon.
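For context, a minimal collection script for this could look something like the following sketch (the output path and cron schedule are just examples, not what I actually use):

#!/bin/bash
# Append a timestamped count of file descriptors held by the splunk user.
# Example cron entry: */5 * * * * /usr/local/bin/count-splunk-fds.sh
OUTFILE=/var/log/splunk-fd-count.log
COUNT=$(lsof -u splunk 2>/dev/null | awk '$4 ~ /^[0-9]/' | wc -l)
echo "$(date '+%Y-%m-%d %H:%M:%S') ${COUNT}" >> "${OUTFILE}"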
Edit after comments:
The OS is Red Hat 7.3.
It is a cluster of two indexers; only one went down.
Number of forwarders, as estimated by | metadata type=hosts | where now()-recentTime < (7*24*60*60) |stats dc(host)
is 1381.
Ulimit at startup, as determined by index=_internal host=myhost source="/opt/splunk/var/log/splunk/splunkd.log" Limit open files
is 65536, as stated before.
My number of open files over time shows a distinct increase around the time the server crashed, but it is still far below 65536. Still, this increase might already be part of the explanation, because the crash happened on a Monday morning, a time when a lot more scheduled searches might run than on a Thursday.
Here is the graph:
ulimit settings can be configured and overridden in many places and by multiple processes.
The effective values depend on whether you are running under init.d or systemd, whether you set ulimits via /etc/security/limits.conf or /etc/security/limits.d/*.conf, whether they are set directly in the systemd unit file, and so on (a sketch of common locations follows below).
The ONLY source of 100% truth on what limits are currently applied to a running process is the kernel itself.
If those values do not match what you expect, then you will need to work backwards to determine where they are being set or modified.
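As an illustration only, here is a sketch of two common places the splunk user's nofile limit can come from; the unit name Splunkd.service, the drop-in filename, and the limits.d filename are assumptions and may differ on your host.

For a systemd-managed Splunk, the limit is set by a LimitNOFILE= directive in the unit file or in a drop-in override (check the real unit name with systemctl list-units | grep -i splunk):

# mkdir -p /etc/systemd/system/Splunkd.service.d
# printf '[Service]\nLimitNOFILE=65536\n' > /etc/systemd/system/Splunkd.service.d/override.conf
# systemctl daemon-reload

For an init.d-managed Splunk, the limit typically comes from pam_limits via /etc/security/limits.conf or a drop-in such as /etc/security/limits.d/99-splunk.conf, and it only applies to sessions that go through PAM:

# printf 'splunk soft nofile 65536\nsplunk hard nofile 65536\n' > /etc/security/limits.d/99-splunk.conf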
Example method to query the kernel for the limits being applied:
First, find the PID of your Splunk process (your output will vary, and I've removed some lines from this example):
# ps -ef |grep -i splunk
root 34078 1 4 Jun18 ? 15:42:22 splunkd -p 8089 restart
root 34081 34078 0 Jun18 ? 00:06:55 [splunkd pid=34078] splunkd -p 8089 restart [process-runner]
root 70375 68047 0 13:19 pts/0 00:00:00 grep --color=auto -i splunk
In this example, 34081 is the PID of my running Splunk process.
Now, use that PID in order to find the limits that are currently applied to that process:
# cat /proc/34081/limits
Limit                   Soft Limit    Hard Limit    Units
Max cpu time            unlimited     unlimited     seconds
Max file size           unlimited     unlimited     bytes
Max data size           unlimited     unlimited     bytes
Max stack size          8388608       unlimited     bytes
Max core file size      unlimited     unlimited     bytes
Max resident set        unlimited     unlimited     bytes
Max processes           128616        128616        processes
Max open files          65536         65536         files
Max locked memory       65536         65536         bytes
Max address space       unlimited     unlimited     bytes
Max file locks          unlimited     unlimited     locks
Max pending signals     128616        128616        signals
Max msgqueue size       819200        819200        bytes
Max nice priority       0             0
Max realtime priority   0             0
Max realtime timeout    unlimited     unlimited     us
Compare these values to what you are expecting to see based on configuration files.
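If you only want the open-files row, a small sketch like this checks every process named splunkd (pgrep -x matches the exact process name, so it may miss child processes that have rewritten their name):

# for pid in $(pgrep -x splunkd); do echo "PID ${pid}:"; grep 'Max open files' "/proc/${pid}/limits"; done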
Hi Isaac.Hailperin@lcsystems.ch,
Did the issue ever get resolved?
I am facing the same kind of issue.
@iamarkaprabha not really. A migration to a new indexer cluster was due anyway for other reasons, and the problem has not resurfaced since. Also, a newer Splunk version was used. I cannot tell you which version we were using.
I have set the ulimit to 64000. Can anyone please let me know what the maximum ulimit setting for Splunk is?
What is your OS?
Running ulimit in your shell may not reflect the ACTUAL values that were in effect when splunkd started, depending on how you set it. Just to be sure, try running this search and look for your last startup of splunkd:
index=_internal source=*splunkd.log ulimit
How many forwarders are connecting to your indexer? Is there only one indexer?