I have a search head running Splunk 6.2.3 in a non-clustered distributed environment, and the Linux OOM killer is sporadically killing the splunkd process. I have looked into the known THP issue and that isn't the cause (the particular type of VM it's on doesn't support THP). There is no rhyme or reason as to when it is forced to crash. Sometimes it takes 1 hour, sometimes it takes 3 hours. The VM in question is fairly robust memory-wise (30 GB). I'm also kinda new to Splunk, so I'm hoping there is a simple fix that I haven't run across yet while crawling through here. Any and all help would be appreciated.
Unfortunately, the answer to "An important question is which splunkd processes" is always the main splunkd, since it is the process and the rest are threads ... so OOM is a nasty problem.
I will share one issue we have seen on low-memory (32 GB) hosts with many concurrent search jobs. The limits.conf defaults say to check 5 process runners and, if none is available, start another, but keep each one around for 7200 seconds, each allocating a variable chunk of memory. Run dmesg on your crashed box and see how many threads were running at the moment of the crash. In our case we had a few hundred, so we trimmed down the timelines and raised the limit from 5 to 20. This worked wonders in eliminating the frequent crashes and the possible bucket corruption that follows! The total number of search runners now spikes only briefly, as needed.
-- Much of the large search load (7-8 GB of memory per search) was in the data model accelerations, performance and another ITSI model I can't recall -- but this is very dependent on YOUR site's event volumes.
Default of 3 concurrent accelerations / model / sh entity can create a large background load of searches -- BIG searches ....
One of your searches is doing something crazy. Can you post your saved searches that kick off around the time of the failure? It may have a huge lookup file it is using or maybe you're joining 20 times and you've disabled the default limits.
Being pedantic, THP is a feature of the kernel you're running, not of the VM itself. That said, if your kernel does not support THP then it's not an issue. Also, THP won't make your box use more memory - it will just make it use VASTLY more CPU doing memory management work. (Imagine a background process doing a 'defrag' on memory all the time, and the extra CPU that would need.)
An important question is which splunkd processes are being killed by OOMKiller. There is the "main" process, which is a long-running daemon, and there are (hopefully) short-lived search-runner processes for each concurrent search. These two types should exhibit different memory usage patterns.
If the main splunkd is the one being killed all the time, there's probably a memory leak somewhere and your configuration is irritating it. If a search process is being killed, then it's likely a feature of a search you're running, like large cardinality in a stats operation or similar.
In the snippet of log you posted, OOMKiller killed a splunk process that had a virtual process size of 32,101,400 kB (or about 32 GB) and a resident memory usage of 29,769,432 kB (or about 29 GB). That's awfully close to the 30 GB size of your VM.
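To put those two figures next to the VM size, here is a quick arithmetic sketch. It assumes the kernel's "kB" means 1024 bytes and that the VM's "30 GB" is really 30 GiB; neither is stated in the log, so the rounded numbers shift slightly if you read the units differently.

```python
KIB = 1024
GIB = 1024 ** 3


def kib_to_gib(kib: int) -> float:
    """Convert a kernel-reported kB (treated as KiB) figure to GiB."""
    return kib * KIB / GIB


total_vm_kib = 32_101_400  # total-vm from the "Killed process" line
anon_rss_kib = 29_769_432  # anon-rss from the same line
vm_ram_gib = 30            # assumption: the VM's "30 GB" is 30 GiB

rss_gib = kib_to_gib(anon_rss_kib)
rss_share = rss_gib / vm_ram_gib

print(f"total-vm: {kib_to_gib(total_vm_kib):.1f} GiB")
print(f"anon-rss: {rss_gib:.1f} GiB ({rss_share:.0%} of assumed RAM)")
```

Either way you read the units, a single splunkd process was holding nearly all of the box's physical memory when it was killed.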
Since it's a VM, you can (in theory) keep shoveling coal into the firebox by adding RAM until this stops. While "12GB" is considered the "current reference hardware" for a (non-ES) search head (http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware), sometimes, depending on your search particulars, much, much more can be needed.
To debug this you may need to temporarily add RAM in order to keep it stable until you can figure out where the leak is or if it's related to a specific search. I would engage support as they have the troubleshooting tools for this type of problem.
Also, 6.2.3 is OLD by now. Consider upgrading to the latest 6.2, or even 6.3.3! Your problem may be fixed there.
I would start by checking for any errors / warnings found in index=_internal
which occur around the time of the outage. That may give you the proper solution in itself. If not, then please post any of these errors/warnings you find and we'll be able to decipher them for you.
No errors/warnings posted in the _internal index in the previous 5 minutes leading up to the OOM force close, just INFO level metrics logs. Here is the process usage printout from /var/log/messages when SplunkD invoked the OOM killer:
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1130] 0 1130 2735 101 11 3 0 -1000 udevd
[ 1231] 0 1231 2734 96 10 3 0 -1000 udevd
[ 1233] 0 1233 2734 102 10 3 0 -1000 udevd
[ 1567] 0 1567 2340 123 10 3 0 0 dhclient
[ 1608] 0 1608 28024 114 24 3 0 -1000 auditd
[ 1626] 0 1626 61894 634 24 4 0 0 rsyslogd
[ 1637] 0 1637 3459 71 10 3 0 0 irqbalance
[ 1648] 81 1648 5448 59 14 3 0 0 dbus-daemon
[ 1732] 0 1732 19454 204 39 3 0 -1000 sshd
[ 1749] 38 1749 7321 144 19 3 0 0 ntpd
[ 1764] 0 1764 22240 461 45 3 0 0 sendmail
[ 1772] 51 1772 20104 367 41 3 0 0 sendmail
[ 2078] 0 2078 29879 158 15 3 0 0 crond
[ 2088] 0 2088 4267 40 12 3 0 0 atd
[ 2114] 0 2114 1615 31 9 3 0 0 agetty
[ 2115] 0 2115 1078 24 8 3 0 0 mingetty
[ 2119] 0 2119 1078 23 8 3 0 0 mingetty
[ 2122] 0 2122 1078 23 8 4 0 0 mingetty
[ 2124] 0 2124 1078 23 8 3 0 0 mingetty
[ 2126] 0 2126 1078 24 8 3 0 0 mingetty
[ 2128] 0 2128 1078 24 8 3 0 0 mingetty
[20972] 0 20972 8025350 7442358 15342 34 0 0 splunkd
[20973] 0 20973 14413 1135 25 3 0 -1000 splunkd
[20987] 0 20987 63977 7339 58 3 0 0 mongod
[21065] 0 21065 525407 43490 217 5 0 0 python
[21088] 0 21088 24650 2038 48 3 0 0 splunkd
[26700] 0 26700 42073 20408 76 3 0 0 splunkd
[26703] 0 26703 14415 1082 21 3 0 -1000 splunkd
Out of memory: Kill process 20972 (splunkd) score 938 or sacrifice child
Killed process 20972 (splunkd) total-vm:32101400kB, anon-rss:29769432kB, file-rss:0kB
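For anyone reading a dump like this, a minimal Python sketch for ranking the processes by resident memory follows. It assumes the total_vm and rss columns are counted in 4 KiB pages (typical on x86-64, but an assumption here), and the helper names parse_oom_table and top_by_rss are made up for illustration. Only a few rows from the table above are pasted in.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, typical on x86-64

# A few rows copied from the dump above; paste the full table in practice.
DUMP = """\
[ 1626] 0 1626 61894 634 24 4 0 0 rsyslogd
[20972] 0 20972 8025350 7442358 15342 34 0 0 splunkd
[20973] 0 20973 14413 1135 25 3 0 -1000 splunkd
[20987] 0 20987 63977 7339 58 3 0 0 mongod
[21065] 0 21065 525407 43490 217 5 0 0 python
[21088] 0 21088 24650 2038 48 3 0 0 splunkd
[26700] 0 26700 42073 20408 76 3 0 0 splunkd
[26703] 0 26703 14415 1082 21 3 0 -1000 splunkd
"""

# Columns: [pid] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+\d+\s+\d+\s+(?P<total_vm>\d+)\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+\d+\s+-?\d+\s+(?P<name>\S+)"
)


def parse_oom_table(text):
    """Return one dict per process row, with sizes converted from pages to KiB."""
    rows = []
    for line in text.splitlines():
        m = ROW.search(line)
        if m:  # the header line and the "Killed process" lines do not match
            rows.append({
                "pid": int(m["pid"]),
                "name": m["name"],
                "total_vm_kib": int(m["total_vm"]) * PAGE_KIB,
                "rss_kib": int(m["rss"]) * PAGE_KIB,
            })
    return rows


def top_by_rss(rows, n=3):
    """Return the n rows with the largest resident set size."""
    return sorted(rows, key=lambda r: r["rss_kib"], reverse=True)[:n]


if __name__ == "__main__":
    for r in top_by_rss(parse_oom_table(DUMP)):
        print(f'{r["pid"]:>6} {r["name"]:<10} rss={r["rss_kib"] / 1024**2:6.2f} GiB')
```

With the 4 KiB assumption, the page counts for PID 20972 multiply out to exactly the total-vm and anon-rss values on the "Killed process" line, which suggests the assumption fits this box. Every other process in the table is tiny by comparison.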