<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SplunkD Causing Linux OOM Condition in Monitoring Splunk</title>
    <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248473#M2423</link>
    <description>&lt;P&gt;Being pedantic, THP is a feature of the kernel you're running, not of the VM itself.  That said, if your kernel does not support THP then it's not an issue.  Also, THP won't make your box use more memory - it will just make it use VASTLY more CPU doing memory management work.  (Imagine a background process doing a 'defrag' on memory all the time, and the extra CPU it would need.)&lt;/P&gt;

&lt;P&gt;An important question is &lt;STRONG&gt;which&lt;/STRONG&gt; splunkd processes are being killed by OOMKiller.   There is the "main" process, which is a long-running daemon, and there are (hopefully) short-lived search-runner processes for each concurrent search.  These two types should exhibit different memory usage patterns.  &lt;/P&gt;

&lt;P&gt;If the main splunkd is the one being killed all the time, there's probably a memory leak somewhere and your configuration is irritating it.  If a search process is being killed, then it's likely a feature of a search you're running, like large cardinality in a stats operation or similar.  &lt;/P&gt;

&lt;P&gt;In the snippet of log you posted, OOMKiller killed a splunk process that had a virtual size of 32,101,400kB (or about 32 GB) and a resident memory usage of 29,769,432kB (or about 29GB).  (The total_vm and rss columns in that table are counted in 4kB pages, so 8,025,350 and 7,442,358 pages work out to exactly those figures.)  That's awfully close to the 30 GB size of your VM.&lt;/P&gt;

&lt;P&gt;Since it's a VM, you can (in theory) keep shoveling coal into the firebox by adding RAM until this stops.   While "12GB" is considered the "current reference hardware" for a (non-ES) search head (&lt;A href="http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware"&gt;http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware&lt;/A&gt;), sometimes, depending on your search particulars, much, much more can be needed.&lt;/P&gt;

&lt;P&gt;To debug this you may need to temporarily add RAM in order to keep it stable  until you can figure out where the leak is or if it's related to a specific search.  I would engage support as they have the troubleshooting tools for this type of problem.&lt;/P&gt;

&lt;P&gt;Also, 6.2.3 is OLD by now.  Consider upgrading to the latest 6.2, or even 6.3.3!  Your problem may be fixed there.&lt;/P&gt;</description>
    <pubDate>Wed, 16 Mar 2016 08:27:44 GMT</pubDate>
    <dc:creator>dwaddle</dc:creator>
    <dc:date>2016-03-16T08:27:44Z</dc:date>
    <item>
      <title>SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248470#M2420</link>
      <description>&lt;P&gt;I have a search head running splunk 6.2.3, in a non-clustered distributed environment, which is sporadically having the Linux OOM killer cause the splunkd process to crash. I have looked into the known THP issue and that isn't the cause (the particular type of VM it's on doesn't support THP). There is no rhyme or reason as to why it is forced to crash. Sometimes it takes 1 hour, sometimes it takes 3 hours. The VM in question is fairly robust memory-wise (30 GB). I'm also kinda new to splunk so I'm hoping there is a simple fix that I haven't run across yet while crawling through here. Any and all help would be appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Mar 2016 13:25:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248470#M2420</guid>
      <dc:creator>tccooper</dc:creator>
      <dc:date>2016-03-15T13:25:31Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248471#M2421</link>
      <description>&lt;P&gt;I would start by checking for any errors/warnings found in &lt;CODE&gt;index=_internal&lt;/CODE&gt; which occur around the time of the outage.  That may give you the proper solution in itself.  If not, then please post any of these errors/warnings you find and we'll be able to decipher them for you.&lt;/P&gt;
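
&lt;P&gt;For example, something along these lines (a rough sketch; it assumes the usual log_level and component fields on sourcetype=splunkd, and you would narrow the time range to the minutes leading up to a crash) can surface the noisy components:&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN)&lt;BR /&gt;| stats count by component, log_level&lt;BR /&gt;| sort - count&lt;/CODE&gt;&lt;/P&gt;</description>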
      <pubDate>Tue, 15 Mar 2016 14:22:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248471#M2421</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2016-03-15T14:22:19Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248472#M2422</link>
      <description>&lt;P&gt;No errors/warnings were posted in the _internal index in the 5 minutes leading up to the OOM force close, just INFO-level metrics logs. Here is the process usage printout from /var/log/messages when SplunkD invoked the OOM killer:&lt;/P&gt;

&lt;P&gt;[   pid ]   uid tgid    total_vm    rss nr_ptes nr_pmds swapents    oom_score_adj   name&lt;BR /&gt;
[   1130]   0   1130    2735    101 11  3   0   -1000   udevd&lt;BR /&gt;
[   1231]   0   1231    2734    96  10  3   0   -1000   udevd&lt;BR /&gt;
[   1233]   0   1233    2734    102 10  3   0   -1000   udevd&lt;BR /&gt;
[   1567]   0   1567    2340    123 10  3   0   0   dhclient&lt;BR /&gt;
[   1608]   0   1608    28024   114 24  3   0   -1000   auditd&lt;BR /&gt;
[   1626]   0   1626    61894   634 24  4   0   0   rsyslogd&lt;BR /&gt;
[   1637]   0   1637    3459    71  10  3   0   0   irqbalance&lt;BR /&gt;
[   1648]   81  1648    5448    59  14  3   0   0   dbus-daemon&lt;BR /&gt;
[   1732]   0   1732    19454   204 39  3   0   -1000   sshd&lt;BR /&gt;
[   1749]   38  1749    7321    144 19  3   0   0   ntpd&lt;BR /&gt;
[   1764]   0   1764    22240   461 45  3   0   0   sendmail&lt;BR /&gt;
[   1772]   51  1772    20104   367 41  3   0   0   sendmail&lt;BR /&gt;
[   2078]   0   2078    29879   158 15  3   0   0   crond&lt;BR /&gt;
[   2088]   0   2088    4267    40  12  3   0   0   atd&lt;BR /&gt;
[   2114]   0   2114    1615    31  9   3   0   0   agetty&lt;BR /&gt;
[   2115]   0   2115    1078    24  8   3   0   0   mingetty&lt;BR /&gt;
[   2119]   0   2119    1078    23  8   3   0   0   mingetty&lt;BR /&gt;
[   2122]   0   2122    1078    23  8   4   0   0   mingetty&lt;BR /&gt;
[   2124]   0   2124    1078    23  8   3   0   0   mingetty&lt;BR /&gt;
[   2126]   0   2126    1078    24  8   3   0   0   mingetty&lt;BR /&gt;
[   2128]   0   2128    1078    24  8   3   0   0   mingetty&lt;BR /&gt;
[20972] 0   20972   8025350 7442358 15342   34  0   0   splunkd&lt;BR /&gt;
[20973] 0   20973   14413   1135    25  3   0   -1000   splunkd&lt;BR /&gt;
[20987] 0   20987   63977   7339    58  3   0   0   mongod&lt;BR /&gt;
[21065] 0   21065   525407  43490   217 5   0   0   python&lt;BR /&gt;
[21088] 0   21088   24650   2038    48  3   0   0   splunkd&lt;BR /&gt;
[26700] 0   26700   42073   20408   76  3   0   0   splunkd&lt;BR /&gt;
[26703] 0   26703   14415   1082    21  3   0   -1000   splunkd&lt;BR /&gt;
Out of  memory: Kill    process 20972   (splunkd)   score   938 or  sacrifice   child&lt;BR /&gt;
Killed  process 20972   (splunkd)   total-vm:32101400kB,    anon-rss:29769432kB,    file-rss:0kB&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 09:06:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248472#M2422</guid>
      <dc:creator>tccooper</dc:creator>
      <dc:date>2020-09-29T09:06:20Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248473#M2423</link>
      <description>&lt;P&gt;Being pedantic, THP is a feature of the kernel you're running, not of the VM itself.  That said, if your kernel does not support THP then it's not an issue.  Also, THP won't make your box use more memory - it will just make it use VASTLY more CPU doing memory management work.  (Imagine a background process doing a 'defrag' on memory all the time, and the extra CPU it would need.)&lt;/P&gt;

&lt;P&gt;An important question is &lt;STRONG&gt;which&lt;/STRONG&gt; splunkd processes are being killed by OOMKiller.   There is the "main" process, which is a long-running daemon, and there are (hopefully) short-lived search-runner processes for each concurrent search.  These two types should exhibit different memory usage patterns.  &lt;/P&gt;
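
&lt;P&gt;If introspection data is being collected on that host, one rough way to tell them apart is to chart per-process memory from the introspection index (a sketch; it assumes the splunk_resource_usage sourcetype and its data.* fields are present in your 6.x install; the long-lived pid will be the main daemon, while short-lived pids are search processes):&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;index=_introspection sourcetype=splunk_resource_usage component=PerProcess data.process=splunkd&lt;BR /&gt;| timechart span=1m max(data.mem_used) by data.pid&lt;/CODE&gt;&lt;/P&gt;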

&lt;P&gt;If the main splunkd is the one being killed all the time, there's probably a memory leak somewhere and your configuration is irritating it.  If a search process is being killed, then it's likely a feature of a search you're running, like large cardinality in a stats operation or similar.  &lt;/P&gt;

&lt;P&gt;In the snippet of log you posted, OOMKiller killed a splunk process that had a virtual size of 32,101,400kB (or about 32 GB) and a resident memory usage of 29,769,432kB (or about 29GB).  (The total_vm and rss columns in that table are counted in 4kB pages, so 8,025,350 and 7,442,358 pages work out to exactly those figures.)  That's awfully close to the 30 GB size of your VM.&lt;/P&gt;

&lt;P&gt;Since it's a VM, you can (in theory) keep shoveling coal into the firebox by adding RAM until this stops.   While "12GB" is considered the "current reference hardware" for a (non-ES) search head (&lt;A href="http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware"&gt;http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware&lt;/A&gt;), sometimes, depending on your search particulars, much, much more can be needed.&lt;/P&gt;

&lt;P&gt;To debug this you may need to temporarily add RAM in order to keep it stable  until you can figure out where the leak is or if it's related to a specific search.  I would engage support as they have the troubleshooting tools for this type of problem.&lt;/P&gt;

&lt;P&gt;Also, 6.2.3 is OLD by now.  Consider upgrading to the latest 6.2, or even 6.3.3!  Your problem may be fixed there.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Mar 2016 08:27:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248473#M2423</guid>
      <dc:creator>dwaddle</dc:creator>
      <dc:date>2016-03-16T08:27:44Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248474#M2424</link>
      <description>&lt;P&gt;One of your searches is doing something crazy.  Can you post your saved searches that kick off around the time of the failure?  It may be using a huge lookup file, or maybe you're joining 20 times and you've disabled the default limits.&lt;/P&gt;
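
&lt;P&gt;To see what the scheduler kicked off in the window before a crash, something along these lines may help (a sketch against the scheduler log; narrow the time range to the minutes before the OOM kill):&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;index=_internal sourcetype=scheduler&lt;BR /&gt;| stats count max(run_time) AS longest_run_time by app, savedsearch_name&lt;BR /&gt;| sort - longest_run_time&lt;/CODE&gt;&lt;/P&gt;</description>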
      <pubDate>Wed, 16 Mar 2016 11:11:23 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248474#M2424</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2016-03-16T11:11:23Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248475#M2425</link>
      <description>&lt;P&gt;Unfortunately, as to "which splunkd processes are being killed": it is always the main splunkd, since it is the process and the rest are threads ... so OOM is a nasty problem.&lt;/P&gt;

&lt;P&gt;I will share one issue we have seen with low-memory (32GB) hosts and many concurrent search jobs. The limits.conf defaults say to check 5 process runners and, if none are available, start another, but keep each one around for 7200 seconds, with each one holding a variable chunk of memory. Run dmesg on your crashed box and see how many search runners were alive at the moment of the crash. In our case we had a few hundred, so we trimmed the timeouts and expanded the limit from 5 to 20. This worked wonders in eliminating the frequent crashes and the bucket corruptions that can follow! The total number of search runners now spikes only briefly as needed.&lt;/P&gt;
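
&lt;P&gt;A rough way to see how many searches were in flight around a crash is the search_concurrency group in metrics.log (a sketch; the group and field names assume the 6.x metrics format):&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;index=_internal sourcetype=splunkd source=*metrics.log* group=search_concurrency "system total"&lt;BR /&gt;| timechart span=1m max(active_hist_searches) AS concurrent_searches&lt;/CODE&gt;&lt;/P&gt;</description>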
      <pubDate>Tue, 07 Nov 2017 18:26:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248475#M2425</guid>
      <dc:creator>mwk1000</dc:creator>
      <dc:date>2017-11-07T18:26:36Z</dc:date>
    </item>
    <item>
      <title>Re: SplunkD Causing Linux OOM Condition</title>
      <link>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248476#M2426</link>
      <description>&lt;P&gt;-- much of the large search load (7-8GB of memory per search) was in the data model accelerations, performance and another ITSI model I can't recall -- but this is very dependent on YOUR site's event volumes.&lt;/P&gt;

&lt;P&gt;The default of 3 concurrent accelerations per model per search head entity can create a large background load of searches -- BIG searches ....&lt;/P&gt;
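
&lt;P&gt;To gauge how much of that background load is acceleration, a starting point might be something like this (a sketch; it assumes the acceleration jobs appear in the scheduler log under the usual _ACCELERATE_ naming):&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;index=_internal sourcetype=scheduler savedsearch_name="_ACCELERATE_*"&lt;BR /&gt;| stats count avg(run_time) AS avg_run_time by savedsearch_name&lt;BR /&gt;| sort - avg_run_time&lt;/CODE&gt;&lt;/P&gt;</description>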
      <pubDate>Tue, 07 Nov 2017 18:32:10 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Monitoring-Splunk/SplunkD-Causing-Linux-OOM-Condition/m-p/248476#M2426</guid>
      <dc:creator>mwk1000</dc:creator>
      <dc:date>2017-11-07T18:32:10Z</dc:date>
    </item>
  </channel>
</rss>

