
Anyone know what a STMgr "out of memory failure" is about?

Lowell
Super Champion

Last night I started seeing a massive flood of errors like this in my splunkd.log on my central indexer. Does anyone know what these mean exactly, and what would cause this type of problem?

11-18-2010 09:10:02.380 ERROR STMgr - dir='/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_121' out of memory failure rc=1 warm_rc[-2,12] from st_txn_start
11-18-2010 09:10:02.381 ERROR StreamGroup - unexpected rc=1 from IndexableValue->index
11-18-2010 09:10:02.384 ERROR STMgr - dir='/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_121' out of memory failure rc=1 warm_rc[-2,12] from st_txn_start
11-18-2010 09:10:02.384 ERROR StreamGroup - unexpected rc=1 from IndexableValue->index
11-18-2010 09:10:02.396 ERROR STMgr - dir='/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_121' out of memory failure rc=1 warm_rc[-2,12] from st_txn_start
11-18-2010 09:10:02.396 ERROR StreamGroup - unexpected rc=1 from IndexableValue->index

This seems to have caused a large number of events to be dropped. Based on some admittedly rough calculations, I'd say that while this was occurring I lost upwards of 50% of the events I was expecting, though I'm not sure how to measure that precisely. I have several processes that poll once a minute, so I should see about 60 events per hour from each; for some hours I only see around 30 events, which is where the 50% estimate comes from.
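
If anyone wants to run a similar sanity check, here's a rough sketch of the kind of search I'm using. The sourcetype is just a placeholder for whatever your once-a-minute poller writes, the index may not be main in your environment, and expected=60 assumes a one-minute polling interval:

index=main sourcetype=<<YOUR_POLLING_SOURCETYPE>>
| timechart span=1h count
| eval expected=60
| eval pct_received=round(count/expected*100,1)

If dropped events really are the cause, any hour where pct_received falls well below 100 should line up with the error flood in splunkd.log.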

At first I thought this only affected a single bucket (the hot bucket for events with current timestamps in the main index), but after some additional searching it appears that all of the indexes were affected, so this doesn't seem to be a problem with a single index or bucket.

I ran this search:

index=_internal sourcetype=splunkd ERROR STMgr "out of memory failure" | stats count by dir, rc | sort -count

The frequency of these errors seems to be proportional to the number of events each index receives. Also, the return code (rc) is always 1.
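
For anyone who wants to reproduce that observation, this is roughly how I'm slicing it; the rex assumes the default $SPLUNK_DB directory layout shown in the errors above, so adjust the pattern if your index paths differ:

index=_internal sourcetype=splunkd ERROR STMgr "out of memory failure"
| rex field=dir "/(?<idx_dir>[^/]+)/db/hot_"
| timechart span=10m count by idx_dir

In my case the busier index directories show proportionally more errors, which is what makes me think this is load-related rather than specific to one bucket.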


Additional info:

I'm running Splunk 4.1.5 on Ubuntu 8.04 (32-bit). I don't see any weird kernel messages about memory issues on the box.

I restarted "splunk" and the problem seems to have gone away; at least for now.


Update:

On a second look, this does seem to be memory related. I'm thinking the core issue is a memory leak in splunkd. The process is gaining about 5 MB per hour on my system, and with a 32-bit OS that means I'll hit the 2 GB per-process limit after about two and a half weeks (2 GB / 5 MB per hour ≈ 400 hours, or roughly 17 days, ignoring whatever the process starts out using).

So now I'm looking for a few volunteers! Can some of you check your splunkd memory usage and report back your findings? This search only works if you've enabled the Unix app for Splunk (specifically, it relies on the "ps" sourcetype that app collects).

host=<<YOUR.SPLUNK.INDEXER>> sourcetype=ps splunkd
| multikv fields PID pctCPU pctMEM RSZ_KB VSZ_KB COMMAND
| search COMMAND="splunkd" NOT search pctMEM>0
| eval cmd=COMMAND."[".PID."]"
| timechart span=1d limit=30 median(eval(RSZ_KB/1024)) as RSZ_MB by cmd

Or, on Windows you could try:

host=<<SPLUNKHOST>> sourcetype=WMI:*Processes "Name=splunkd"
| eval proc=Name."[".IDProcess."]" 
| eval mb=PrivateBytes/1024/1024
| timechart span=6h avg(mb) as MemMb by proc

Run this over at least a 24-hour window, or a week if you want to see a longer trend. Let me know what you find. Thanks!
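
If you'd rather report a single growth-rate number than eyeball a chart, here's a rough sketch that works off the same ps data. It assumes a single long-running splunkd PID over the window and leans on stats first()/last() seeing events in reverse time order; if splunkd restarted during the window, split by PID first:

host=<<YOUR.SPLUNK.INDEXER>> sourcetype=ps splunkd earliest=-24h
| multikv fields PID RSZ_KB COMMAND
| search COMMAND="splunkd" NOT search
| stats min(_time) as t0 max(_time) as t1 last(RSZ_KB) as oldest_kb first(RSZ_KB) as newest_kb
| eval mb_per_hour=round(((newest_kb-oldest_kb)/1024)/((t1-t0)/3600),2)

A steadily positive mb_per_hour over a day or more is the leak signature I'm seeing here (about 5 MB/hour).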


Bryce_Kahle
New Member

I'm also getting this on 4.2.1 on Windows. The memory leak seemed to start after I upgraded from 4.1. Here is the graph of the search posted above for the last 30 days: [image: Splunk Memory Leak]


mntbighker
Path Finder

I'm getting this too with 4.1.7 on CentOS. I can get splunkd to crash just by going to the index health page. I can restart splunkd, but going to the health page reliably crashes it again.
