I suspect scheduled searches are either the cause, or a symptom of the cause, of splunkd using in the neighborhood of 13G of RAM and almost 60G of swap. No new searches have been added, and this happens several times per day. The web UI becomes unusable because it times out talking to splunkd, but everything under the hood eventually catches up, and then the server load returns to normal, as does splunkd's memory footprint.
I disabled the scheduler to see if this would fix it, but I won't know until at least tomorrow.
How can I narrow down what's causing this? Any hints on things to look for in the logs, or processes to strace, etc.? The server is a Linux box.
Thanks,
Jesse
Clarification: by splunkd above, I specifically mean the main splunkd process. No other process looks like a memory or CPU problem when checked directly with ps, top, etc.
It turns out that splunkd is spawning more than 20k threads, so this is really a problem with something other than RAM.
This is found with: ps -Lef
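For anyone who wants to watch this themselves, something along these lines should show the thread count growing (a sketch only; the pid-file path assumes a default install, so adjust if yours differs):

```
# Grab the main splunkd PID from the default pid file
SPLUNKD_PID=$(head -1 "$SPLUNK_HOME/var/run/splunk/splunkd.pid")

# NLWP = number of threads (lightweight processes) for that PID,
# alongside its resident and virtual memory
ps -o pid,nlwp,rss,vsz -p "$SPLUNKD_PID"

# Or count the per-thread lines that ps -Lf prints for it
ps -Lf -p "$SPLUNKD_PID" | tail -n +2 | wc -l
```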
It was version 5.0.2, and 5.0.2.4 fixed it.
Hi, could you share version information? I don't see any information about which version this was seen in and which point release fixed it.
This turned out to be a bug that was fixed in a subsequent point release.
If the web UI doesn't work, SoS isn't useful.
Start by paying attention to which processes are using all the memory. Is it the main splunkd process or the search processes? (SOS can help if this isn't easy to do yourself -- but doing it yourself with moderate time granularity will give you more data, e.g. while true; do sleep 60; date; ps aux | grep splunk; done)
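A slightly more structured version of that sampling loop might look like this (just a sketch; adjust the interval, columns, and output path to taste):

```
# Sample per-process memory for anything with "splunk" in the command line,
# once a minute; RSS/VSZ make it easy to spot which process is growing.
while true; do
    date
    ps -eo pid,ppid,nlwp,rss,vsz,args | grep -i '[s]plunk'
    sleep 60
done >> /tmp/splunk_mem_samples.log
```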
If you're running Splunk 5.x on Linux you can generate memory profiles using jemalloc by just switching on its MALLOC_CONF environment variable. I'll try to enrich this tomorrow with the specifics.
If it's Solaris, you can similarly switch on the DEBUG flags for libumem. Windows is a tougher road.
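For reference, the knobs look roughly like this (a sketch only -- the jemalloc profiling options only take effect if the bundled jemalloc was built with profiling enabled, so confirm against your build before relying on it):

```
# Linux / jemalloc: dump heap profiles periodically to jeprof.out.* files
export MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_interval:30"
$SPLUNK_HOME/bin/splunk restart

# Solaris / libumem: enable allocation debugging and transaction logging
export UMEM_DEBUG=default
export UMEM_LOGGING=transaction
export LD_PRELOAD=libumem.so.1
$SPLUNK_HOME/bin/splunk restart
```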
If this is an indexer on 4.x, one large known cause is bundle replication. Check the search heads for lots of large files under $SPLUNK_HOME/etc; lookups are the usual culprit.
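A quick way to spot those on a search head (a sketch; the 10MB threshold is arbitrary and $SPLUNK_HOME is assumed to be set):

```
# Large files under the Splunk configuration tree -- big CSV lookups are the
# usual suspects behind slow or expensive bundle replication.
find "$SPLUNK_HOME/etc" -type f -size +10M -exec ls -lh {} \; | sort -k5 -h
```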
In general, this should probably be a support case. We should strive to have a better external page about memory growth, but it's still quite difficult to pin things down without knowing which process it is, the searches, the data, the growth rate, and the version.
Instead of ps, take a look at $SPLUNK_HOME/bin/splunk list jobs
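Something along these lines (a sketch; the CLI needs Splunk credentials, here passed with -auth, and the grep is just one way to pick out scheduler-dispatched jobs):

```
# List current search jobs from the CLI and pick out scheduled ones
$SPLUNK_HOME/bin/splunk list jobs -auth admin:changeme | grep -i scheduler
```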
This is a good call to use Splunk on Splunk, aka SOS. In addition, turn on perfmon for processes containing "splunk" in the name. You should be able to correlate events and searches there.
How do I translate what I see via ps to a scheduled/saved search inside Splunk?
The first thing I'd do is look at the jobs running during such a memory peak.
For example, I've recently managed to make the Splunk on my laptop use 15G simply by running a very unintelligent search involving multikv and a few huge almost-but-not-quite-table-like events.
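One way to see which jobs were running around a spike, and to map them back to scheduled searches, is to look at the dispatch directory (a sketch, assuming the default location; scheduled-search job IDs typically begin with scheduler__ and embed the owner, app, and saved search):

```
# Most recently dispatched search jobs, newest first; the directory name is
# the search ID (SID).
ls -lt "$SPLUNK_HOME/var/run/splunk/dispatch" | head -20

# Per-job dispatch directory sizes can also point at expensive searches.
du -sh "$SPLUNK_HOME/var/run/splunk/dispatch"/* | sort -h | tail -10
```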