
Splunk App for Unix intermittently runs server out of memory

redc
Builder

I just installed the Splunk App for Unix on my Splunk indexer/search head server and installed the add-on on 17 small Linux servers to forward their data into Splunk. Five of these Linux boxes (server naming convention "ipp-wpg-uniquename") run Apache (httpd), and I want to know how many Apache threads are open on them. In our environment, a spike in Apache thread consumption is often the first indicator of a larger problem. So the sourcetype I care about is "ps".

Here's the search I wrote:

sourcetype="ps" host="ipp-wpg-*" app="httpd" USER="webster" earliest=-1m@m latest=@m

This returns only the ps events for httpd processes run by the webster user on those 5 servers: roughly 240-300 events, or about 20% of all ps events logged over the same period. When I run it manually, it returns in 1-3 seconds and uses about 200MB of memory on the indexer/search head server.

I set this up as a scheduled saved search that runs every minute, with the following commands appended:

 | timechart span=1m dc(pid) by host | inputlookup append=t apache_thread_counts.csv | outputlookup apache_thread_counts.csv

This gives me a per-host count with a timestamp and accumulates the results in a CSV lookup (a trick I got from Alex Raitz's blog post), so I can view historical data without rerunning the original search over a longer time range. Again, when I run it manually it returns in 3-5 seconds and uses about 200MB of memory.
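For reference, the scheduled version could be expressed in savedsearches.conf roughly as below. This is a sketch under assumptions: the stanza name `Apache thread counts` is hypothetical, and the time range is moved out of the search string into the `dispatch.*` settings so the schedule and the search window are defined side by side.

```ini
# savedsearches.conf (e.g. in the app's local/ directory)
# Hypothetical stanza name; the search string is the one from this post.
[Apache thread counts]
search = sourcetype="ps" host="ipp-wpg-*" app="httpd" USER="webster" | timechart span=1m dc(pid) by host | inputlookup append=t apache_thread_counts.csv | outputlookup apache_thread_counts.csv
# Run at the top of every minute
enableSched = 1
cron_schedule = * * * * *
# Search the previous whole minute, snapped to minute boundaries
dispatch.earliest_time = -1m@m
dispatch.latest_time = @m
```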

In roughly 1 out of every 12-15 runs of the scheduled saved search, it consumes over 20GB of memory on the indexer/search head server for no apparent reason. The server only has 24GB of memory and typically idles at about 4GB, so it runs out of memory and the scheduled search fails; if this happens several times in a short window, the splunkd service crashes. The spikes don't consistently coincide with any other scheduled saved search on the server, so this doesn't appear to be a conflict with another search.

I need 60-second granularity for our SLA. I've tried changing the search to cover 2 minutes' worth of data and run only every 2 minutes, but that doesn't seem to help at all.
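Because the search uses `timechart span=1m`, a 2-minute window still produces one row per minute (with a column per host), so the 2-minute schedule keeps the 60-second granularity. A sketch of that variant, again assuming a hypothetical stanza name `Apache thread counts`:

```ini
# savedsearches.conf: 2-minute variant (only the changed settings shown)
[Apache thread counts]
# Run every 2 minutes
cron_schedule = */2 * * * *
# Cover the previous 2 whole minutes; span=1m still yields 1-minute buckets
dispatch.earliest_time = -2m@m
dispatch.latest_time = @m
```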

I could reduce the frequency of ps.sh on the Linux servers from the default 30 seconds to 60 seconds, but I don't think that's the problem: if it were, I'd expect this absurd memory consumption on every run, not just once in a while.
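For reference, the ps.sh polling interval is set in the add-on's inputs.conf on each forwarder. A sketch, assuming the add-on is installed as `Splunk_TA_nix` and uses a `[script://./bin/ps.sh]` stanza (check the add-on's `default/inputs.conf` to confirm both):

```ini
# Splunk_TA_nix/local/inputs.conf on each Linux forwarder
# Stanza name assumed to match the add-on's default/inputs.conf
[script://./bin/ps.sh]
# Seconds between runs of ps.sh (the default is 30)
interval = 60
```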

Beefing up the server is of limited use (and isn't cheap), and given this behavior, I wouldn't be surprised if the search kept consuming memory until the server ran out, no matter how much hardware I throw at it.

Accepted solution

redc
Builder

I ended up creating a custom index (os_ps) and sending the output of ps.sh to it. This significantly reduced the amount of data the saved search has to load to find just the ps output, and it eliminated the memory issue.
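A minimal sketch of that setup, assuming the add-on is installed as `Splunk_TA_nix` on the forwarders and that ps.sh uses a `[script://./bin/ps.sh]` stanza. The index paths follow the usual `$SPLUNK_DB` layout; adjust them for your environment.

```ini
# indexes.conf on the indexer: define the new index
[os_ps]
homePath   = $SPLUNK_DB/os_ps/db
coldPath   = $SPLUNK_DB/os_ps/colddb
thawedPath = $SPLUNK_DB/os_ps/thaweddb

# Splunk_TA_nix/local/inputs.conf on each forwarder: route ps.sh output to it
[script://./bin/ps.sh]
index = os_ps
```

The saved search then names the index explicitly, e.g. `index=os_ps sourcetype="ps" host="ipp-wpg-*" app="httpd" USER="webster"`, so Splunk only has to scan the ps events rather than the larger index they were previously mixed into.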


redc
Builder

Point of clarification: our indexer/search head server is a virtual machine running Windows Server 2008 R2, with 8 CPUs and the 24GB of RAM mentioned above.
