Monitoring Splunk

Memory Leak - Splunk 4.3.3

tbaeg
Explorer

I am having an issue with Splunk 4.3.3: for one reason or another, memory usage rises constantly and linearly.

The Splunk instance runs on RHEL 5 (kernel 2.6.18-238.el5) with 4 GB of RAM and 2 CPU cores; it is also worth noting that it is a VM. No searches are running and CPU usage is idle. I do not ingest an excessive volume of logs: 21 folders are monitored, with the followTail option enabled to prevent re-indexing, and 5 Windows machines send logs via TCP using the Universal Forwarder. Those are separated into 5 different indexes and are searchable and handled fine.

I have monitored the system via top, and it shows no memory allocated to cache or buffers, but for one reason or another usage just continues to rise. Over the span of a few hours the OOM killer is invoked and starts killing processes to free memory, but eventually all memory, including swap, is consumed and the system crashes. I have tried disabling ALL indexes and file monitoring (except main). Originally I was running 4.3.2, and upgraded to 4.3.3 to see if it would resolve the memory leak.
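One way to quantify the growth rate (rather than eyeballing top) is to log splunkd's resident memory at a fixed interval. A minimal sketch; the log path and 60-second interval are assumptions, adjust as needed:

```shell
# Log splunkd's resident memory every 60 seconds to confirm the
# linear growth over time.
while true; do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    # Sum RSS (in KB) across all splunkd processes; print 0 if none found
    rss=$(ps -C splunkd -o rss= | awk '{sum += $1} END {print sum + 0}')
    echo "$ts splunkd_rss_kb=$rss" >> /tmp/splunkd_mem.log
    sleep 60
done
```

Plotting the resulting numbers makes it easy to see whether the slope is truly linear and which restarts or config changes affect it.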

splunkd.log:

08-25-2012 08:10:34.142 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57483
08-25-2012 08:26:45.668 +0000 ERROR TcpInputFd - SSL_ERROR_SYSCALL ret errno:32
08-25-2012 08:27:29.414 +0000 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (4009562 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:28:25.020 +0000 ERROR TcpInputFd - SSL Error = error:00000000:lib(0):func(0):reason(0)
08-25-2012 08:28:36.526 +0000 ERROR TcpInputFd - ACCEPT_RESULT=-1 VERIFY_RESULT=0
08-25-2012 08:28:44.876 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (17956 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:29:16.211 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57484
08-25-2012 08:29:33.534 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (14323 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:30:36.464 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (20315 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:31:26.786 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (17646 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:32:16.086 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (12923 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:33:36.634 +0000 INFO  PipelineComponent - HTTPAuthManager:timeoutCallback() took longer than seems reasonable (22673 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:34:05.872 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (13138 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:34:28.833 +0000 INFO  PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (10159 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:46:57.738 +0000 ERROR TcpInputFd - SSL_ERROR_SYSCALL ret errno:32
08-25-2012 08:47:34.291 +0000 ERROR TcpInputFd - SSL Error = error:00000000:lib(0):func(0):reason(0)
08-25-2012 08:47:40.562 +0000 ERROR TcpInputFd - ACCEPT_RESULT=-1 VERIFY_RESULT=0
08-25-2012 08:48:25.859 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57485
08-25-2012 08:54:42.218 +0000 FATAL ProcessRunner - Unexpected EOF from process runner child!
08-25-2012 08:56:36.406 +0000 ERROR ProcessRunner - helper process seems to have died (child killed by signal 9: Killed)!

rsyslog messages:

Aug 25 08:31:41 server kernel: Node 0 HighMem: empty
Aug 25 08:31:41 server kernel: 1594 pagecache pages
Aug 25 08:31:41 server kernel: Swap cache: add 8929081, delete 8928541, find 1593018/3379335, race 7+4812
Aug 25 08:31:41 server kernel: Free swap  = 4941840kB
Aug 25 08:31:41 server kernel: Total swap = 5245212kB
Aug 25 08:31:41 server kernel: Free swap:       4941840kB
Aug 25 08:31:41 server kernel: 1310720 pages of RAM
Aug 25 08:31:41 server kernel: 299836 reserved pages
Aug 25 08:31:41 server kernel: 5504 pages shared
Aug 25 08:31:41 server kernel: 543 pages swap cached
Aug 25 08:31:41 server kernel: Out of memory: Killed process 21471, UID 1502, (scanner).
Aug 25 08:31:41 server kernel: splunkd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0

The only other process that runs is a log rotation by auditd. Could any guest OS settings cause Splunk to leak memory like this? Any ideas on how to resolve this would be great.

lguinn2
Legend

I don't know about the memory leak, but I definitely have some thoughts.

"21 folders are monitored for logs with the followTail option enabled to prevent re-indexing."

followTail does not prevent re-indexing. It only tells Splunk to start at the end of the file the first time that it sees a file; after that, it is ignored.
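For reference, followTail is a per-stanza setting in inputs.conf; a minimal sketch with a hypothetical monitor path:

```ini
# inputs.conf -- hypothetical monitor stanza
[monitor:///var/log/myapp]
# followTail = 1 only applies the FIRST time Splunk sees a file:
# it starts reading at the end rather than the beginning.
# It does not prevent re-indexing afterward.
followTail = 1
```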

Can you run the following command, to see what is actually being monitored?

./splunk list monitor

I wonder if Splunk is perhaps monitoring more than you realize. This can happen when many files exist, but are inactive. Splunk still monitors these files, although you may not see any data. Monitoring inactive files costs both memory and CPU cycles. It helps to regularly archive and remove files that are no longer being used.
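To get a rough count of how many files are actually being tracked, you could pipe the command's output through wc. A sketch; the /opt/splunk path is an assumption, point it at your installation:

```shell
# Count entries reported by the tailing processor.
SPLUNK_HOME=/opt/splunk
"$SPLUNK_HOME/bin/splunk" list monitor | wc -l
```

If the count is far larger than the 21 folders you expect, stale or rotated files are probably still being tracked.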

The "took longer than seems reasonable" messages in splunkd.log clearly show that Splunk is starved for resources.

Your VM is massively underpowered compared to a physical Splunk installation. Do you know how many I/O operations per second (IOPS) your VM can sustain? Splunk needs roughly 800 IOPS to perform well; many VMs can only deliver about 50.

As far as your guest OS goes, I don't know of any problem between Splunk and the guest. However, memory ballooning could certainly be depriving Splunk of memory that it needs. You should set a memory reservation for this VM, and perhaps a CPU reservation as well; in addition, turn off memory overcommit. There is an article (PDF) on running Splunk in a VM: Splunk and VMware VMs Tech Brief

Another related post: Can I run Splunk in a VM

And if Splunk is struggling and timing out in various places, perhaps that either causes a memory leak or causes massive memory usage. I don't know.

tbaeg
Explorer

Regardless, I have set up a reservation of 12 GB of RAM and 2 sockets with 2 cores; I really don't see the need for more than that. The IOPS count was a concern for me, but it doesn't explain a linear growth in RAM usage that eventually leads to an out-of-memory condition and then system failure. Any other ideas?


tbaeg
Explorer

I have checked the list of files being monitored, and it is not outside the bounds of what I configured; I have also set up whitelists so that only matching files in those directories are monitored. Poor performance I could understand, but that isn't the issue. The articles you linked talk about performance effects, and my issue isn't performance-related at all: real-time searches and indexing are handled without a hitch. We get less than 20 MB of logs per day (for now), so I can't say Splunk is performing any memory- or CPU-intensive tasks.


dwaddle
SplunkTrust
SplunkTrust

This is probably unrelated to your problem, but followTail will not, in most cases, prevent re-indexing. For this problem I would recommend opening a Splunk support case.
