I am having a problem with Splunk 4.3.3: for one reason or another, its memory usage rises in a constant, linear fashion.
The Splunk instance runs on RHEL 5 (2.6.18-238.el5) with 4 GB of RAM and 2 CPU cores; it is also worth noting that it is a VM. There are no searches running and the CPU is idle. I do not have an excessive volume of logs being ingested: 21 folders are monitored for logs, with the followTail option enabled to prevent re-indexing, and 5 Windows machines send logs via TCP using the Universal Forwarder. The forwarded logs are separated into 5 different indexes and are searchable and handled fine.
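For reference, the relevant inputs look roughly like the sketch below. The path, port, and index name are placeholders rather than my actual configuration, and the SSL receiver stanza is an assumption on my part based on the TcpInputFd errors further down.

inputs.conf (sketch):

[monitor:///var/log/app1]
followTail = 1
index = main

[splunktcp-ssl:9997]
# the 5 Windows Universal Forwarders send to this SSL-enabled receiving port
# (port number is a placeholder)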
I have monitored the system via top, and it shows no memory allocated to cache or buffers; for one reason or another, usage just continues to rise. Over the span of a few hours the OOM killer is invoked and starts killing processes to try to free memory, but eventually all memory, including swap, is consumed and the system crashes. I have tried disabling ALL indexes and file monitoring (except main). Originally I was running 4.3.2 and upgraded to 4.3.3 to see if it would resolve the memory leak.
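To quantify the growth rate, a simple loop like the one below can log splunkd's resident size over time (the output path is arbitrary; this is a sketch, not something already in place):

# Sample splunkd memory every 60 seconds so the per-process growth is visible
while true; do
    { date; ps -C splunkd -o pid,rss,vsz,args; } >> /tmp/splunkd_mem.log
    sleep 60
done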
splunkd.log:
08-25-2012 08:10:34.142 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57483
08-25-2012 08:26:45.668 +0000 ERROR TcpInputFd - SSL_ERROR_SYSCALL ret errno:32
08-25-2012 08:27:29.414 +0000 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (4009562 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:28:25.020 +0000 ERROR TcpInputFd - SSL Error = error:00000000:lib(0):func(0):reason(0)
08-25-2012 08:28:36.526 +0000 ERROR TcpInputFd - ACCEPT_RESULT=-1 VERIFY_RESULT=0
08-25-2012 08:28:44.876 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (17956 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:29:16.211 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57484
08-25-2012 08:29:33.534 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (14323 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:30:36.464 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (20315 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:31:26.786 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (17646 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:32:16.086 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (12923 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:33:36.634 +0000 INFO PipelineComponent - HTTPAuthManager:timeoutCallback() took longer than seems reasonable (22673 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:34:05.872 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (13138 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:34:28.833 +0000 INFO PipelineComponent - IndexProcessor:ipCallback() took longer than seems reasonable (10159 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
08-25-2012 08:46:57.738 +0000 ERROR TcpInputFd - SSL_ERROR_SYSCALL ret errno:32
08-25-2012 08:47:34.291 +0000 ERROR TcpInputFd - SSL Error = error:00000000:lib(0):func(0):reason(0)
08-25-2012 08:47:40.562 +0000 ERROR TcpInputFd - ACCEPT_RESULT=-1 VERIFY_RESULT=0
08-25-2012 08:48:25.859 +0000 ERROR TcpInputFd - SSL Error for fd from HOST:X.X.X.X, IP:X.X.X.X, PORT:57485
08-25-2012 08:54:42.218 +0000 FATAL ProcessRunner - Unexpected EOF from process runner child!
08-25-2012 08:56:36.406 +0000 ERROR ProcessRunner - helper process seems to have died (child killed by signal 9: Killed)!
rsyslog messages:
Aug 25 08:31:41 server kernel: Node 0 HighMem: empty
Aug 25 08:31:41 server kernel: 1594 pagecache pages
Aug 25 08:31:41 server kernel: Swap cache: add 8929081, delete 8928541, find 1593018/3379335, race 7+4812
Aug 25 08:31:41 server kernel: Free swap = 4941840kB
Aug 25 08:31:41 server kernel: Total swap = 5245212kB
Aug 25 08:31:41 server kernel: Free swap: 4941840kB
Aug 25 08:31:41 server kernel: 1310720 pages of RAM
Aug 25 08:31:41 server kernel: 299836 reserved pages
Aug 25 08:31:41 server kernel: 5504 pages shared
Aug 25 08:31:41 server kernel: 543 pages swap cached
Aug 25 08:31:41 server kernel: Out of memory: Killed process 21471, UID 1502, (scanner).
Aug 25 08:31:41 server kernel: splunkd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
The only other activity on the box is log rotation by auditd. Could any guest OS settings cause Splunk to leak memory like this? Any ideas on how to resolve it would be greatly appreciated.
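In case it helps, these are the guest-level settings I can check; the "splunk" user name and the VMware balloon module are assumptions on my part:

# Overcommit policy and swappiness affect how the guest behaves under memory pressure
sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
# Resource limits for the account running splunkd (assumes the user is "splunk")
su - splunk -c 'ulimit -a'
# If the hypervisor is VMware, the balloon driver (vmmemctl) can reclaim guest memory
lsmod | egrep -i 'balloon|vmmemctl'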