Hi everyone,
I'm deploying a Splunk environment with two universal forwarders and two indexers. I've got a primary indexer and a backup one. The backup will be manually brought up if the primary goes down, so I'll have only one indexer up at any time. Both forwarders are up at all times.
If the primary indexer fails, I'll need human intervention to launch the second one. During this time, my forwarders will still be sending logs (and a lot of them).
To avoid losing logs, I'm using a queue on the forwarders. This is what my outputs.conf (on both forwarders) looks like (141 is my primary indexer, 142 is the backup):
[tcpout]
defaultGroup = lb
[tcpout:lb]
server=192.168.100.141:9997, 192.168.100.142:9997
autoLB = true
maxQueueSize = 4GB
dropEventsOnQueueFull = 1
If my primary indexer crashes, here's what happens:
 - the queue grows on the forwarders,
 - I manually bring up the backup indexer,
 - when the backup indexer is up, the forwarders automatically send their logs to it (as per outputs.conf).
This works great, but I've noticed that sometimes, for no apparent reason, memory usage on the forwarder starts growing and never stops until it reaches the maximum RAM.
When this happens, the splunk process is killed because it's using too much memory.
Do you know what's wrong?
What do I have to do to correct this?
Please let me know if you need further details.
Regards, and thanks a lot in advance.
FYI : case created on Splunk Support : 87035.
 
Do the indexers share an NFS mount for the same buckets (filespace)? If not, then just keep both indexers up all the time, because all the backup gives you (if they do not share buckets) is availability of new events; past events are inaccessible until the other indexer is restored. This way you get better performance, and during a single-indexer outage you can still search post-outage events, which all go to the indexer that is still up.
 
I don't think this is necessarily a good strategy for your forwarders. Basically, what's happening is that your forwarders are throwing connection errors and timing out against the second Splunk indexer (because you keep it offline), and that causes each forwarder to grow its heap footprint because of its queuing mechanism.
Personally, I would keep a second set of outputs on my forwarders that simply points to the second indexer, and then manually switch the outputs when you bring up your second indexer. You should see much better performance and manageability of your forwarders after doing that. Moreover, you could use something like Tivoli to orchestrate this whole process for you.
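As a rough sketch of that "second set of outputs" idea, assuming the same two addresses from the question: both target groups live in one outputs.conf and only defaultGroup changes at failover. The group names primary and backup are hypothetical, chosen for illustration.

```
[tcpout]
# Switch this to "backup" by hand when the primary indexer is down
defaultGroup = primary

# Hypothetical group name; points only at the primary indexer
[tcpout:primary]
server = 192.168.100.141:9997

# Hypothetical group name; points only at the backup indexer
[tcpout:backup]
server = 192.168.100.142:9997
```

With this layout the forwarder only ever tries to connect to the indexer in the active group, so it never sits retrying against an indexer that is intentionally offline. After editing defaultGroup, the forwarder would normally need a restart for the change to take effect.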
Hi there, any thoughts on this issue ?
Thanks in advance.
Just uploaded the diag to the support case.
I changed my outputs.conf to use a single indexer. I'll have to manually edit outputs.conf in case of a failure. That's far from perfect, but if the load-balancing config is the problem, I think I have no other choice.
 
That's what I would do personally. If your forwarders still behave that way then you may actually have an issue with them.
Yes, I've created a case; its number is 87035.
I'll upload a diag soon, I haven't done it yet.
Do you suggest I change my outputs.conf?
 
I honestly can't answer the 'why' of the heap growth, but I can say that I've seen this behavior before on some of my larger implementations: when indexers go down, for whatever reason, forwarders tend to grow in memory because they expect every indexer in their list to be available.
I would also be more interested in what's really happening on the systems you're using as forwarders. Have you created a ticket for this and uploaded a diag?
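One thing that may be worth ruling out is the queue setting itself. As far as I understand the outputs.conf spec, a maxQueueSize given with a KB/MB/GB unit bounds the RAM used by the output queue, so a 4GB queue lets the forwarder hold up to roughly that much in memory whenever it cannot deliver events. A positive dropEventsOnQueueFull also means new events are dropped after waiting that many seconds once the queue is full. A sketch with a smaller bound, where 512MB is purely a hypothetical value to be sized against the forwarder's actual RAM:

```
[tcpout:lb]
server = 192.168.100.141:9997, 192.168.100.142:9997
autoLB = true
# In-memory output queue; 512MB is an assumed value, not a recommendation
maxQueueSize = 512MB
# Positive value: wait this many seconds on a full queue, then drop new events
dropEventsOnQueueFull = 1
```

This would not by itself explain growth while an indexer is reachable, so treat it as one possibility to check rather than a diagnosis.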
Thanks for your answer.
I understand that it's not a standard deployment. I thought of having two outputs.conf files and switching to the second one if indexer number 1 goes down. From my understanding, the autoLB feature was the way to avoid that.
So I gave it a go this way. What I don't get is that it works great for a while and then, for no apparent reason, memory usage starts growing. I mean, it is able to deal with a huge amount of events for several days, and then in the middle of the night, when there are not a lot of events, the problem shows up again.
Thoughts ?
