Getting Data In

Why are forwarders not load balancing properly between 2 indexers and how to make configuration changes to fix this?

hlarimer
Communicator

I have 2 indexers set up and have the forwarders pointed at both of them in outputs.conf. It was my understanding that the forwarders would switch back and forth between the indexers at a regular interval and balance the amount of data being sent to them. Instead, SOS shows that one indexer has been getting most of the traffic over the last 6 weeks.

Is there a setting needed to balance these?
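For reference, this is roughly what my two-indexer outputs.conf looks like (hostnames and port are placeholders, not my real config):

```ini
# outputs.conf on each forwarder
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# Listing both indexers in one target group enables automatic
# load balancing between them; hostnames here are hypothetical.
server = indexer1.example.com:9997, indexer2.example.com:9997
```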

0 Karma

MuS
SplunkTrust
SplunkTrust

Hi hlarimer,

Universal Forwarders have no concept of an event; they just see a stream of data, and it's a heavy forwarder or an indexer that does the event breaking. To avoid sending part of an event to one indexer and the rest to another, when a Universal Forwarder sees a log file being updated, it tries to read to the end of the file (and then waits three seconds for more data) and sends all of that to the same Splunk server.
Under a very heavy load this can prevent a Universal Forwarder from properly load balancing, and a similar effect can occur if you're indexing old logs. In the case of old logs it's probably best just to let it happen, but if it's a matter of the velocity of the logs, you can use autoLBFrequency in outputs.conf to force it to change indexers, at the cost of increasing the chances of an event getting split.
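As a sketch (server names are placeholders), forcing a more frequent switch would look like:

```ini
[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
# Pick a new indexer every 10 seconds instead of the 30-second default.
# Smaller values balance the load better but increase the chance of
# splitting an event across indexers.
autoLBFrequency = 10
```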

Indexer selection is random, not a shuffle. So if a forwarder is configured to send data to only two servers, there's a 50% chance it will pick the same server again on each switch. For small data sets you would therefore expect to see some indexer affinity.
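To illustrate why random selection leaves affinity on small samples, here's a quick standalone simulation (not Splunk code; just uniform random picks between two targets):

```python
import random

def simulate(switches, indexers=2, seed=0):
    """Simulate random (not round-robin) indexer selection:
    each switch picks a target uniformly at random, so short runs
    can easily favor one indexer."""
    rng = random.Random(seed)
    counts = [0] * indexers
    for _ in range(switches):
        counts[rng.randrange(indexers)] += 1
    return counts

# Small sample: the split is often lopsided.
print(simulate(10))
# Large sample: it converges toward 50/50.
print(simulate(10000))
```

Over six weeks a busy forwarder makes many switches, so a persistent skew usually points at long streams pinning the connection rather than the randomness itself.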

hope this helps ...

cheers, MuS

hlarimer
Communicator

I went to look up the recommended settings for autoLBFrequency and ran across these posts:
http://blogs.splunk.com/2014/03/18/time-based-load-balancing/
http://blogs.splunk.com/2014/03/26/time-based-load-balancing-part-2/

They recommend using forceTimeBasedAutoLB = true to fix this problem, and the second link includes test runs showing that it doesn't create any problems with event splitting.

Any thoughts on this method?
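For anyone reading later, the setting from those posts goes in outputs.conf alongside the target group; note that the outputs.conf spec spells the attribute forceTimebasedAutoLB (lowercase "b"). A sketch with placeholder hostnames:

```ini
[tcpout:primary_indexers]
# Hypothetical indexer hostnames.
server = indexer1.example.com:9997, indexer2.example.com:9997
# Force a switch every autoLBFrequency seconds, even mid-stream.
# Per the blog posts above, partial copies are discarded on the
# receiving side, so events are not duplicated.
forceTimebasedAutoLB = true
autoLBFrequency = 30
```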

0 Karma

Lucas_K
Motivator

Using the forceTime setting you can get better load-balanced, non-duplicate events spread across indexers. They won't be 50:50, but should be at least 70:30 at worst, depending on your time setting (autoLBFrequency=xx). The tighter the setting, the more balanced it becomes. This setting is HIGHLY suggested if you have any large bursts of batch inputs.

We had it set to 1 second for around 18 months (v4-v5). In v6, setting it to 1 second resulted in large indexing issues: Splunk was unable to read from the tcpin buffer fast enough. The default setting is 30 seconds.
Since we have aggregating forwarders, I find 30 far too high and have it set to either 5 or 10, which seems to be a nice compromise.

A forwarder will not switch to another indexer midstream unless this is set. This is why you can end up with a large amount of data on one indexer for a particular stretch of time: the forwarder waits until that stream finishes before switching.

I'm not 100% sure of the event-splitting process, but it does seem to handle it. I have a support email somewhere that explains how it works: something about the forwarder resending the start of the stream/event block when it opens a new connection (if forceTime is used). The previous indexer will not index the original broken stream/block because it is incomplete, while the second indexer, having received the entire stream/block, indexes it as complete. The end result is non-duplicate events.

hlarimer
Communicator

We have a large amount of firewall data coming through a Universal Forwarder, which is what most needs balancing (although all forwarders in this region are going to be load balanced as well). It seems to be balancing well now that I'm using forceTimeBasedAutoLB=true and leaving autoLBFrequency alone (defaults to 30 seconds?).

Do you suggest setting that to a smaller amount of time? We have some pretty hefty hardware behind each of the indexers (16 cores, 32 GB RAM, striped RAID with SSDs for hot storage). We mainly need to balance so we can hit 18-month retention on our data, and we need our storage to fill up at as close to the same speed as possible.

0 Karma

MuS
SplunkTrust
SplunkTrust

Sure, go ahead and use this setting; you can always trust @kabains