
Universal Forwarder not load balancing to indexers

Splunk Employee

My environment has a few forwarders and 4 indexers, with autoLBFrequency set to 45 seconds. The forwarders' outputs.conf is as follows:

[tcpout:indexCluster]
autoLB = true
autoLBFrequency = 45
server = idx1:9997, idx2:9997, idx3:9997, idx4:9997

However, during load testing, the forwarders do not seem to be load balancing: most of the time only one or two indexers show load, while the rest seem to be idle. This happens at random, meaning any of the four indexers can show load or stay idle.

Splunk Employee

forceTimebasedAutoLB = [true|false]
* Forces existing streams to switch to newly elected indexer every
AutoLB cycle.
* On universal forwarders, use the EVENT_BREAKER_ENABLE and
EVENT_BREAKER settings in props.conf rather than forceTimebasedAutoLB
for improved load balancing, line breaking, and distribution of events.
* Defaults to false.
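
As an illustration of the props.conf alternative mentioned in the spec text above, here is a minimal sketch for a universal forwarder. The sourcetype name my_sourcetype is hypothetical, and the regex assumes single-line events separated by newlines; multi-line events would need a pattern that matches their actual boundaries. These settings are only honored by forwarder versions that support them.

```ini
# props.conf on the universal forwarder (sketch; stanza name is hypothetical)
[my_sourcetype]
# Let the UF recognize event boundaries so it can switch indexers safely
EVENT_BREAKER_ENABLE = true
# Assumed boundary: one or more newline characters between events
EVENT_BREAKER = ([\r\n]+)
```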


Splunk Employee

The setting forceTimebasedAutoLB can be used in those cases where UFs will stick to a single indexer due to not reaching EOF.
Note forceTimebasedAutoLB in outputs.conf:


forceTimebasedAutoLB = [true|false]
* Will force existing streams to switch to newly elected indexer every AutoLB cycle.
* Defaults to false, applies to all inputs, change it only when you need it.
* Also note that it is a global setting and cannot be set on an input or stanza level.

Setting it to true forces the UF to switch indexers at every autoLB interval. The way it's implemented guarantees event integrity between the current indexer and the next, but only if events do not span multiple chunks.
The default chunk size is 64 KB, so this will work pretty well for most users, minus a few edge cases (via Ditran).
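
Applied to the configuration from the original question, this might look like the sketch below. It assumes the setting is placed in the target group stanza; it then affects every input routed through that group.

```ini
# outputs.conf on the forwarder (sketch based on the stanza in the question)
[tcpout:indexCluster]
autoLB = true
autoLBFrequency = 45
# Switch to a newly selected indexer every autoLBFrequency seconds,
# even if the current file has not reached EOF
forceTimebasedAutoLB = true
server = idx1:9997, idx2:9997, idx3:9997, idx4:9997
```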

Also check out these posts, via Gerald

http://blogs.splunk.com/2014/03/18/time-based-load-balancing/

http://blogs.splunk.com/2014/03/26/time-based-load-balancing-part-2/

Splunk Employee

We may have found the reason as to why the universal forwarder is not "load balancing" the data to the indexers equally:

  1. Limitation of the universal forwarder (UF): when the UF reads data, it is not aware of event boundaries (i.e., the UF does not parse events).

  2. To prevent inconsistency in the events being indexed, the UF needs to read and forward a file from end to end, until it reaches EOF, disregarding the autoLBFrequency setting if needed.

  3. The UF's default thruput limit is 256 KBps. Anything greater than that gets throttled.

In the load test, it seems like the UF needs to read very large historical log files, which seem to hit the above 3 points mentioned. Please correct me if I am wrong. Thanks.
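
The throttle in point 3 is controlled by maxKBps in the [thruput] stanza of limits.conf on the forwarder. A minimal sketch of raising it follows; the value 1024 is purely illustrative and assumes the network and indexers can absorb the extra volume.

```ini
# limits.conf on the universal forwarder (sketch; value is illustrative)
[thruput]
# UF default is 256; a value of 0 removes the limit entirely
maxKBps = 1024
```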

Legend

These are great points - I can see how this could have a large effect when you are loading a lot of historical data. When Splunk has "caught up" and is indexing new events as they occur, I expect that these issues will go away naturally.

In the meantime, you could consider moving the historical data into a separate directory and using "upload" to bring it into Splunk over a longer period of time. This would lessen both the license impact and the performance impact.


Legend

Here are the things that I would consider:

How active are the forwarders? How much data is being forwarded? It is possible that there is just not enough data flowing to keep the forwarders busy all the time (that would be pretty typical in a lot of environments). Are the indexers experiencing performance issues? If not, then maybe there isn't really a problem here.

Can you reach all the indexers from the forwarders? Is it always the same two that are idle? If a forwarder can't reach an indexer, it skips that indexer and moves to the next one in the rotation. So if an indexer is always skipped, it probably isn't communicating with the forwarders.

Finally, the forwarder does not look for the least busy indexer - it just rotates through the list. So it isn't truly load-balancing. And it could be that the forwarders are really hitting the same indexers at the same time... unlikely, but possible. As the number of forwarders increases, the load will probably "balance" better.

Try this search, which shows which forwarders are sending to which indexers, and how much. Run it on the search head if you have one, or wherever you normally log in to run searches:

index=_internal source=*metrics.log group=tcpin_connections 
| eval sourceHost=if(isnull(hostname), sourceHost,hostname) 
| rename connectionType as connectType
| eval connectType=case(fwdType=="uf","univ fwder", fwdType=="lwf", "lightwt fwder",fwdType=="full", "heavy fwder", connectType=="cooked" or connectType=="cookedSSL","Splunk fwder", connectType=="raw" or connectType=="rawSSL","legacy fwder")
| eval version=if(isnull(version),"pre 4.2",version)
| rename version as Ver 
| fields connectType sourceIp sourceHost destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server Ver
| eval Indexer= splunk_server
| eval Hour=relative_time(_time,"@h")
| stats avg(tcp_KBps) sum(tcp_eps) sum(tcp_Kprocessed) sum(kb) by Hour connectType sourceIp sourceHost destPort Indexer Ver
| fieldformat Hour=strftime(Hour,"%x %H")

Sort this output in different ways to see what is happening. The search summarizes the activity by hour.
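
For a quicker view of how evenly data lands across the indexers, a simpler variant can chart received volume per indexer over time. This is only a sketch that reuses the same metrics.log fields (kb and splunk_server) as the search above; the one-minute span is an arbitrary choice, and should be adjusted to your test window.

```spl
index=_internal source=*metrics.log group=tcpin_connections
| timechart span=1m sum(kb) by splunk_server
```

If load balancing is working, each indexer's series should pick up traffic in turn rather than one or two series carrying everything.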

You can also run the following command on each forwarder to see which indexers it can contact:

$SPLUNK_HOME/bin/splunk list forward-server