My environment has a few forwarders and 4 indexers, with autoLBFrequency set to 45 seconds. The forwarders' outputs.conf is as follows:
autoLB = true
autoLBFrequency = 45
server = idx1:9997, idx2:9997, idx3:9997, idx4:9997
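For reference, these settings normally live inside a tcpout group stanza; a fuller outputs.conf might look like the sketch below (the group name "primary_indexers" is just an illustrative placeholder, and autoLB is already true by default in recent versions):

```ini
# outputs.conf on the forwarder (illustrative sketch)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1:9997,idx2:9997,idx3:9997,idx4:9997
autoLB = true
autoLBFrequency = 45
```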
However, during load testing, the forwarders do not seem to be load balancing: only one or two indexers show load most of the time, while the other two sit idle. This happens at random, meaning any two of the four indexers may show load or stay idle.
The setting forceTimebasedAutoLB can be used in those cases where UFs stick to a single indexer due to not reaching EOF. Note forceTimebasedAutoLB in outputs.conf:
forceTimebasedAutoLB = [true|false]
* Forces existing streams to switch to the newly elected indexer every AutoLB cycle.
* On universal forwarders, use the EVENT_BREAKER settings in props.conf rather than forceTimebasedAutoLB for improved load balancing, line breaking, and distribution of events.
* Defaults to false. It applies to all inputs, so change it only when you need it.
* Also note that it is a global setting and cannot be set at the input or stanza level.
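As an illustration of the EVENT_BREAKER alternative mentioned above, a props.conf on the UF might look like the sketch below. The sourcetype name is a placeholder; the regex shown is the common newline-based breaker, and your data may need a different pattern:

```ini
# props.conf on the universal forwarder (illustrative sketch)
[my_sourcetype]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)
```

With this in place, the UF can switch indexers at event boundaries instead of holding a stream open until EOF.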
Also check out these posts, via Gerald
We may have found the reason why the universal forwarder is not "load balancing" the data to the indexers equally:
1. Limitation of the universal forwarder (UF): when the UF reads data, it is not aware of event boundaries (i.e., the UF does not do parsing).
2. To prevent inconsistency in the data/events being indexed, the UF needs to read/forward a file from end to end, until it reaches EOF, disregarding the autoLBFrequency setting if needed.
3. The UF's default thruput is 256 KBps; anything greater than that gets throttled.
In the load test, the UF needs to read very large historical log files, which seems to hit all three points above. Please correct me if I am wrong. Thanks.
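On the throughput point: the 256 KBps default comes from limits.conf on the forwarder. If that limit is confirmed to be the bottleneck, it can be raised or removed, for example:

```ini
# limits.conf on the forwarder (illustrative sketch)
[thruput]
# 0 = unlimited; or set a specific cap, e.g. maxKBps = 512
maxKBps = 0
```

Raising this is a trade-off: it lets the UF catch up faster but increases the burst load on the network and indexers.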
These are great points - I can see how this could have a large effect when you are loading a lot of historical data. When Splunk has "caught up" and is indexing new events as they occur, I expect that these issues will go away naturally.
In the meantime, you could consider moving the historical data into a separate directory and using "upload" to bring it into Splunk over a longer period of time. This would lessen both the license impact and the performance impact.
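One way to do that one file at a time is the CLI oneshot input; the path, sourcetype, and index below are placeholders for your own values:

```
$SPLUNK_HOME/bin/splunk add oneshot /path/to/old_logs/file.log -sourcetype my_sourcetype -index main
```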
Here are the things that I would consider:
How active are the forwarders? How much data is being forwarded? It is possible that there is just not enough data flowing to keep the forwarders busy all the time (that would be pretty typical in a lot of environments). Are the indexers experiencing performance issues? If not, then maybe there isn't really a problem here.
Can you reach all the indexers from the forwarders? Is it always the same two that are idle? If a forwarder can't reach an indexer, it skips that indexer and moves to the next one in the rotation. So if an indexer is always skipped, it probably isn't communicating with the forwarders.
Finally, the forwarder does not look for the least busy indexer - it just rotates through the list. So it isn't truly load-balancing. And it could be that the forwarders are really hitting the same indexers at the same time... unlikely, but possible. As the number of forwarders increases, the load will probably "balance" better.
Try this search, which shows which forwarders are sending to which indexers and how much. Run it on the search head if you have one, or wherever you normally log in to run searches...
index=_internal source=*metrics.log group=tcpin_connections
| eval sourceHost=if(isnull(hostname), sourceHost, hostname)
| rename connectionType as connectType
| eval connectType=case(fwdType=="uf","univ fwder", fwdType=="lwf","lightwt fwder", fwdType=="full","heavy fwder", connectType=="cooked" or connectType=="cookedSSL","Splunk fwder", connectType=="raw" or connectType=="rawSSL","legacy fwder")
| eval version=if(isnull(version),"pre 4.2",version)
| rename version as Ver
| fields connectType sourceIp sourceHost destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server Ver
| eval Indexer=splunk_server
| eval Hour=relative_time(_time,"@h")
| stats avg(tcp_KBps) sum(tcp_eps) sum(tcp_Kprocessed) sum(kb) by Hour connectType sourceIp sourceHost destPort Indexer Ver
| fieldformat Hour=strftime(Hour,"%x %H")
Sort this output in different ways to see what is happening. The search summarizes the activity by hour.
You can also go to the various forwarders and enter the following command
$SPLUNK_HOME/bin/splunk list forward-server
to see which indexers the forwarder can contact.
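An unreachable indexer will show up under the inactive list. The exact wording varies by version, but the output is shaped roughly like this (hostnames here are from the example above):

```
Active forwards:
        idx1:9997
        idx2:9997
Configured but inactive forwards:
        idx3:9997
        idx4:9997
```

Any indexer that is consistently "configured but inactive" is the one to investigate for connectivity or listener problems.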