Hi,
We’re currently facing a load imbalance issue in our Splunk deployment and would appreciate any advice or best practices.
Current Setup:
Universal Forwarders (UFs) → Heavy Forwarders (HFs) → Cribl
We originally had 8 HFs handling parsing and forwarding.
Recently, we added 6 new HFs (total of 14 HFs) to help distribute the load more evenly and to offload congested older HFs.
All HFs are included in the UFs’ outputs.conf under the same TCP output group.
Issue:
We’re seeing that some of the original 8 HFs are still showing blocked=true in metrics.log (splunktcpin queue full), while the newly added HFs have little to no traffic.
It looks like the load is not being evenly distributed across the available HFs.
Here's our current outputs.conf deployed in UFs:
[tcpout]
defaultGroup = HF_Group
forwardedindex.2.whitelist = (_audit|_introspection|_internal)
[tcpout:HF_Group]
server = HF1:9997,HF2:9997,...HF14:9997
We have not set autoLBFrequency yet.
Questions:
Do we need to set autoLBFrequency in order to achieve true active load balancing across all 14 HFs, even when none of them are failing?
If we set autoLBFrequency = 30, are there any potential downsides (e.g., performance impact, TCP session churn)?
Are there better or recommended approaches to ensure even distribution of UF traffic across multiple HFs before forwarding to Cribl?
Please note that we are sending a large volume of data, primarily Windows event logs (wineventlog).
Your help is very much appreciated. Thank you
Yup. The so-called asynchronous forwarding or asynchronous load balancing helps greatly in reducing imbalance in data distribution. Without it, when using plain time-based LB, a forwarder sends to one destination for a specified period of time, then switches to another, then to another; at any given point in time it only sends to one output (unless you're using multiple ingestion pipelines, in which case you will have multiples of this setup).
And, adding to the point about pipelines - since you have a separate HF layer, you might want to try increasing your pipeline count if you have spare resources (mostly CPU) on your HFs. You'll need to adjust your load-balancing parameters accordingly.
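For reference, the pipeline count lives in server.conf on the forwarder, not outputs.conf. A minimal sketch, assuming the HFs (or UFs) have spare CPU for a second pipeline set - treat the value as illustrative and size it to your hardware:

[general]
# Each additional pipeline is a full, independent ingestion/forwarding chain,
# so each pipeline load-balances and opens its own outbound connections.
parallelIngestionPipelines = 2

A restart of the forwarder is needed for the change to take effect.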
Yes, you could set autoLBFrequency to achieve active load balancing from your UFs across all 14 Heavy Forwarders, for example:
[tcpout:HF_Group]
server = HF1:9997,HF2:9997,...HF14:9997
autoLBFrequency = 30
The other option is to use a volume-based LB configuration - it's worth checking out https://help.splunk.com/en/splunk-enterprise/forward-and-process-data/forwarding-and-receiving-data/... to see which would be more appropriate for your use case.
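As a hedged sketch of what the volume-based variant could look like on the UFs (the byte threshold below is purely illustrative, not a recommendation):

[tcpout:HF_Group]
server = HF1:9997,HF2:9997,...HF14:9997
# Switch to another HF after roughly this many bytes (10 MB here);
# autoLBFrequency still acts as the time-based fallback if the volume isn't reached.
autoLBVolume = 10485760
autoLBFrequency = 30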
The main potential downside of autoLBFrequency is TCP connection churn: new connections are created every 30 seconds, so there could be a *slight* performance overhead from connection establishment, but I wouldn't expect this to be too noticeable.
Check out https://community.splunk.com/t5/Getting-Data-In/Universal-Forwarder-not-load-balancing-to-indexers/m... which might also help.
The other thing to consider is an increased number of pipelines - but again, it's worth understanding the implications of this and considering your available processing resources on the UFs/HFs. Are you currently using the default of 1? See https://docs.splunk.com/Documentation/Splunk/latest/Admin/Serverconf#Remote_applications_configurati... for more info.
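If you want to confirm what a given UF/HF is actually running with, a quick btool check should tell you (path assumes a standard Linux install; if the setting doesn't appear, you're on the default of 1):

# Run on the forwarder in question
$SPLUNK_HOME/bin/splunk btool server list general --debug | grep -i parallelIngestionPipelines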
Finally - what is the datasource into your UFs? Sometimes sources like syslog can make it tricky to LB effectively.