Getting Data In

Heavy forwarders are not auto load-balancing evenly

rjdargi
Explorer

I'm having a problem right now where I'm not seeing an even distribution across my indexers. I have 21 indexers (indexer04-indexer24) receiving data from six heavy forwarders.

My outputs.conf on my heavy forwarders looks like this:

[tcpout:myServerGroup]
autoLBFrequency=15
autoLB=true
disabled=false
forceTimebasedAutoLB=true
writeTimeout=30
maxConnectionsPerIndexer=20
server=indexer04:9996,indexer05:9996,indexer06:9996,<snip>,indexer24:9996

However, when I run a simple test search, for example

index=main earliest=-1h@h latest=now | stats count by splunk_server | sort count desc

The event count is massively disproportionate across the indexers: indexer13 has twice the events of the next busiest indexer, and the least busy indexers have only about a sixth of the events that indexer13 has. Likewise, our external hardware monitoring shows indexer13 carrying a heavier load.
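
For reference, here is a rough way to quantify the skew as a percentage share per indexer (a sketch assuming the default splunk_server field):

index=main earliest=-1h@h latest=now
| stats count by splunk_server
| eventstats sum(count) as total
| eval pct=round(100*count/total, 1)
| sort - count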

I stopped indexer13 temporarily and the other indexers picked up the slack, but as soon as I brought indexer13 back online it became the king of traffic again.

I've broken it down by heavy forwarder, and every single one of them sends more events to indexer13 as well. I'm at a loss; indexer04-indexer24 all share the same configuration, though indexer13-24 are beefier on the hardware side since they are newer builds.
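
One way to do that per-forwarder breakdown is from the indexers' incoming-connection metrics (a rough sketch, assuming the standard group=tcpin_connections events in metrics.log, where hostname is the sending forwarder and kb is the volume received):

index=_internal source=*metrics.log* group=tcpin_connections
| stats sum(kb) as kb_received by hostname, splunk_server
| sort - kb_received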

Are there any settings I'm perhaps missing to get this evenly distributed to my indexers?

1 Solution

rjdargi
Explorer

The issue here ended up being that we were running a version of the heavy forwarders with a bug: they'd regularly pick a single indexer preferentially over all others. We're still in the Splunk 5 world, so we moved forward a few releases and the problem was solved.

jofe
Explorer

The autoLB feature should balance pretty evenly when looked at over a longer timespan, so there is probably some other factor here.
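
For example, to compare the distribution over a longer window and see how it evolves (a sketch, assuming the default splunk_server field; adjust the span and window to taste):

index=main earliest=-24h@h latest=now
| timechart span=1h limit=0 count by splunk_server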

My question to you is: is there something about indexer13 that makes it capable of receiving more data in a shorter time than the others? Here are some suggestions.

Does indexer13 have faster network cards (10 Gbit vs. 1 Gbit), or trunking on its network cards? Something like that?
Are power-saving features disabled on that indexer?
Are there routing differences or different VLANs for the indexers with different load?
Is there packet loss on some of the connections to the indexers?
Is there queue blocking going on on some of the indexers receiving little data? (See the sketch below for one way to check.)

This could have many different causes, but is probably not related to the configuration on the heavy forwarders. 🙂
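
On the queue-blocking point, one quick way to look for blocked queues is the indexers' own metrics (a sketch, assuming the standard group=queue events in metrics.log):

index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name
| sort - count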
