I'm having a problem where events are not being distributed evenly across my indexers. I have 21 indexers (indexer04-indexer24) receiving data from six heavy forwarders.
My outputs.conf on my heavy forwarders looks like this:
[tcpout:myServerGroup]
autoLBFrequency=15
autoLB=true
disabled=false
forceTimebasedAutoLB=true
writeTimeout=30
maxConnectionsPerIndexer=20
server=indexer04:9996,indexer05:9996,indexer05:9996,<snip>,indexer24:9996
However, when I run a simple test search such as:
index=main earliest=-1h@h latest=now() | stats count by splunk_indexer | sort count desc
the event count is massively disproportionate across the indexers. indexer13 has twice as many events as the next busiest indexer, and the least busy indexers have only a sixth of the events that indexer13 has. Our external hardware monitoring likewise shows indexer13 carrying a heavier load.
I stopped indexer13 temporarily and the other indexers picked up the slack, but as soon as I started it again it went back to being the king of traffic.
I've broken it down by heavy forwarder, and every single one of them sends more events to indexer13 as well. I'm at a loss: indexer04-indexer24 all share the same configuration, though indexer13-24 have beefier hardware as they are newer builds.
Are there any settings I'm perhaps missing to get this evenly distributed to my indexers?
The issue here ended up being that we were running a version of the heavy forwarders that had a bug -- they'd regularly pick a single indexer preferentially over all others. We're still in the Splunk 5 world, so we upgraded the forwarders a few releases and the problem was solved.
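If you want to confirm which version each forwarder is actually running before and after an upgrade, a search along these lines may help. It is only a sketch: it assumes the indexers' _internal index is searchable and that the tcpin_connections lines in metrics.log carry the version and hostname fields, which can vary by release:
index=_internal source=*metrics.log group=tcpin_connections earliest=-1h | stats latest(version) as version by hostname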
The autoLB feature should work pretty well when viewed over a longer timespan, so there is probably some other factor here.
My question to you is: is there something about indexer13 that makes it capable of receiving more data in a shorter time than the others? Here are some suggestions.
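To check the distribution over a longer window, something like the following could be used. Treat it as a rough sketch: the 24-hour range is an arbitrary choice, and splunk_server is Splunk's default field for the indexer that returned each event, so substitute your own field if you split by a different one:
index=main earliest=-24h@h latest=@h | stats count by splunk_server | eventstats sum(count) as total | eval pct=round(100*count/total, 1) | sort - pct
With 21 indexers, a perfectly even spread would put each one at roughly 4.8 percent.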
Does indexer13 have faster network cards (10 Gbit vs. 1 Gbit) or NIC trunking? Something like that?
Are power-saving features disabled on that indexer but not on the others?
Are there routing differences or different VLANs for the indexers with different load?
Is there packet loss on some of the connections to the indexers?
Is there queue blocking going on at some of the indexers receiving little data?
This could have many different causes, but is probably not related to the configuration on the heavy forwarders. 🙂
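For the queue-blocking question, one way to look is to search the indexers' own metrics.log for blocked queues. This is a sketch under assumptions: it presumes _internal is searchable from your search head, and the exact spelling of the blocked field in metrics.log may differ between Splunk versions:
index=_internal source=*metrics.log group=queue blocked=true earliest=-4h | stats count by host, name | sort - count
If the lightly loaded indexers show up here far more often than indexer13, the forwarders may simply be moving on from them because they cannot accept data fast enough.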