Getting Data In

What would cause forwarders to only use 25% of the IPs in my DNS list of indexers?

twinspop
Influencer

I have a DNS entry set up for my 12 indexers. Recently I noticed a large consumer was throwing my traffic balance out of whack. I did some checking in the metrics and saw that all of its servers (and some others to boot) were cycling through only 3 of the servers in the DNS list. Running host myindexers.private on the forwarders' systems returns the entire list, but they are sending to only 3 of the returned IPs.

I have 8000 forwarders. About 3% of them are impacted by this weirdness. I'm still trying to narrow down what the commonalities are. It is always exactly 3 out of the 12.

EDIT: with a few dozen different versions in play, interesting that only 6.6.0 - 6.6.3 is showing this. (I don't have anything newer than 6.6.3 in service.)

EDIT2: Confirmed, no other versions are experiencing the indexer drop. Only 6.6.1 and 6.6.3. Turns out I have no 6.6.0 nor 6.6.2.

EDIT3: There appears to be an issue with 6.6.1 and 6.6.3, at least, where the longer a forwarder runs, the fewer indexers it will even attempt to talk to. I can reliably reproduce the issue with fresh, very basic, minimal installs, and I cannot reproduce it on versions before 6.6.x. This is true whether I list all indexers in outputs.conf or use a DNS A-record list. Ticket open, diags exchanged, waiting...

0 Karma
1 Solution

twinspop
Influencer

Splunk 6.6.6 fixes this issue (edit: with no indication in release notes!). I can run clean installs with minimal, identical configs on 6.5.3, 6.6.3 and 6.6.6 and watch load balancing fail on larger clusters with the 6.6.3 version and work fine with the others. Entirely predictable. 6.6.3 eventually lands on just using 3 indexers over and over. No LB to the others.

This happens with SUF and full install.

If you're using 6.6.0-6.6.3 in an environment with lots of indexer targets, I'd strongly urge you to upgrade ASAP.

On the MC, I run this search to find the suspected baddies:

`dmc_get_forwarder_tcpin` | stats values(fwdType) as fwdType, values(sourceIp) as sourceIp, latest(version) as version, values(os) as os, values(arch) as arch, p90(tcp_KBps) as p90_tcp_kbps, dc(splunk_server) as Indexers by hostname | where Indexers<6

Like clockwork, the only versions with 3 indexers listed are 6.6.1-6.6.3 (I never deployed 6.6.0).

View solution in original post

0 Karma


woodcock
Esteemed Legend

There are 2 main reasons. Either:

  • The servers in question do not have the correct outputs.conf configuration (or they do, but the service has not been bounced to make it effective; check restartSplunkd on the DS), or
  • The servers in question cannot communicate with some of the indexers (missing route, wrong NIC, firewall blocking, etc.).
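For the second case, a quick per-indexer TCP reachability sweep run from a suspect forwarder can rule network problems in or out. A minimal sketch (the host list is a placeholder for your environment; 9997 is the conventional Splunk receiving port):

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timeout, and DNS failures
        return False

# Placeholder indexer names -- substitute your own.
indexers = ["idx1.private", "idx2.private"]
for host in indexers:
    print(host, "reachable" if can_reach(host, 9997) else "UNREACHABLE")
```

Any host flagged UNREACHABLE here points at a route/NIC/firewall issue rather than a forwarder bug.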

0 Karma

twinspop
Influencer

See answer above. (TL;DR: v6.6.6 fixes this issue. Not just my imagination! 🙂)

0 Karma

twinspop
Influencer

They have settings and network access identical to other servers:

  • other 6.6.3 servers that can talk to other indexers, albeit still only 25% of the cluster
  • other pre-6.6 servers that talk to all 12 without issue
  • running with debug on for TcpOutputProc, I can see that the impacted servers look up the entire list just fine. All indexers in the cluster are returned to them. They just never attempt to contact more than 25% of them.

0 Karma

twinspop
Influencer

I've created a new app for the busiest hosts to use, with all 12 indexers listed instead of the single DNS entry. They are load balancing fine with this config. Still trying to figure out why they're limited to 3 entries when using a DNS A-record list.
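For reference, the workaround just swaps the DNS name for an explicit server list in the app's outputs.conf. A sketch with placeholder hostnames (idx01, idx02, ... are stand-ins for your real indexers), reusing the same stanza settings quoted later in the thread:

```ini
[tcpout]
defaultGroup = prodDC1_indexers

[tcpout:prodDC1_indexers]
autoLBFrequency = 40
compressed = true
forceTimebasedAutoLB = true
# Explicit list instead of the round-robin DNS name (truncated for brevity)
server = idx01.private:9997, idx02.private:9997, idx03.private:9997
useACK = true
```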

0 Karma

somesoni2
Revered Legend

What does the btool output for outputs.conf look like on these forwarders?

/opt/splunk/bin/splunk btool outputs list --debug

In the _internal logs on those forwarders, do you see any errors for connections to the other 9 indexers?

0 Karma

twinspop
Influencer

btool output is as expected. Aside from defaults:

[tcpout]
defaultGroup = prodDC1_indexers

[tcpout:prodDC1_indexers]
autoLBFrequency = 40
compressed = true
forceTimebasedAutoLB = true
server = myindexers.private:9997
useACK = true

The internal logs have literally zero mention of any other IPs. It's as if the SUF is only seeing 3 IPs. The 3 selected vary from SUF to SUF.

0 Karma

MuS
SplunkTrust
SplunkTrust

If you run a Splunk bash session ($SPLUNK_HOME/bin/splunk cmd bash) and within this session do nslookup, dig, ping, and other DNS/network-related tests, do all 12 configured indexers ever show up in the DNS response?
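If dig or nslookup aren't handy inside that session, the OS resolver can be queried directly; a minimal sketch (the indexer alias below is a placeholder, so a sanity check against localhost is shown instead):

```python
import socket

def resolve_all(name):
    """Return the full, de-duplicated A-record list the OS resolver gives for name."""
    _, _, addrs = socket.gethostbyname_ex(name)
    return sorted(set(addrs))

# Substitute your indexer alias, e.g. resolve_all("myindexers.private");
# a healthy round-robin entry should return all 12 addresses.
print(resolve_all("localhost"))
```

If this returns all 12 IPs while the forwarder still sends to only 3, the problem is above the resolver.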

cheers, MuS

0 Karma

twinspop
Influencer

More interesting is that when running with tcpoutputproc logging set to debug, I can see that Splunk itself is getting all 12 hosts returned from the lookup. It just ignores 75% of them. In 6.6.1 and 6.6.3.

0 Karma

twinspop
Influencer

Done. Looks the same, which is to say, totally normal. All 12 hosts show up.

0 Karma