Splunk Universal Forwarders attempting to communicate with indexer on wrong network

kenniskoldewyn
Explorer

We've installed and are evaluating Splunk Enterprise 6.0 in a Windows environment (desktops are running Windows 7 Professional x64, and servers are running Windows Server 2008 R2). We're very happy with Splunk so far, but have run into a puzzling problem.

We have two networks with non-overlapping subnets, network X on 10/8 and network Y on 172.30/16. The Splunk indexer is running on a server that has two NICs, one connected to the X network with static address 10.0.0.50, and one connected to the Y network with static address 172.30.0.50. So, as expected, the indexer receives data from forwarders on both networks.

However, each forwarder is attempting to communicate with the indexer on both networks, even though each forwarder is only connected to one network at a time. For example, a forwarder on a desktop on network X will try to communicate with the indexer using the indexer's network Y address 172.30.0.50; needless to say, that communication fails, but the desktop will also try to communicate with the indexer using the indexer's X address 10.0.0.50, which succeeds.

All of the forwarders eventually get their data to the indexer, but there are tens of thousands of packets that end up being dropped every day, and the forwarders' metrics.log files are full of lines like:

log/splunk/metrics.log.1:03-10-2014 15:04:37.230 -0400 INFO StatusMgr - destHost=[indexer server name], destIp=[IP on wrong network], destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
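
To put a number on the noise, the connect_fail lines can be tallied per destination address. The following is only a rough sketch: it assumes the key=value layout shown in the line above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches key=value pairs such as destIp=172.30.0.50 or eventType=connect_fail.
PAIR = re.compile(r"(\w+)=([^,\s]+)")


def count_connect_failures(lines):
    """Count connect_fail events per destination IP in metrics.log lines."""
    failures = Counter()
    for line in lines:
        fields = dict(PAIR.findall(line))
        if fields.get("eventType") == "connect_fail":
            failures[fields.get("destIp", "unknown")] += 1
    return failures
```

Feeding it the lines of a forwarder's metrics.log (for example, `count_connect_failures(open(path))`) gives a per-address count that can be compared before and after any configuration change.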

Suspecting that the problem was with our local DNS servers (running on our domain controllers, one of which is the same server that the indexer is on), we checked and confirmed that we had round robin disabled and netmask ordering enabled. So when we nslookup the indexer server on any machine on network X, we receive the answers 10.0.0.50 and 172.30.0.50 (with the correct network first), and when we nslookup the indexer server on any machine on network Y, we receive the answers 172.30.0.50 and 10.0.0.50 (in the opposite order, still with the correct network first). So it appears that our DNS servers are working correctly.
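
The effect of netmask ordering described above can be modeled in a few lines. This is an illustrative sketch only, not how the DNS server implements it; the function name is hypothetical. Note that ordering changes only the sequence of the answers: both addresses are still handed to the client.

```python
import ipaddress


def order_by_local_subnet(addresses, local_network):
    """Return addresses with those inside local_network listed first."""
    net = ipaddress.ip_network(local_network)
    # sorted() is stable, so addresses within each group keep their order.
    return sorted(addresses, key=lambda a: ipaddress.ip_address(a) not in net)


answers = ["10.0.0.50", "172.30.0.50"]
print(order_by_local_subnet(answers, "10.0.0.0/8"))     # client on network X
print(order_by_local_subnet(answers, "172.30.0.0/16"))  # client on network Y
```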

We do not have a deployment server. When we installed the forwarders initially, we specified the plain name of the indexer server, but uninstalling a forwarder, reinstalling it and specifying the fully qualified name of the indexer (indexer.domain.local) doesn't help. For many desktops we can't specify the indexer's IP address when installing the forwarder, because those desktops can switch between the two networks using a Black Box 2-to-1 CAT6 manual switch.

This may not sound like a serious problem, just a waste of network bandwidth, since the forwarders eventually get their data to the indexer, but it's an issue for us, because our firewall is dropping and logging the misdirected packets, and they're adding a lot of noise to the security monitoring we're doing on the firewall. Does anyone have any ideas?

1 Solution

lguinn2
Legend

The problem is that the forwarders treat every address returned by the DNS lookup as a target and load-balance across ALL of them. This behavior is described in the Splunk forwarding documentation; see the subsection on "DNS list target". I don't know how to turn off this behavior explicitly.

However, you could set these "backoff" settings in outputs.conf - I have not used these, but it seems that they could dramatically reduce (but not eliminate) the excess traffic.

maxFailuresPerInterval = 1
# Specifies the maximum number of failures allowed per interval before backoff takes place. Defaults to 2.

secsInFailureInterval = 10
# If the number of write failures exceeds maxFailuresPerInterval in the specified secsInFailureInterval seconds, the forwarder applies backoff. The backoff time period range is 1-10 * autoLBFrequency. Defaults to 1.

backoffOnFailure = 600
# Number of seconds a forwarder will wait before making another connection attempt. Defaults to 30.

I would probably open a ticket with Splunk Support to see if there is some way to turn off the setting.

As a final thought, here is a work-around (admittedly a bit clunky) that might also help. It requires two copies of the outputs.conf file: one set to forward only to 10.0.0.50 and the other set to forward only to 172.30.0.50. It also requires a script that runs on a regular schedule (via cron or Splunk or whatever mechanism) and does the following:

  1. Checks to see if the address in outputs.conf is reachable (testing a TCP connection to the receiving port would be best, since ping only tests ICMP)
  2. If not reachable, the script copies the appropriate version of outputs.conf to the proper directory in the forwarder
  3. Restarts the forwarder (only if the outputs.conf has changed in step 2)

A forwarder restart will not lose data, as the forwarder remembers its current location in the files. There might be short periods where the forwarder attempts to connect to the "wrong" indexer, but scheduling the script to run more often should minimize this.
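
The steps above might be sketched roughly as follows. Everything specific here is an assumption for illustration: the install path, the names of the two pre-built outputs.conf variants, and the choice to probe both candidate addresses on port 9997 rather than only the one currently configured. Treat it as a starting point, not a tested deployment script.

```python
import filecmp
import shutil
import socket
import subprocess
from pathlib import Path

# Hypothetical locations; adjust to the actual forwarder install.
SPLUNK_HOME = Path(r"C:\Program Files\SplunkUniversalForwarder")
ACTIVE_CONF = SPLUNK_HOME / "etc" / "system" / "local" / "outputs.conf"
# Pre-built variants, one per network (file names are made up for this sketch).
CANDIDATES = {
    "10.0.0.50": SPLUNK_HOME / "etc" / "outputs.conf.net10",
    "172.30.0.50": SPLUNK_HOME / "etc" / "outputs.conf.net172",
}
RECEIVING_PORT = 9997


def is_reachable(host, port=RECEIVING_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_target(hosts, probe=is_reachable):
    """Return the first host the probe can reach, or None if none respond."""
    for host in hosts:
        if probe(host):
            return host
    return None


def install_conf(source, active):
    """Copy source over active if they differ; return True when a copy was made."""
    source, active = Path(source), Path(active)
    if active.exists() and filecmp.cmp(source, active, shallow=False):
        return False
    shutil.copyfile(source, active)
    return True


def main():
    target = choose_target(CANDIDATES)
    if target is None:
        return  # Neither address answers; leave the current configuration alone.
    if install_conf(CANDIDATES[target], ACTIVE_CONF):
        # Restart only when outputs.conf actually changed (step 3).
        splunk = SPLUNK_HOME / "bin" / "splunk.exe"
        subprocess.run([str(splunk), "restart"], check=True)


if __name__ == "__main__":
    main()
```

Comparing file contents before copying keeps the script idempotent, so running it every few minutes from a scheduler would only trigger a restart when the desktop has actually been switched to the other network.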

Solving this problem should also make the forwarders run better and provide a faster stream of data to the indexer.

barakreeves
Splunk Employee

For a dual-NIC environment on a forwarder, try setting SPLUNK_BINDIP in the $SPLUNK_HOME/etc/splunk-launch.conf file to a specific local IP address. Here is the supporting doc for this: http://docs.splunk.com/Documentation/Splunk/6.0.2/Admin/BindSplunktoanIP

Please let us know if this works or not.

--Barak

kenniskoldewyn
Explorer

I'm afraid you've misunderstood the question. Each forwarder has only one NIC. The receiver has two NICs, one on a 10/8 network and one on a 172.30/16 network. We don't want to bind the receiving splunkd to only one IP; it must receive from forwarders on both networks. The question is: how do we prevent forwarders (each of which is connected to only one network at a time) from attempting to send data out on both networks simultaneously? (Some forwarders can be manually switched from one network to the other, but they're still only connected to one network at a time.)

Jason
Motivator

If you were interested in going with Deployment Server, you would only need to duplicate the apps necessary to make connections: the app that connects to Deployment Server in the first place, and the forwarders' outputs app. (Future installs could be automated by having two custom install packages, one each seeded with a yourco_all_deploymentXX app.)

Basic configs, ignoring SSL, compression, custom timeouts, etc., are below:

== yourco_all_deploymentclient10/local/deploymentclient.conf ==
[deployment-client]

[target-broker:deploymentServer]
targetUri = 10.0.0.50:8089

== yourco_all_deploymentclient172/local/deploymentclient.conf ==
[deployment-client]

[target-broker:deploymentServer]
targetUri = 172.30.0.50:8089

== yourco_all_forwarder_outputs10/local/outputs.conf ==
[tcpout]
defaultGroup = indexers_via_10

[tcpout:indexers_via_10]
server = 10.0.0.50:9997

== yourco_all_forwarder_outputs172/local/outputs.conf ==
[tcpout]
defaultGroup = indexers_via_172

[tcpout:indexers_via_172]
server = 172.30.0.50:9997

== serverclass.conf ==
[serverClass:all]
# Everyone connect back to Deployment Server
whitelist.0 = *

[serverClass:all:app:yourco_all_deploymentclient10]
whitelist.0 = 10.*
blacklist.0 = 172.*
[serverClass:all:app:yourco_all_deploymentclient172]
whitelist.0 = 172.*
blacklist.0 = 10.*

[serverClass:forwarders]
whitelist.0 = *
blacklist.0 = splunk_infrastructure_etc

[serverClass:forwarders:app:yourco_all_forwarder_outputs10]
whitelist.0 = 10.*
blacklist.0 = 172.*
[serverClass:forwarders:app:yourco_all_forwarder_outputs172]
whitelist.0 = 172.*
blacklist.0 = 10.*

kenniskoldewyn
Explorer

How does this help with a forwarder that has to switch between 10/8 and 172.30/16 networks?

kenniskoldewyn
Explorer

Excellent answer—identifies the precise reason for the problem with a link to the documentation, and provides two ideas for mitigation. I'm not terribly happy with either one for the reasons you state (the backoff settings don't eliminate the problem, and the shuffling of outputs.conf files and restarting of the forwarder is awfully kludgy), but I appreciate your thoroughness. It wouldn't surprise me to find out from Splunk Support that there isn't a way to turn off load balancing, so I'll probably end up running the BIND DNS server (which supports DNS "views") instead of Microsoft's. Thanks!
