Deployment Architecture

DNS load balancing for deployment client traffic

Builder

I would like to be able to distribute universal forwarder deployment client traffic across my pool of deployment servers using DNS load balancing, without encountering checksum mismatches and splunkd restarts every time a universal forwarder interacts with a different deployment server in the pool.

It is my understanding that the crossServerChecksum = true configuration option was introduced in serverclass.conf in a recent version of Splunk to address this type of problem. Has anyone had success using crossServerChecksum, or any other combination of options, to avoid repeated checksum mismatches, app downloads, and splunkd restarts when DNS load balancing deployment clients? crossServerChecksum = true is not working for me, but perhaps I have implemented it incorrectly.

0 Karma

Splunk Employee

This configuration still works for us on version 8.0.1. Note that we do see handshake errors in some cases (not the same error as yours); this is almost always a firewall rule affecting a small set of hosts, and it typically works once the FW rule is updated. Can you confirm whether this is failing for ALL hosts contacting your DS, or whether it is isolated to a subset of hosts?

0 Karma

Explorer

It is failing for all my hosts, and the handshake can never complete.

I have switched to an F5 load balancer instead, and it works better: there is proper load balancing, and it also prevents clients from hitting a node that is down. Hope this helps anyone trying to implement multiple deployment servers with DNS round robin. I have only tested with Windows DNS, though; it might work with BIND (named). Otherwise, avoid using DNS round robin for a multi-deployment-server setup.

0 Karma

Explorer

I couldn't get it to work.

05-15-2020 22:49:01.666 +0800 WARN DC:PhonehomeThread - No response to handshake for too long; starting over.
05-15-2020 22:49:01.666 +0800 WARN DC:PhonehomeThread - No response to handshake for too long; starting over.

The UF can never establish a handshake. I'm using Windows DNS. Does anyone have any clue whether this method still works?

0 Karma

Esteemed Legend

As of Splunk 6.3 and later, there is an optional attribute, crossServerChecksum, in serverclass.conf.
It defaults to false (the old behavior) so that your upgrade to 6.3.x+ doesn't immediately resend all apps to all clients. When set to true, it uses a different checksum algorithm (one that does not include timestamps and the like) so that every DS computes the same checksum as the others and they can sit behind a load balancer. Also be sure to check out the little-known but super-powerful endpoint setting:
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Serverclassconf#endpoint
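For reference, a minimal sketch of how this would look in serverclass.conf (assuming the attribute is set at the [global] level, per the docs linked above; apply the same value on every DS in the pool):

```
# serverclass.conf on EVERY deployment server in the pool
[global]
# Timestamp-independent checksum so identical app content yields the
# same checksum on every DS (available as of Splunk 6.3)
crossServerChecksum = true
```

The key point is that the setting, and the app content itself, must be identical across all deployment servers, or clients will still see mismatches when they land on a different DS.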

0 Karma

Builder

When we upgraded to Splunk 6.3, which provided initial support for the feature, our implementation did not have the desired effect. With crossServerChecksum enabled across DNS-load-balanced deployment servers, clients still exhibited recurring checksum mismatches, app downloads, and deployment client restarts. I'm wide open to the idea that our implementation was flawed; the intent of my question was to determine whether anyone else has actually had success with the feature in 6.3 or later. Based on ejenson's input, as well as the myriad upgrades we have implemented since 6.3, a retry is certainly worth a shot!

I was unaware of the endpoint setting. Now I am curious whether that channel provides for secure communication and to what extent delays in phone home request responses are influenced by app download activity.

0 Karma

Esteemed Legend

The VAST majority of delays and scaling issues are due to the downloading. The crossServerChecksum ("xscs") setting must be in place from the very beginning or it will not work right.

0 Karma

Splunk Employee

We ended up getting this working using DNS load balancing and the setting crossServerChecksum = true. We had to use DNS load balancing to retain the originating host IP as well; otherwise, the ability to whitelist by IP in the DS wouldn't work. Additionally, we have upgraded a few times since this post was originally written.

0 Karma

Explorer

We have implemented a new design with DNS load balancing that we currently have issues with.
DNS is configured with an A record with 2 IPs defined, and that LB hostname is set in deploymentclient.conf on the UF.

However, as soon as we have more than 1 backend server active, the UF fails on the initial phonehome handshake. With Wireshark we can see that the traffic is split between our 2 IPs and the handshake never completes. It can run for days, trying.
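A minimal sketch of this kind of setup (hostnames and IPs are illustrative, not the actual values):

```
; DNS: one name, two A records (round robin)
splunkds.example.com.    IN A    10.0.0.11
splunkds.example.com.    IN A    10.0.0.12

# deploymentclient.conf on the UF
[target-broker:deploymentServer]
targetUri = splunkds.example.com:8089
```

With round robin, consecutive phonehome/handshake requests from the same UF can resolve to different backends, which appears to be where the handshake breaks down, as the log below shows.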

01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Changed state from=Initial to=Initial
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Attempting handshake
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - HttpClientPollingThread Woke up
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - Not waiting as we have '1' requests in queue
01-30-2020 13:16:32.828 +0100 INFO  HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.11.12.13_48089_WIN2019.oneadr.net_WIN2019_7FDF738A-330F-42E1-BB34-B1EBCD881E67
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - Will now wait for pollingInterval of 60.000 secs
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - channel=tenantService/handshake Success sending handshake to DS.
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - Changed state from=Initial to=HandshakeInProgress
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.000sec (1)
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.000sec (1)
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:56.832 +0100 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Changed state from=HandshakeInProgress to=Initial
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
01-30-2020 13:16:56.832 +0100 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Changed state from=Initial to=Initial
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Attempting handshake
01-30-2020 13:16:56.832 +0100 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
01-30-2020 13:16:56.832 +0100 DEBUG HttpPubSubConnection - HttpClientPollingThread Woke up
01-30-2020 13:16:56.832 +0100 DEBUG HttpPubSubConnection - Not waiting as we have '1' requests in queue

Deployment server running RHEL 7, Splunk 7.3.4
Deployment Client on Windows 2019, Splunk UF 7.3.3

@ejenson_splunk Have you experienced anything like this in your setup? I would much appreciate it if you could share details of your setup, as there seem to be some differences that I'm not able to find.

0 Karma

Explorer

I'm experiencing the same thing with DNS load balancing for the DS. Did you manage to fix it?

0 Karma

Explorer

We abandoned DNS load balancing and developed our own semi-dynamic deployment server allocation instead.

We added functionality to our UF install scripts that calculates a number between 1 and 8 from the host IP address (a modulus calculation on the integer representation of the IP address, plus 1).
This number is then used to set targetUri in deploymentclient.conf to e.g. splunkds01.company.com.

We then have 8 predefined entries in DNS for splunkds01-08.company.com. Based on the load on the Splunk deployment servers, we can change those DNS records to point to anywhere from 1 to 8 real deployment servers. We recently expanded capacity from 2 to 4 deployment servers in a specific network zone: deployed 2 new deployment servers with Ansible, updated the DNS A records for the predefined entries above, and it all worked like a charm.
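The allocation described above can be sketched as follows (a minimal sketch; the splunkdsNN.company.com naming and management port 8089 come from the post, while the function names are illustrative):

```python
import ipaddress

def ds_index(host_ip: str, pool_size: int = 8) -> int:
    """Modulus of the integer form of the IP, mapped into 1..pool_size."""
    return int(ipaddress.ip_address(host_ip)) % pool_size + 1

def target_uri(host_ip: str, port: int = 8089) -> str:
    """Value to write into targetUri in deploymentclient.conf."""
    return f"splunkds{ds_index(host_ip):02d}.company.com:{port}"

# Each host deterministically lands on one of the 8 DNS names
print(target_uri("10.11.12.13"))  # -> splunkds06.company.com:8089
```

Because the mapping is deterministic per IP, a given UF always phones home to the same DNS name, which sidesteps the round-robin handshake problem while still letting you rebalance by repointing the A records.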

0 Karma

Splunk Employee

We have set up an ELB in AWS; however, when the UF contacts the DS, the client's IP address and DNS name are replaced with the ELB's, so all of the hosts in forwarder management appear the same. Were you able to resolve this issue as well?

0 Karma

Engager

@ejenson Were you able to find a solution for this? I am experiencing the same problem.

0 Karma

Splunk Employee

No, no answer has been forthcoming. Have you had any luck?

0 Karma

Path Finder

Hi Ejenson, Did you ever get an answer to this question?

0 Karma

Splunk Employee

This should work; however, I haven't tested it with DNS load balancing, only with a load balancer, which is how this is typically used. A typical configuration would be an F5 LB with a VIP that points to 3 DSs, with IP rules routing to the closest one based on source IP. Theoretically it should work with DNS round robin, though, since the load balancers don't do anything special in regards to changing the packets/TTL/headers.

Are the apps and serverclasses the same on all your DSs? If there is a mismatch, it will trigger a reload.

Can you post the configs from your DSs?
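For anyone checking this, here is a small sketch for verifying that two deployment servers' deployment-apps trees are byte-identical. The paths are whatever local copies you stage from each DS; this is a generic directory compare, not a Splunk API:

```python
import filecmp
import os

def apps_identical(ds1: str, ds2: str) -> bool:
    """Recursively compare two deployment-apps trees byte-for-byte.
    Content drift between DSs is what triggers client checksum
    mismatches and re-downloads."""
    cmp = filecmp.dircmp(ds1, ds2)
    if cmp.left_only or cmp.right_only or cmp.funny_files:
        return False
    # dircmp's own file comparison is stat-based; re-check byte-for-byte
    _, mismatch, errors = filecmp.cmpfiles(
        ds1, ds2, cmp.common_files, shallow=False)
    if mismatch or errors:
        return False
    return all(apps_identical(os.path.join(ds1, d), os.path.join(ds2, d))
               for d in cmp.common_dirs)
```

Running this against copies of each DS's deployment-apps directory (and diffing serverclass.conf the same way) will catch the kind of mismatch that forces clients to reload.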

0 Karma