How to troubleshoot why a deployment client is unable to phone home to the deployment server?

New Member

We are unable to get the deployment client to show in the deployment console. Other Windows/Linux servers are connected and apps are being distributed fine.

Deployment Client:

  • Windows 2012 x64
  • Splunk version 6.2.4

Deployment server:

  • OEL 6 x64
  • Splunk version 6.2.0

We have validated that the client can telnet to the deployment server on the correct port. We could see the TCP transaction on both sides, and we enabled debug logging on both the client and the deployment server. The deployment server has no entry regarding the client.

Client splunkd.log:

08-12-2015 16:33:03.791 -0700 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
08-12-2015 16:33:03.791 -0700 DEBUG DC:PhonehomeThread - Attempting handshake
08-12-2015 16:33:03.791 -0700 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
08-12-2015 16:33:03.791 -0700 INFO  DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
08-12-2015 16:33:03.791 -0700 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.0sec
08-12-2015 16:33:03.791 -0700 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.0sec (1)

Path Finder

Another possible solution to this issue:
After upgrading our Splunk deployment manager from 6.4.1 to 7.3.2, I noticed this error on many, but not all, of our forwarders.
All the network connectivity checks seemed to be okay, and nothing had been changed in the serverclass.conf file. Instead, it turned out to be an SSL issue detailed in this bug: SPL-141961 "Older 6.0, 6.1, 6.2, 6.3 maintenance release versions unable to connect to 6.6.x and later via management port"

The true fix is to upgrade all the forwarders. But the quick solution is to change the SSL config on the deployment server in the server.conf file by adding the following to the 'sslConfig' stanza:

sslVersions = *,-ssl2
sslVersionsForClient = *,-ssl2
cipherSuite = TLSv1+HIGH:TLSv1.2+HIGH:@STRENGTH

Then restart the deployment server and you should find your missing forwarders are now able to talk to it again.
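
To confirm the three settings are actually present before restarting, a short script can compare a server.conf against the values quoted above. This is a minimal sketch, not a Splunk tool: the function name ssl_config_diff is made up for illustration, and it assumes the file is simple enough for Python's configparser (every setting sits under a stanza header, no line continuations). It reads a single file and does not merge default and local layers the way Splunk does.

```python
import configparser

# Settings the post above recommends adding to [sslConfig] on the deployment server.
RECOMMENDED = {
    "sslVersions": "*,-ssl2",
    "sslVersionsForClient": "*,-ssl2",
    "cipherSuite": "TLSv1+HIGH:TLSv1.2+HIGH:@STRENGTH",
}


def ssl_config_diff(conf_text):
    """Return {setting: (found, recommended)} for settings that differ or are missing."""
    parser = configparser.ConfigParser(
        interpolation=None,   # values may contain '%'
        delimiters=("=",),    # cipher strings contain ':' so only split on '='
        strict=False,         # tolerate repeated stanzas or settings
    )
    parser.optionxform = str  # keep the case of setting names
    parser.read_string(conf_text)
    found = dict(parser["sslConfig"]) if parser.has_section("sslConfig") else {}
    return {
        key: (found.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if found.get(key) != wanted
    }


# Usage (the path is an assumption; adjust it to your installation):
# with open("/opt/splunk/etc/system/local/server.conf") as fh:
#     print(ssl_config_diff(fh.read()))
```

An empty result means all three settings match. Anything else lists what was found next to what the post recommends.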

A bit more detail on this can be found here on the Splunk website: https://docs.splunk.com/Documentation/Splunk/7.2.3/ReleaseNotes/KnownIssues (do a search for the issue number: SPL-141961)

Path Finder

Hi, does anybody have a solution for the above issue (HttpPubSubConnection - Unable to parse message from PubSubSvr:)?

I have the same issue. Other apps on the same client are deploying from the DS fine, but I still get the PubSubSvr error.

Communicator

Please check the targetUri. I got the same error and resolved it by correcting the URI in deploymentclient.conf.
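
A malformed targetUri is easy to miss by eye. The sketch below checks that a value has both a host and a numeric port. It assumes the usual layout, where targetUri sits under the [target-broker:deploymentServer] stanza of deploymentclient.conf and holds host:port, optionally with a scheme. The function name check_target_uri and the host ds.example.com are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def check_target_uri(value):
    """Return (host, port) if value looks like host:port, else raise ValueError."""
    text = value.strip()
    # urlsplit only recognises the host part when a scheme or '//' prefix is present.
    if "://" not in text:
        text = "//" + text
    parts = urlsplit(text)
    try:
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError as exc:
        raise ValueError(f"bad port in targetUri {value!r}: {exc}") from None
    if not parts.hostname:
        raise ValueError(f"missing host in targetUri {value!r}")
    if port is None:
        raise ValueError(f"missing port in targetUri {value!r}")
    return parts.hostname, port


print(check_target_uri("ds.example.com:8089"))
```

This only catches syntax problems such as a missing port or a letter typed in place of a digit. It cannot tell whether the host and port are the right ones for your deployment server.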

Engager

Hello,

I am getting the same error after restarting the splunkd service on a Windows machine.

Please advise.

05-25-2018 20:26:54.240 +0100 WARN HttpPubSubConnection - Unable to parse message from PubSubSvr:
05-25-2018 20:26:54.240 +0100 INFO HttpPubSubConnection - Could not obtain connection, will retry after=60.616 seconds.
05-25-2018 20:26:59.876 +0100 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
05-25-2018 20:27:11.877 +0100 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
05-25-2018 20:27:23.878 +0100 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
05-25-2018 20:27:35.878 +0100 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
05-25-2018 20:27:47.878 +0100 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected

Path Finder

I had this same issue. Under splunkforwarder/etc/apps/deploymentclient/metadata/ I found an additional file, ._local.meta, which was causing the problem. I removed it and restarted Splunk, and the problem was resolved.
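
Files whose names start with "._" are typically AppleDouble copies that macOS leaves behind when files are copied or archived, so more than one may be present. A minimal sketch for listing them under an apps directory follows. The function name find_dot_underscore_files is made up for this example, and the path in the usage comment is an assumption about where the forwarder is installed.

```python
from pathlib import Path


def find_dot_underscore_files(apps_dir):
    """Return sorted paths under apps_dir whose file name starts with '._'."""
    root = Path(apps_dir)
    return sorted(p for p in root.rglob("._*") if p.is_file())


# Usage (the install path is an assumption; adjust it to your forwarder):
# for path in find_dot_underscore_files("/opt/splunkforwarder/etc/apps"):
#     print(path)
```

The sketch only lists candidates. Review each one before deleting it, then restart Splunk as described above.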

Splunk Employee

Please accept one of the answers if it solved your problem, @seymouj.

Path Finder

Flushing iptables resolved the issue for me. I hope the command below works for you too.

iptables -F

cheers

Contributor

This line looks suspicious:

> 08-14-2015 11:34:17.153 -0700 WARN  HttpPubSubConnection - HTTP client error in http pubsub Read Timeout uri=http://10.156.101.127:2000/services/broker/connect/0A43BEC6-915B-488E-A60B-8241F1680FAF/IODWAPP242/271043/windows-x64/8089

Did you check that the correct targetUri is set in deploymentclient.conf? Sometimes a simple syntax error can cause this. Could you post the client config here, as well as the deployment server configuration (serverclass.conf)?
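
To see exactly which host and port the client is contacting, the uri= field of the line quoted above can be pulled apart with Python's standard library. This is only a sketch: the function name pubsub_target is made up for this example, and it assumes the URI appears as a single whitespace-free uri= token, as it does in that line.

```python
import re
from urllib.parse import urlsplit

URI_RE = re.compile(r"\buri=(\S+)")


def pubsub_target(log_line):
    """Return (scheme, host, port) from the uri= field of a log line, or None."""
    match = URI_RE.search(log_line)
    if match is None:
        return None
    parts = urlsplit(match.group(1))
    return parts.scheme, parts.hostname, parts.port


line = (
    "08-14-2015 11:34:17.153 -0700 WARN  HttpPubSubConnection - HTTP client error "
    "in http pubsub Read Timeout uri=http://10.156.101.127:2000/services/broker/"
    "connect/0A43BEC6-915B-488E-A60B-8241F1680FAF/IODWAPP242/271043/windows-x64/8089"
)
print(pubsub_target(line))  # ('http', '10.156.101.127', 2000)
```

For this line the client is connecting to port 2000, which is worth comparing against the management port the deployment server actually listens on.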

Splunk Employee

Before you start looking at tcpdumps, can you confirm you have full network connectivity from the host to the DS? You'll need TCP to the DS on 8089 (unless you changed the management port), as well as the ability to open dynamic ports for the download of the data from the DS to the client.
Additionally, make sure you have a serverclass defined with an app in it for the client you are trying to connect with.
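
The first check above (TCP to the DS on 8089) can be scripted so it is easy to repeat from several clients. A minimal sketch, using only the Python standard library, follows. The function name can_connect is illustrative, and ds.example.com in the usage comment is a placeholder for your deployment server.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage (placeholder host; 8089 is the default management port mentioned above):
# print(can_connect("ds.example.com", 8089))
```

Like a telnet test, a True result only shows that the port accepts TCP connections. It says nothing about SSL settings or whether the handshake itself succeeds.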

New Member

I am having this same issue while trying to add deployment client functionality to existing heavy forwarders. In fact, when I run tcpdumps, I see this error message in the logs:

"channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected"

even though ZERO packets have gone across the wire. The error appears without the DC even attempting to contact the server.

This is with Splunk Enterprise v6.2.2 with the deployment server running on a cluster master. I am not concerned about performance here; this is on a dev box just to, ahem, prove that deployment server works.

Contributor

As with the problem described above, check your connection: telnet, ping, firewall settings, and syntax in the config files.

Motivator

Missing serverclass is not going to cause the handshake to fail, surely? Unless things have changed in V6 it will just result in no class matches and hence an empty configuration deployment. The handshake will still complete.

A tcpdump is a very quick and direct method of answering a whole bundle of fundamental network questions by direct observation and without the need for any circumstantial inference, before tinkering with configurations. You will know whether the packets are getting through, whether they are complete, the exact nature of the response if they are. There is absolutely no point in refining configurations if the fault lies on one end not talking correctly to the other.

Contributor

+1 to what @Esix said.

Additionally, there are times when firewalls and auth/transparent proxies play evil and restrict the connection.

Motivator

They already mentioned that they have connectivity with a quick Telnet test. Admittedly I am assuming that the phrase "to the correct port" means what it says.

Motivator

[Edited to preface with the caveat that I am assuming your initial telnet test is correctly framed, and that your idea of "the correct port" is 8089.]

Are you using SSL? Is it correctly configured?

The only real way to judge here is to run comparative tcpdumps: one from a working machine (preferably one in the same routed network zone, assuming there is firewalling and segregation going on here) and one for the machine that is failing (which, since it is a Windows box, would require a Linux installation receiving duplicate packets from the switch).

You could also tcpdump on/for the deployment server, to see if something odd is going on there.

New Member

I went ahead and added enableSplunkdSSL = false to the server.conf file on both the deployment server and the client. This should remove any issues with SSL. The issue still persists.

Client splunkd.log:

08-14-2015 11:34:06.132 -0700 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
08-14-2015 11:34:06.132 -0700 DEBUG DC:PhonehomeThread - Attempting handshake
08-14-2015 11:34:06.132 -0700 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
08-14-2015 11:34:06.132 -0700 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
08-14-2015 11:34:06.132 -0700 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.0sec
08-14-2015 11:34:06.132 -0700 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.0sec (1)
08-14-2015 11:34:17.153 -0700 WARN HttpPubSubConnection - HTTP client error in http pubsub Read Timeout uri=http://10.156.101.127:2000/services/broker/connect/0A43BEC6-915B-488E-A60B-8241F1680FAF/IODWAPP242/2...
08-14-2015 11:34:17.153 -0700 WARN HttpPubSubConnection - Unable to parse message from PubSubSvr:
08-14-2015 11:34:17.153 -0700 INFO HttpPubSubConnection - Could not obtain connection, will retry after=71 seconds.
08-14-2015 11:34:18.132 -0700 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
08-14-2015 11:34:18.132 -0700 DEBUG DC:PhonehomeThread - Attempting handshake
08-14-2015 11:34:18.132 -0700 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
08-14-2015 11:34:18.132 -0700 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
08-14-2015 11:34:18.132 -0700 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.0sec
08-14-2015 11:34:18.132 -0700 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.0sec (1)
08-14-2015 11:34:30.133 -0700 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
08-14-2015 11:34:30.133 -0700 DEBUG DC:PhonehomeThread - Attempting handshake
08-14-2015 11:34:30.133 -0700 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
08-14-2015 11:34:30.133 -0700 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
08-14-2015 11:34:30.133 -0700 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.0sec
08-14-2015 11:34:30.133 -0700 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.0sec (1)
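
When the log is long, it can help to reduce it to just the handshake retries and the gaps between them. The sketch below does that for splunkd.log lines in the format shown above. The names handshake_retries and retry_gaps are made up for illustration, and the timezone offset is ignored on the assumption that all lines come from the same host.

```python
import re
from datetime import datetime

# Matches the "Will retry sending handshake message" lines from DC:DeploymentClient.
LINE_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{4}\s+\w+\s+"
    r"DC:DeploymentClient - .*err=(\w+)"
)


def handshake_retries(lines):
    """Return [(timestamp, err)] for each handshake retry message in lines."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S.%f")
            found.append((stamp, match.group(2)))
    return found


def retry_gaps(retries):
    """Return seconds between consecutive retries, rounded to the millisecond."""
    times = [stamp for stamp, _ in retries]
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]


# Usage (the file name is an assumption):
# with open("splunkd.log") as fh:
#     retries = handshake_retries(fh)
#     print({err for _, err in retries}, retry_gaps(retries))
```

Gaps that stay at roughly 12 seconds with err=not_connected, as in the log above, indicate the client is still in its initial retry loop and has never completed a handshake.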

Motivator

In that case - and assuming that your previously mentioned telnet test was correctly framed to the right port - I fall back to the suggestion of a tcpdump to analyse the actual network traffic at the deployment server.
