Installation

Issue with Deployment Manager connections after upgrading to 6.1.3.

mikecee
Explorer

Hi all,

I'm having trouble with deployment clients, and it seems to have started after an upgrade to 6.1.3. Deployment clients that were previously working fine have stopped picking up application changes and are no longer logging to their forwarders. I suspect an SSL issue (the upgrade process here re-installs clients from scratch), but I'd welcome any pointers or suggestions for things to try.

The Deployment Server (which is also an Indexer and Search Head -- it's a small application!) is seeing the following in splunkd.log:

09-09-2014 11:36:32.594 +1000 WARN  PubSubSvr - sender=connection_10.16.X.Y_8089_hostname.domain.tld_hostname_4AD3FF55-2746-1234-BB83-FE0EAB41B309 channel=deploymentServer/phoneHome/default Message not dispatched (connection invalid)

The deployment clients are seeing the following messages in splunkd.log:

09-05-2014 10:55:03.507 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:03.507 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:27.623 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:27.623 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:51.674 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:51.674 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.
09-05-2014 10:55:51.792 +1000 INFO  DC:HandshakeReplyHandler - Handshake done.
09-05-2014 10:55:51.799 +1000 WARN  PubSubConnection - Cannot convert str: error to a valid status, returning eRejected.
09-05-2014 10:56:51.907 +1000 INFO  DC:HandshakeReplyHandler - Handshake done.
09-05-2014 11:26:54.472 +1000 INFO  NetUtils - Error in connection() 111 - Connection refused
09-05-2014 11:28:54.751 +1000 INFO  DC:DeploymentClient - channel=deploymentServer/phoneHome/default Will retry sending phonehome to DS; err=not_connected
09-05-2014 11:29:54.752 +1000 INFO  DC:DeploymentClient - channel=deploymentServer/phoneHome/default Will retry sending phonehome to DS; err=not_connected
09-05-2014 11:29:54.815 +1000 INFO  HttpPubSubConnection - SSL connection with id: connection_10.16.X.Y_8089_hostname.domain.tld_hostname_4AD3FF55-2746-1234-BB83-FE0EAB41B309
09-05-2014 11:29:54.821 +1000 WARN  PubSubConnection - Cannot convert str: error to a valid status, returning eRejected.
09-05-2014 11:29:54.821 +1000 WARN  HttpPubSubConnection - Batch subscribe aborted as status is not eOk

(host names/GUIDs manually modified to protect the guilty 🙂)
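
When excerpts like the ones above run to hundreds of lines, it can help to tally repeated messages per component before reading them one by one. The following is a minimal sketch, assuming every line follows the `timestamp LEVEL Component - message` shape seen in these excerpts; the `summarize` helper and the regular expression are illustrative and not part of Splunk.

```python
import re
from collections import Counter

# Assumed line shape, based only on the excerpts in this post:
# "MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL  Component - message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<component>\S+) - (?P<message>.*)$"
)


def summarize(lines):
    """Count (level, component, message) triples; skip lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["component"], match["message"])] += 1
    return counts


# Three lines copied from the client-side excerpt above.
sample = [
    "09-05-2014 10:55:03.507 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.",
    "09-05-2014 10:55:27.623 +1000 WARN  DC:PhonehomeThread - No response to handshake for too long; starting over.",
    "09-05-2014 10:55:51.792 +1000 INFO  DC:HandshakeReplyHandler - Handshake done.",
]

# Print the most frequent messages first.
for (level, component, message), n in summarize(sample).most_common():
    print(f"{n:3d}  {level:5s} {component}: {message}")
```

In practice the list would come from reading splunkd.log on the client; the tally makes it easier to see whether the handshake warnings or the `eRejected` status dominate.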

heeeelp!

Mike

1 Solution

mikecee
Explorer

Well that was an interesting ride. Multiple problems. Unhilarity ensued.

The first problem was that two changes were made at once: the clients in question were upgraded at around the same time that some (broken) optimisations were made to serverclass.conf. It turns out that the optimisations weren't that optimal: earlier whitelist entries that were too broad were removed in favor of later whitelist entries, which were themselves broken (and this brokenness had been masked by the earlier entries). e.g.

whitelist.0 = lab*.example.com
whitelist.1 = prod*.example.com
whitelist.2 = 10.27.0.0/16
whitelist.3 = 10.193.7.0/24
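
One way to catch this kind of masking before removing a broad entry is to check, offline, which entries each known client would still match. The sketch below is an assumption-laden model, not Splunk's own matching logic: it treats entries containing `/` as CIDR ranges and everything else as `*` wildcard patterns, and the helper names (`entry_matches`, `matching_entries`) and client values are hypothetical. Whether the deployment server in a given Splunk version interprets each entry the same way should be confirmed against the serverclass.conf documentation for that version.

```python
import fnmatch
import ipaddress


def entry_matches(entry, hostname, ip):
    """Return True if one whitelist entry covers the client's hostname or IP.

    ASSUMPTION: entries with '/' are CIDR ranges; all others are wildcard
    patterns. This models the apparent intent of the entries only.
    """
    if "/" in entry:
        try:
            network = ipaddress.ip_network(entry, strict=False)
            return ipaddress.ip_address(ip) in network
        except ValueError:
            # Malformed range or IP: treat as "does not match".
            return False
    pattern = entry.lower()
    # Note: fnmatch also gives '?' and '[...]' special meaning.
    return (fnmatch.fnmatchcase(hostname.lower(), pattern)
            or fnmatch.fnmatchcase(ip, pattern))


def matching_entries(whitelist, hostname, ip):
    """List every whitelist entry that covers this client."""
    return [entry for entry in whitelist if entry_matches(entry, hostname, ip)]


whitelist = [
    "lab*.example.com",
    "prod*.example.com",
    "10.27.0.0/16",
    "10.193.7.0/24",
]

# Hypothetical clients; an empty result means no entry covers the client.
for hostname, ip in [("lab01.example.com", "192.168.1.5"),
                     ("db1.example.com", "10.27.3.4"),
                     ("web.other.com", "10.193.8.1")]:
    print(hostname, ip, matching_entries(whitelist, hostname, ip))
```

Running a list of real client names and addresses through a check like this, with and without the broad entries, shows which clients depended on the entries being removed.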

Once this was repaired, there was a confusing problem with an App that I was trying to include in two different classes (mutually exclusive in the real world). The Deployment manager seemed to have problems with that...
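
The post does not say what exactly went wrong with the shared app, so the following does not diagnose it; it is only a small sketch for locating apps that are assigned to more than one server class. It assumes app assignments appear as `[serverClass:<class>:app:<app>]` stanza headers in serverclass.conf; the function name and the sample class and app names are hypothetical.

```python
import re
from collections import defaultdict

# Stanza header of the form [serverClass:<class>:app:<app>]
APP_STANZA_RE = re.compile(
    r"^\[serverClass:(?P<cls>[^:\]]+):app:(?P<app>[^\]]+)\]\s*$"
)


def apps_in_multiple_classes(conf_text):
    """Map each app that appears under more than one server class to those classes."""
    classes_by_app = defaultdict(set)
    for line in conf_text.splitlines():
        match = APP_STANZA_RE.match(line.strip())
        if match:
            classes_by_app[match["app"]].add(match["cls"])
    return {app: sorted(classes)
            for app, classes in classes_by_app.items()
            if len(classes) > 1}


# Hypothetical serverclass.conf content for illustration only.
sample_conf = """
[serverClass:lab]
whitelist.0 = lab*.example.com

[serverClass:lab:app:shared_app]

[serverClass:prod]
whitelist.0 = prod*.example.com

[serverClass:prod:app:shared_app]

[serverClass:prod:app:prod_only_app]
"""

print(apps_in_multiple_classes(sample_conf))
```

A non-empty result points at the apps worth checking first when a client covered by more than one class behaves unexpectedly.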

hortonew
Builder

Are they all failing, or only some?
