Splunk DC:DeploymentClient err=not_connected

matthewssa
Path Finder

Hello!

I have the following problem:
A deployment server lost connectivity to all of its clients. If I change the phonehome interval for one of the clients to any value between 30 and 100, it eventually connects again. I was wondering if anyone has any thoughts on what would cause this. I can reproduce the issue by changing the value back to 300, which breaks the connection again.

DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected

deploymentclient.conf

[deployment-client]
phoneHomeIntervalInSecs = 300

[target-broker:deploymentServer]
targetUri = x.x.x.x:8089

obw1r3d
Engager

Just wanted to post to say that I'm experiencing the exact same behavior. If I set phoneHomeIntervalInSecs to 600, I get the "err=not_connected" message. If I change it back to 60, it works again with no issues.

I also tried playing around with setting handshakeRetryIntervalInSecs to a low value (since the docs mention that it is set to "one fifth of phoneHomeIntervalInSecs") but no dice.

ivanreis
Builder

Troubleshooting Forwarding Problem:
Is the management port on the receiver enabled? The management port defaults to 8089.
- You can run telnet or tcpdump against this port to check connectivity.

Is a firewall blocking the connection? The firewall should allow two-way traffic between the deployment server and the UF.
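
As a scripted alternative to the telnet check above, here is a minimal Python sketch that tries to open a TCP connection to the management port. The function name `can_connect` and the 5-second timeout are illustrative assumptions, not anything Splunk provides. A `True` result only shows that the TCP handshake completes; it does not exercise TLS or the deployment client handshake.

```python
import socket


def can_connect(host: str, port: int = 8089, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper for illustration; 8089 is the default management port.
    """
    try:
        # create_connection performs the TCP three-way handshake only.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, run from the UF host (replace the placeholder address):
# can_connect("x.x.x.x", 8089)
```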

Show all the deployment client messages from the client:
index=_internal component=DC* host=yourufname | stats count by message

Show all the deployment messages on the deployment server:
index=_internal component=DS* host=yourdeploymentserver | stats count by message

It seems you are having network connection issues. There is a similar issue in this answer -> https://answers.splunk.com/answers/488375/how-to-resolve-errnot-connected-error-in-deploymen.html

matthewssa
Path Finder

Thanks for the reply!

I took a look at some of those searches to look for additional messages.

I don't think it would be the firewall, because if I change the interval to 30 it can eventually connect to the DS and shows up in the Forwarder Management. I still double checked though and see no blocks and the port is also added in firewalld.

On the deployment server side, I didn't get any messages from that search.

On the client side, I saw the following messages:
- Attempted handshake xxx times. Will try to re-subscribe to handshake reply
- Phonehome thread start, intervals: handshakeRetry=60 phonehome=300.0
- channel=deploymentServer/phoneHome/default Will retry sending phonehome to DS; err=not_connected
- channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected

Also I saw some messages that look related.
HTTPPubSubConnection - Unable to parse message from PubSubSvr:
Could not obtain connection, will retry after=xxx.xxx seconds.

I ran tcpdump and captured two pcaps to look at in Wireshark, and I kinda wanna say it looks like the client is sending resets before the TLS connection can finish. Is that what is happening here?

interval set to 300 (Bad connection to the ds)
client SYN
ds SYN, ACK
client ACK
client TLSv1.2 Client Hello
ds ACK
ds Server Hello, Certificate, Server Hello Done
client ACK
client TLSv1.2 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
ds TLSv1.2 New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
client TLSv1.2 Application Data
ds ACK
client FIN, ACK
ds ACK
ds TLSv1.2 Application Data
client RST
ds TLSv1.2 Application Data
client RST
ds TLSv1.2 Encrypted Alert
client RST
ds FIN, ACK
client RST

interval set to 30 (Good connection to ds)
client SYN
ds SYN, ACK
client ACK
client TLSv1.2 Client Hello
ds ACK
ds Server Hello, Certificate, Server Hello Done
client ACK
client TLSv1.2 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
ds TLSv1.2 New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
client TLSv1.2 Application Data
ds TLSv1.2 Application Data
ds TLSv1.2 Application Data
client ACK
client FIN, ACK
ds TLSv1.2 Encrypted Alert
client RST
ds FIN, ACK
client RST
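
To make the difference between the two captures easier to see, here is a minimal Python sketch that compares them. The packet summaries are transcribed from the listings above; the helper names (`index_of`, `client_closed_before_reply`) are hypothetical. Reading the listings as posted, the TLS handshake appears to complete in both cases, and what differs is whether the client sends its FIN before or after the DS's first Application Data arrives. This is only a way of reading the posted data, not a diagnosis.

```python
# Packet summaries transcribed from the two captures in this thread.
HANDSHAKE = [
    "client SYN",
    "ds SYN, ACK",
    "client ACK",
    "client TLSv1.2 Client Hello",
    "ds ACK",
    "ds Server Hello, Certificate, Server Hello Done",
    "client ACK",
    "client TLSv1.2 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message",
    "ds TLSv1.2 New Session Ticket, Change Cipher Spec, Encrypted Handshake Message",
    "client TLSv1.2 Application Data",
]

# Interval set to 300 (bad connection to the DS).
BAD_300 = HANDSHAKE + [
    "ds ACK", "client FIN, ACK", "ds ACK",
    "ds TLSv1.2 Application Data", "client RST",
    "ds TLSv1.2 Application Data", "client RST",
    "ds TLSv1.2 Encrypted Alert", "client RST",
    "ds FIN, ACK", "client RST",
]

# Interval set to 30 (good connection to the DS).
GOOD_30 = HANDSHAKE + [
    "ds TLSv1.2 Application Data", "ds TLSv1.2 Application Data",
    "client ACK", "client FIN, ACK",
    "ds TLSv1.2 Encrypted Alert", "client RST",
    "ds FIN, ACK", "client RST",
]


def index_of(events, needle):
    """Return the position of the first matching event, or None."""
    return next((i for i, e in enumerate(events) if e == needle), None)


def client_closed_before_reply(events):
    """True if the client's FIN precedes the DS's first Application Data."""
    fin = index_of(events, "client FIN, ACK")
    reply = index_of(events, "ds TLSv1.2 Application Data")
    return fin is not None and (reply is None or fin < reply)


for name, events in (("300", BAD_300), ("30", GOOD_30)):
    print(name, client_closed_before_reply(events), events.count("client RST"))
```

For the 300-second capture the client closes before the reply and sends four RSTs; for the 30-second capture the reply arrives first and only the two closing RSTs appear.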

After all that, I went through and verified the cipherSuites and sslVersions between the client and DS in web.conf and server.conf; both are using Splunk's default values.

I also verified the date on each server, because I saw that it can be another issue when dealing with TLS connections.

santhoshi
Explorer

Hi Matthew,

I am receiving the same errors as you in splunkd.log on the UF. Things were working fine until today, but logs stopped being indexed after the IP address of the server where the UF is installed was changed. Is your issue resolved? If so, could you please explain what changes fixed it?

matthewssa
Path Finder

I'm sorry, I still have not found the resolution to my issue. I was able to dig deeper at one point and saw timeout messages in the Splunk internal logs. I would also see timeouts when going to the Splunk web page or any other tool's web page. For some reason, though, if I move the deployment server outside of our firewall and change the physical IP address of the deployment server to the NAT address that was being used on the firewall, all of the Splunk agents can suddenly connect. I believe this to be a network issue, but we have yet to figure out what it is. When I pulled some pcaps, every other line was one of the following: TCP Dup ACK, TCP Retransmission, or TCP Out-of-Order.

ivanreis
Builder

I have never played with this configuration before, so try one of these parameters in deploymentclient.conf:

handshakeRetryIntervalInSecs =
* This sets the handshake retry frequency, in seconds.
* Could be used to tune the initial connection rate on a new server
* Default: One fifth of 'phoneHomeIntervalInSecs'

handshakeReplySubscriptionRetry =
* If splunk is unable to complete the handshake, it will retry subscribing to
the handshake channel after this many handshake attempts
* Default: 10

appEventsResyncIntervalInSecs =
* This sets the interval at which the client reports back its app state
to the server.
* Fractional seconds are allowed.
* Default: 10 * 'phoneHomeIntervalInSecs'
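
To see how these defaults scale with the phonehome interval, here is a small Python sketch that applies the formulas quoted above. The function name `derived_intervals` and the `approxSecsBeforeResubscribe` key are hypothetical, and the re-subscribe figure is only a rough estimate (retry interval times the default of 10 attempts) that ignores how long each attempt takes.

```python
def derived_intervals(phone_home_secs: float,
                      resubscribe_after_attempts: int = 10) -> dict:
    """Apply the documented defaults to a given phoneHomeIntervalInSecs.

    Hypothetical helper; it only restates the spec text as arithmetic.
    """
    # Default: one fifth of phoneHomeIntervalInSecs.
    handshake_retry = phone_home_secs / 5
    return {
        "handshakeRetryIntervalInSecs": handshake_retry,
        # Default: 10 * phoneHomeIntervalInSecs.
        "appEventsResyncIntervalInSecs": 10 * phone_home_secs,
        # Rough estimate, not a Splunk setting: time spent retrying before
        # the client re-subscribes to the handshake channel.
        "approxSecsBeforeResubscribe": handshake_retry * resubscribe_after_attempts,
    }


print(derived_intervals(300))
print(derived_intervals(30))
```

With an interval of 300 the handshake retry works out to 60 seconds, which matches the "handshakeRetry=60 phonehome=300.0" log line posted earlier in the thread; with an interval of 30 it drops to 6 seconds.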

Are you able to deploy apps to those UF clients?
