I am trying to troubleshoot where my issue lies in implementing my own SSL certificates to secure the deployment server to client configuration.
DS server.conf:
[sslConfig]
caCertFile = cacert.crt
caPath = $SPLUNK_HOME/etc/auth/myOrg
requireClientCert = false
sslKeysfile = splunk-ds.ser.cer
sslKeysfilePassword = <passwordhash>
sslVersions = tls, -tls1.0
Client server.conf:
[sslConfig]
caCertFile = cacert.crt
caPath = $SPLUNK_HOME/etc/apps/config_uf/auth
sslKeysfile = splunk-uf.ser.cer
sslKeysfilePassword = <password>
sslVersions = tls, -tls1.0
sslVerifyServerCert = true
sslCommonNameToCheck = splunk-ds.myorg.com
Note that my client connects to the deployment server by hostname, whereas the common name on the certificate is a DNS name. I have the FQDN listed under the Subject Alternative Name, but according to the documentation for 6.3 you cannot use the SAN list for deployment server to client communication (I haven't tested whether it really doesn't work; that will be my next step).
What I am asking is whether there is a better way to troubleshoot this, because splunkd.log is entirely unhelpful. This is all it tells me on the client side:
11-18-2015 14:02:02.132 -0500 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
I have only done this to one of my clients, but I can't exactly sift through my deployment server logs very easily since there are over 12,000 systems hitting it. Any pointers? The common name provided in the client data IS the common name of the certificate used, it just isn't the hostname of the system.
Edit: This is all I can see on the deployment server. Right after it logs the messages stating that the downloaded updates completed successfully, there is a connection reset (due to the forwarder restarting) and then this:
11-18-2015 13:34:16.773 -0500 WARN HttpListener - Socket error from 10.10.175.64: Connection reset by peer
11-18-2015 14:01:38.920 -0500 WARN HttpListener - Connection from 10.10.175.64 didn't send us any data, disconnecting
11-18-2015 14:03:25.709 -0500 WARN HttpListener - Connection from 10.10.175.64 didn't send us any data, disconnecting
It is very odd that it takes about 30 minutes before it complains about not receiving any data. I get two messages, and then nothing further. These logs are also not much help in identifying the underlying issue.
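As a quick check on the timing, the gap between the first two warnings above can be computed from their timestamps. This is a minimal Python sketch, assuming the date, time, and UTC offset prefix seen in these splunkd.log lines; the helper name is made up for illustration.

```python
from datetime import datetime

# Timestamp layout assumed from the splunkd.log lines quoted above.
SPLUNKD_TS_FORMAT = "%m-%d-%Y %H:%M:%S.%f %z"

def splunkd_timestamp(line):
    """Parse the leading timestamp of a splunkd.log line (hypothetical helper)."""
    date, time, offset = line.split()[:3]
    return datetime.strptime(f"{date} {time} {offset}", SPLUNKD_TS_FORMAT)

reset = ("11-18-2015 13:34:16.773 -0500 WARN HttpListener - "
         "Socket error from 10.10.175.64: Connection reset by peer")
no_data = ("11-18-2015 14:01:38.920 -0500 WARN HttpListener - "
           "Connection from 10.10.175.64 didn't send us any data, disconnecting")

gap = splunkd_timestamp(no_data) - splunkd_timestamp(reset)
print(gap)  # 0:27:22.147000
```

So the first "didn't send us any data" warning arrives a little over 27 minutes after the reset.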
Well, that is entirely annoying. I did a restart of my forwarder just so I could see if there was any way to get help from freshly restarted logs or something and this is what I got back:
11-18-2015 15:16:01.895 -0500 INFO DS_DC_Common - Initializing the PubSub system.
11-18-2015 15:16:01.895 -0500 INFO DS_DC_Common - Initializing core facilities of PubSub system.
11-18-2015 15:16:01.981 -0500 INFO DC:DeploymentClient - Starting phonehome thread.
11-18-2015 15:16:01.981 -0500 INFO DS_DC_Common - Deployment Client initialized.
11-18-2015 15:16:01.981 -0500 INFO ServerRoles - Declared role=deployment_client.
11-18-2015 15:16:01.981 -0500 INFO DS_DC_Common - Deployment Server not available on a dedicated forwarder.
11-18-2015 15:16:01.981 -0500 INFO DC:PhonehomeThread - Phonehome thread start, intervals: handshakeRetry=0 phonehome=300.
11-18-2015 15:16:01.981 -0500 INFO DC:PhonehomeThread - handshakeRetryInterval=60000 ms
11-18-2015 15:16:01.981 -0500 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
11-18-2015 15:17:02.086 -0500 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
11-18-2015 15:18:02.191 -0500 INFO DC:DeploymentClient - channel=tenantService/handshake Will retry sending handshake message to DS; err=not_connected
11-18-2015 15:18:07.255 -0500 INFO HttpPubSubConnection - SSL connection with id: connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
11-18-2015 15:18:07.315 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
11-18-2015 15:19:02.296 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
11-18-2015 15:19:02.340 -0500 INFO DC:HandshakeReplyHandler - Handshake done.
11-18-2015 15:19:02.346 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
11-18-2015 15:24:02.886 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
11-18-2015 15:24:02.953 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.10.175.64_8089_myhost.myorg.com_myhost_4DADA81E-18EF-4D2B-9B5F-DB3F71ECB0AD
Way to just randomly start working. I am putting this in the answers section because my overall problem is fixed... somehow... but I'll not mark this as the accepted answer since I still don't know how to identify the underlying issue or what actually fixed it (other than an extra reboot I suppose?)
Edit: Well, I take it back. Apparently I did edit the outputs file and added the SSLAlternativeName option to reference the hostname of the deployment server. So maybe the spec file is simply wrong when it says this doesn't work? That addition is the only change that could explain it. I think the reboot just committed the change. I will toy with this some more tomorrow and see if that was the fix after all. If so, the specification file should be updated to reflect that this is a valid setting for deployment clients/servers.
Take a look at Duane Waddle's SSL Best Practices talk: slide, video
As a first step I would suggest testing via Splunk's OpenSSL on the command line from the UF endpoint to verify connectivity:
splunk cmd openssl s_client -connect <ip>:<port> -showcerts
This will dump the certs which you can then copy to a file and check manually:
splunk cmd openssl x509 -text -noout -in <file>.crt
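If you need to repeat that manual check across several certificates, the text output can be summarized with a small script. The following is only a sketch in Python: it assumes the usual layout of `openssl x509 -text -noout` output, and the function name, the sample text, and the `ds-host.myorg.com` entry are made up for illustration.

```python
import re

def summarize_cert_text(text):
    """Return (common_name, dns_sans) parsed from `openssl x509 -text -noout` output."""
    common_name = None
    dns_sans = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("Subject:"):
            # Handles both "CN=name" and "CN = name" spellings.
            match = re.search(r"CN\s*=\s*([^,/]+)", stripped)
            if match:
                common_name = match.group(1).strip()
        elif stripped.startswith("X509v3 Subject Alternative Name") and i + 1 < len(lines):
            # The SAN entries are printed on the line after the extension name.
            dns_sans = re.findall(r"DNS:([^,\s]+)", lines[i + 1])
    return common_name, dns_sans

# Hypothetical excerpt; real output contains many more fields.
SAMPLE = """\
Certificate:
    Data:
        Subject: C=US, O=MyOrg, CN=splunk-ds.myorg.com
        X509v3 extensions:
            X509v3 Subject Alternative Name:
                DNS:splunk-ds.myorg.com, DNS:ds-host.myorg.com
"""

print(summarize_cert_text(SAMPLE))
```

Comparing the returned common name and SAN list with the name the client actually connects to shows quickly whether a name mismatch is even possible.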
Other gotchas:
- Splunk expects PEM certs and keys.
- If older than version 6.1, key file must be unencrypted, and key and cert must be in separate files.
- Splunkd expects key, cert, and root-cert all in one file
You can cut the noise in Splunk by searching for the specific IP or hostname that you are testing with, limited to the _internal index. Since nothing is being sent from the endpoint, log into it and tail the logs (*NIX) or use event viewer (Windows).
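If searching the _internal index is not convenient, the same filtering can be done on a copy of splunkd.log. This is a rough Python sketch, assuming the log line layout shown in this thread (date, time, UTC offset, level, then the message). The function name and the second IP address are hypothetical.

```python
import re

def lines_for_client(log_lines, client_ip, levels=("WARN", "ERROR")):
    """Yield splunkd.log lines at the given levels that mention client_ip."""
    # Lookarounds keep 10.10.175.64 from matching inside 110.10.175.64 or 10.10.175.640.
    ip_pattern = re.compile(r"(?<![\d.])" + re.escape(client_ip) + r"(?!\d)")
    for line in log_lines:
        parts = line.split(None, 4)  # date, time, offset, level, message
        if len(parts) == 5 and parts[3] in levels and ip_pattern.search(parts[4]):
            yield line.rstrip("\n")

sample = [
    "11-18-2015 13:34:16.773 -0500 WARN HttpListener - Socket error from 10.10.175.64: Connection reset by peer",
    "11-18-2015 14:01:38.920 -0500 WARN HttpListener - Connection from 10.10.175.64 didn't send us any data, disconnecting",
    # Hypothetical line from a different client, which should be filtered out.
    "11-18-2015 14:02:00.000 -0500 WARN HttpListener - Socket error from 10.10.175.65: Connection reset by peer",
]

for hit in lines_for_client(sample, "10.10.175.64"):
    print(hit)
```

For a real file, pass an open file object instead of the list, for example `lines_for_client(open("splunkd.log"), "10.10.175.64")`.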
Thank you, yes, I was using his SSL slides to assist with this. This is actually the last thing I think I needed to get working before I am completed with getting everything switched to custom certs.
So the certificates aren't the issue, per se, as they work with everything else (forwarding and web connections).
I also could totally pull the certificates using the s_client command over port 8089 and it lists the common name as I expected it to.
If it turns out that is the fix, please leave docs feedback on the spec page so the documentation team can update it, and edit your answer with the details. 🙂
Yeah, looks to be that way, having messed with it a couple times. The alternative name was what was causing the issues. I have accepted my own answer for that reason and will make a comment on the docs page.