A forwarder just up and quit sending logs to my indexer one morning last week. I did not notice until Monday (yesterday) afternoon. I asked the admin there to restart splunkd, and when that did not correct the issue, I asked him to send me the logs. I saw several things that didn't look good. The first was the second of these two back-to-back forwarder splunkd log entries:
09-15-2014 16:30:31.998 -0700 WARN DeploymentClient - Phonehome thread is now started.
09-15-2014 16:30:31.998 -0700 WARN DeploymentClient - Unable to send handshake
The second was in the part of the forwarder splunkd logs where you normally see the TCP ports being initialized:
09-15-2014 16:30:33.165 -0700 INFO TcpInputConfig - SSL clause not found or servercert not provided - SSL ports will not be available
And then on the indexer I found this in the splunkd.log:
09-15-2014 15:12:40.669 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xxx.xxx:49266. error:140760FC:SSL routines:SSL23_GET_CLIENT_HELLO:unknown protocol
09-15-2014 15:18:17.960 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xxx.xxx:52373. error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
Any ideas why this has suddenly started happening?
Good idea, but I just updated the SSL cert on my indexer and all my forwarders last May. They all have the same server cert, so it does not explain this one forwarder. I just got the forwarder release from the admin that runs it:
# cat /opt/splunk/etc/splunk.version
VERSION=5.0.3
BUILD=163460
PRODUCT=splunk
PLATFORM=SunOS-sparcv9
I also had him do a 'splunk cmd btool outputs list --debug', and all I see in the email he sent back are lines from /opt/splunk/etc/system/default/outputs.conf, and nothing from the deployment app's outputs.conf at all. That is very bizarre.
Well, it turns out that the /opt/splunk/etc/apps/OITOUTPUT9998 app went missing for whatever reason, and that app contains the server.conf and outputs.conf in its default directory. The Splunk forwarder apparently restarted at some point, and those files were not there for it to read. The question is: why did the app go away? That's a mystery. It is spec'd out in the global section of the serverclass.conf file, so every forwarder described there should have a copy of it. #mystery
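For anyone following along, a minimal sketch of what that serverclass.conf arrangement looks like on the deployment server. The server-class name and whitelist here are illustrative assumptions; only the app name OITOUTPUT9998 is from the actual deployment:

```
# /opt/splunk/etc/system/local/serverclass.conf on the deployment server (sketch)
[global]
# Match every phoning-home client (illustrative; the real whitelist may be narrower)
whitelist.0 = *

[serverClass:allForwarders]

# Deploy the outputs app (which carries outputs.conf and server.conf) to the class
[serverClass:allForwarders:app:OITOUTPUT9998]
stateOnClient = enabled
restartSplunkd = true
```

With a stanza like this, every matching forwarder should pull down OITOUTPUT9998 on its next phonehome, which is why its disappearance from etc/apps was so strange.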
The mystery deepens. There are no crash files, and no evidence of a system crash or reboot. No configuration changes were made to the deployment application for either of these systems, so nothing should have made the Splunk forwarder restart. Yet it must have restarted in order to notice that the server.conf and outputs.conf files in /opt/splunk/etc/apps/OITOUTPUT9998/default were missing.
@sowings proposes that the power cut to the Deployment Server, and the subsequent power restore, acted like a "reload deploy-server" command with unanticipated edits to the serverclass.conf (or to the application bundles). It turns out that some twenty forwarders were affected by this, with various parts of their deployed configurations missing. Port 9997 is not defined on any of these forwarders; only 9998. But since that port is defined in the deployed application, the loss of outputs.conf left the forwarder unable to establish a connection with the deployment server/indexer. The solution was to delete whatever was left in /opt/splunk/var/run and /opt/splunk/etc/apps, and then define the outputs in /opt/splunk/etc/system/local/outputs.conf.
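A sketch of what that recovery outputs.conf might look like. The indexer hostname and certificate paths below are placeholders, not values from this thread; port 9998 and the use of SSL are from the discussion above:

```
# /opt/splunk/etc/system/local/outputs.conf on the forwarder (sketch)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# Placeholder hostname; SSL receiving port 9998 per the thread
server = indexer.example.com:9998
sslCertPath = $SPLUNK_HOME/etc/auth/server.pem
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem
sslPassword = password
```

Putting this in system/local rather than in the deployed app means the forwarder can always reach the indexer, even if the deployment app goes missing again, since settings in system/local survive bundle redeployments.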
As soon as the Splunk forwarder was restarted, it contacted the DS, got its bundle, and everything was fine.