As described in http://answers.splunk.com/answers/168693/forwarder-suddenly-stopped-sending-logs-appears-to.html, a Splunk forwarder quit sending logs at 9/11/2014 08:35:08 (the time of the last log entry received from it on the indexer). Like all my forwarders, it is given its configuration by the deployment server via a deployment app. On the deployment server, /opt/splunk/etc/system/local/serverclass.conf has, below the [global] stanza, a stanza called [serverClass:OIT_OUTPUT_9998]. As a result, every forwarder has two files, outputs.conf and server.conf, in $SPLUNK_HOME/etc/apps/OIT_OUTPUT_9998/default. Without these files, no SSL ports are available during Splunk startup, and those ports are required for the forwarder to send logs. And these files simply disappeared. In fact, on this forwarder -- which we will call Derek -- the entire OIT_OUTPUT_9998 directory went missing. There is no evidence of crash files and no evidence of a system crash. No one restarted Splunk, but obviously it had to restart in order to lose the settings it had loaded from those configuration files after they went missing. After seeing the log entries described in the (above-referenced) Splunk Answers question, I ran

/opt/splunk/bin/splunk cmd btool outputs list --debug

and confirmed that the information from the outputs.conf file was not in the running forwarder's configuration.
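For anyone reconstructing this setup, here is a minimal sketch of what the mapping and the deployed outputs.conf typically look like. The stanza names match ours, but the whitelist, indexer host, group name, and certificate paths below are illustrative placeholders rather than our actual values.

On the deployment server, in serverclass.conf:

[serverClass:OIT_OUTPUT_9998]
whitelist.0 = *

[serverClass:OIT_OUTPUT_9998:app:OIT_OUTPUT_9998]
restartSplunkd = true

In the deployed app's default/outputs.conf:

[tcpout]
defaultGroup = ssl_9998

[tcpout:ssl_9998]
server = indexer.example.com:9998
sslCertPath = $SPLUNK_HOME/etc/auth/server.pem
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem

With the app directory gone, none of these tcpout settings show up in the btool output, which is exactly what I saw.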
So, then, that would just be a mystery to wonder about except for one thing: another forwarder had almost exactly the same issue. We'll call this forwarder Andrew. Andrew quit sending logs at 9/11/2014 09:06:40:495 (or 16:06:40:495 GMT) -- about half an hour after Derek did. Andrew is run by a completely different admin group -- nothing in common personnel-wise. Of course, a quick

/opt/splunk/bin/splunk cmd btool outputs list --debug

verified it was the same problem: like Derek, Andrew was missing the output configuration it needed to send logs. However, in Andrew's case the $SPLUNK_HOME/etc/apps/OIT_OUTPUT_9998 directory did not go missing... but the outputs.conf and server.conf files in it did.
And because I have begun to suspect something to do with the deployment server process, I forced a reload to see if anything went amiss, but nothing did.
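For the record, the forced reload was just the standard CLI command, run on the deployment server itself (the path assumes our default /opt/splunk install):

/opt/splunk/bin/splunk reload deploy-server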
And I just found that yet another forwarder -- Hail -- quit sending logs at 9/11/2014 10:28:55 AM. I bet I know what caused that, but I do not have admin access to this system and can't verify it until I hear back from someone who does.
The Derek forwarder is on Splunk build=163460, version=5.0.3, os=SunOS, arch=sun4v
The Andrew forwarder is on Splunk build=149561, version=5.0.2, os=Windows, arch=x64
The Hail forwarder is on Splunk build=196940, os=Windows. I don't have the version just now.
The deployment server is on Version=5.0.5 Build=179365 Product=splunk Platform=Linux-i386
This isn't strictly an answer per se, but you should examine the $SPLUNK_HOME/var/run/serverclass.xml file on each forwarder. This is a record of the classes that the client is a member of (as instructed by the deployment server), and the apps that came with that membership. The XML should be pretty straightforward to read.
If that file recorded that the host was supposed to have app X for serverclass S, and the app mapping for serverclass S then changed so that it no longer included app X, the client will remove the app wholesale.
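If you only want a quick look at which serverclasses and apps the client believes it has, without reading the whole file, something along these lines is usually enough (illustrative; the exact element names in that XML can vary by version):

grep -iE 'serverclass|app' $SPLUNK_HOME/var/run/serverclass.xml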
I propose that the power cut to the DS, and the subsequent power restore, acted like a "reload deploy-server" command that picked up unanticipated edits to serverclass.conf.
An update... I have discovered that about 20 forwarders were affected in this way. But on one I am working on now, all of the deployment app directories were removed, and on this one I can't seem to get the outputs working. Is there a way to manually unpack a bundle from the deployment server, since the forwarder does not seem to be receiving it? For instance, FTP it over and use the CLI to unpack it?
Yes, absolutely. The next time the client "phones home" it should retrieve the content anew.
Note that there may be bundles (tar files) of the content already waiting on the client's disk under the $SPLUNK_HOME/var/run hierarchy. See the serverclass.xml on the client for details.
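If you do end up unpacking one by hand, here is a minimal sketch, assuming a bundle for the app is already sitting on the client. The bundle location and name below are illustrative -- check serverclass.xml for the real path -- and this assumes the .bundle file is a plain tar archive:

cd $SPLUNK_HOME/etc/apps
tar xvf $SPLUNK_HOME/var/run/<path-from-serverclass.xml>/OIT_OUTPUT_9998.bundle
$SPLUNK_HOME/bin/splunk restart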
Only one other person (my boss) and I have access to these files on the deployment server. Besides, they show modification dates from last May.
But my boss reminded me that on the morning of the 11th -- the morning all this happened -- someone had inadvertently cut power to the deployment server.
I have been looking into the Hail forwarder's issues. In Hail's case, the default directory ($SPLUNK_HOME/etc/apps/OIT_OUTPUT_9998/default) was missing. So each of these three cases is different, though in the end the same important files were gone.
I would suggest that someone messed with the deployment server around this time and deployed the bogus changes?