After upgrading a distributed Splunk Enterprise environment from 9.0.5 to 9.1.1, a lot of issues were observed. The most pressing one was the unexpected wiping of all inputs.conf and outputs.conf files from the heavy forwarders.
All configuration files are still present and intact on the deployment server. However, after unpacking the updated version and bringing Splunk back up on the heavy forwarders, the inputs/outputs files had been wiped from all apps and were not being fetched from the deployment server again. As a result, none of the heavy forwarders were listening for incoming traffic or forwarding to the indexers.
Based on previous experience, there is no way to "force push" configuration from the deployment server when all instances are "happy", which means manual inspection and repair of all affected apps.
So now I am curious as to why this happened. If there was something wrong with the configuration, I'd expect errors to be thrown, not entire files to be deleted. Any input on why this happened, or how to find out, would be appreciated.
UPDATE:
By now it is very clear what happened: a bunch of default folders were simply deleted during the update. There are a few indications of this in different log files.
11-08-2023 12:21:19.816 +0100 INFO AuditLogger - Audit:[timestamp=11-08-2023 12:21:19.816, user=n/a, action=delete-parent,path="/opt/splunk/etc/apps/<appname>/default/inputs.conf"
This was unfortunate, as the deploymentclient.conf file was stored in <appname>/default and got erased together with almost all inputs.conf/outputs.conf files and a bunch of other things stored in the default folder.
I don't get the impression that this is expected behaviour, so now I am curious regarding the cause of this highly strange outcome.
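For anyone who needs to work out which apps were affected, here is a minimal Python sketch that groups the deleted paths by app. It only assumes that the relevant log entries look like the delete-parent line quoted above; the function name and the regular expressions are my own, and other audit entries may be formatted differently.

```python
import re
import sys
from collections import defaultdict

# Matches the fields seen in the delete-parent line quoted above.
# Assumption: other deletion entries follow the same action=...,path="..." shape.
DELETE_RE = re.compile(r'action=delete-parent,\s*path="(?P<path>[^"]+)"')
# Extracts the app name from a path under .../etc/apps/<appname>/...
APP_RE = re.compile(r'/etc/apps/(?P<app>[^/]+)/')


def deleted_paths_by_app(lines):
    """Group deleted file paths by app name (hypothetical helper)."""
    result = defaultdict(list)
    for line in lines:
        match = DELETE_RE.search(line)
        if not match:
            continue  # not a delete-parent entry
        path = match.group("path")
        app = APP_RE.search(path)
        key = app.group("app") if app else "(outside etc/apps)"
        result[key].append(path)
    return dict(result)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the log file to scan as the first argument.
    with open(sys.argv[1], errors="replace") as handle:
        for app_name, paths in sorted(deleted_paths_by_app(handle).items()):
            print(app_name)
            for deleted in paths:
                print("   " + deleted)
```

The output is a per-app list of deleted files, which at least narrows down what needs to be inspected or redeployed.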
The deployment server seems to have come up in a bad state, with random read access errors for some files. That meant that some folders were simply not fetched from the deployment server. Once we got the server back to a fully functional state, the observed issues were resolved.
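To check for this kind of read problem on the deployment server, something like the following Python sketch could be used. It walks a directory tree and reports every file or folder that cannot be read. The function name is hypothetical, and the directory to scan is an assumption: on a default install it would typically be etc/deployment-apps under the Splunk home directory. Run it as the same OS user that Splunk runs as, otherwise the result says little about what Splunk itself can read.

```python
import os


def unreadable_files(root):
    """Return (path, error) pairs for entries under root that cannot be read."""
    problems = []

    def on_error(err):
        # os.walk calls this when a directory cannot be listed.
        problems.append((err.filename, str(err)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Reading one byte is enough to trigger a permission or I/O error.
                with open(path, "rb") as handle:
                    handle.read(1)
            except OSError as err:
                problems.append((path, str(err)))
    return problems


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for path, error in unreadable_files(sys.argv[1]):
            print(f"{path}: {error}")
```

An empty result only shows that the files were readable at the time of the check; intermittent errors like the ones we saw may need repeated runs.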