With a simple systemd unit file you can tell systemd how to start and stop a Splunk instance, but if the instance is restarted outside of systemd (by a cluster bundle push or a plain /opt/splunk/bin/splunk restart, for example) it falls out of systemd's management: systemctl status splunk will no longer return up-to-date information on the process.
This can lead to management software such as the built-in systemd watchdog or Chef/Puppet falsely believing the core splunkd process is down.
Instead of having systemd monitor Splunk's PID file in forking mode, you can launch Splunk directly under systemd in simple mode, which runs splunkd as a service directly under systemd. You must be a bit careful though: once you use this, you should only manage Splunk start/stop via systemctl commands. Otherwise, if a restart is triggered from the UI, say, Splunk will start itself while systemd also tries to start it, and you may run into a race condition.
However, I would recommend reconfiguring boot-start so that Splunk knows it is running under systemd. More specifically, if you then issue splunk restart, Splunk will know it is running under systemd and internally call systemctl start.
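A sketch of how that reconfiguration might look on Splunk versions that support systemd-managed boot-start (the -systemd-managed flag exists in Splunk 7.2.2 and later, so it predates nothing in older 6.x installs; the install path, splunk user, and generated unit name Splunkd are assumptions):

```shell
# Remove any existing init/systemd boot-start configuration first.
/opt/splunk/bin/splunk disable boot-start

# Re-enable with systemd management: this generates a unit file and tells
# splunkd it is running under systemd, so "splunk restart" delegates to
# systemctl instead of double-starting the process.
/opt/splunk/bin/splunk enable boot-start -user splunk -systemd-managed 1

systemctl daemon-reload
systemctl start Splunkd    # manage start/stop via systemctl from here on
```

From this point on, UI-triggered restarts and systemctl stay in agreement about the process state.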
ExecStartPost=/bin/bash -c "chown -R : /sys/fs/cgroup/cpu/system.slice/%n"
ExecStartPost=/bin/bash -c "chown -R : /sys/fs/cgroup/memory/system.slice/%n"
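The owner:group argument in those ExecStartPost lines appears to have been lost; a completed sketch of what they typically look like, assuming splunkd runs as user and group splunk and the unit is named splunk.service (%n expands to the unit name), both of which are assumptions here:

```shell
# Re-grant the service's cgroup directories to the splunk user after start,
# so a splunkd that restarts itself out-of-band can still write its PIDs
# into the service cgroup. "splunk:splunk" and the unit name are assumed.
chown -R splunk:splunk /sys/fs/cgroup/cpu/system.slice/splunk.service
chown -R splunk:splunk /sys/fs/cgroup/memory/system.slice/splunk.service
```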
I was struggling with the same issue you have. We use Puppet, and systemctl is-active splunk.service returned active when RemainAfterExit=yes was set, even if Splunk had crashed, causing Puppet not to restart it.
The solution seems to be as follows. Do not set RemainAfterExit=yes, so the service actually becomes inactive when Splunk restarts during a rolling restart. Then, to prevent Puppet from messing things up when it tries to restart the process, add this to the unit file:
PIDFile=/opt/splunk/var/run/splunk/splunkd.pid. This causes systemd to pick up the newly created PID of the already running process from splunkd.pid when Puppet issues systemctl restart splunk.service, without it actually restarting Splunk.
The other thing I added is Restart=on-failure. This causes systemd to start Splunk again when the tracked PID exits with a non-zero exit status (e.g. pkill splunk or a crash).
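One way to check that Restart=on-failure behaves as described is to simulate a crash; a sketch, assuming the unit is installed as splunk.service and the main process is splunkd:

```shell
# Death by SIGKILL counts as a failure, so Restart=on-failure should
# bring the service back; a clean "splunk stop" (exit 0) should not.
pkill -KILL splunkd
sleep 10
systemctl is-active splunk.service   # expected to report "active" again
```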
Here is my Unit file:
[Unit]
Description=Splunk indexer service
Wants=network.target
After=network.target
Requires=thp-disable.service

[Service]
Type=forking
Restart=on-failure
ExecStart=/opt/splunk/bin/splunk start
ExecStop=/opt/splunk/bin/splunk stop
ExecReload=/opt/splunk/bin/splunk restart
StandardOutput=syslog
LimitNOFILE=65535
LimitNPROC=16384
TimeoutSec=300
PIDFile=/opt/splunk/var/run/splunk/splunkd.pid

[Install]
WantedBy=multi-user.target
[Unit]
Description=Splunk
After=network.service
Wants=network.service

[Service]
Type=forking
User=splunk
Group=splunk
TimeoutSec=200
RemainAfterExit=yes
PIDFile=/opt/splunk/var/run/splunk/conf-mutator.pid
ExecStart=/opt/splunk/bin/splunk start --answer-yes --no-prompt --accept-license
ExecStop=/opt/splunk/bin/splunk stop
ExecReload=/opt/splunk/bin/splunk restart
StandardOutput=null
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
At first I thought this unit file with RemainAfterExit and PIDFile populated resolved the problem; however, with further testing and study of the systemd documentation I've found it to be ineffective.
Due to the way systemd handles process execution (systemctl -> cgroup -> process), restarting the Splunk service without using systemctl commands will drop the process out of management whether or not you set the PID file.
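You can observe the drop-out directly by comparing the PID systemd tracks with the one splunkd wrote after an out-of-band restart (unit name and path as in the unit files in this thread; adjust to your install):

```shell
# After a restart outside of systemctl, these two PIDs diverge:
# systemd still reports the old (dead) main PID, or 0.
systemctl show splunk.service --property=MainPID
cat /opt/splunk/var/run/splunk/splunkd.pid
```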
Right now I only see two options when running Splunk through systemd unit files:
1) Run the unit file with RemainAfterExit=yes. This forces systemd to mark the process as active even after the tracked splunkd PID has exited. Unfortunately, this also means that if Splunk crashes the process is still marked as healthy.
2) Run the unit file without RemainAfterExit=yes (defaults to no). This means that if systemd sees the root splunkd process exit (even if it soon after restarts) it marks the service as down. This of course doesn't play nice with watchdog/puppet/chef etc.
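One workaround on the monitoring side is to have Puppet/Chef check liveness from the PID file rather than trusting systemctl is-active; a minimal sketch (the function name and default path are assumptions, and kill -0 needs permission to signal the splunk user's process, so run it as root or the splunk user):

```shell
# check_splunkd_pidfile [PIDFILE]
# Prints "up" and returns 0 if the PID recorded in the file is alive,
# otherwise prints "down" and returns 1. Independent of systemd's view,
# so it stays accurate across out-of-band restarts.
check_splunkd_pidfile() {
    pidfile="${1:-/opt/splunk/var/run/splunk/splunkd.pid}"
    [ -r "$pidfile" ] || { echo down; return 1; }
    pid=$(head -n1 "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo up
        return 0
    else
        echo down
        return 1
    fi
}
```

A config-management exec/onlyif check can call this instead of systemctl is-active and restart the service only when it reports down.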
To my understanding, for this to be resolved either systemd or Splunk would have to make significant codebase changes.
Even using the sysvinit compat layer (the default on RHEL 7 installs where splunk enable boot-start is run) causes the same issue: the splunkd process restarting, stopping, or crashing causes systemd to lose track of the process state, marking it as "active (exited)" (it seems to use RemainAfterExit=yes like my unit file). I'm stumped.
In a cluster setup it is actually even worse: without RemainAfterExit=yes the servers will just shut down and never come up again when you trigger a rolling restart (manually or as part of a bundle push). Clean exit code (0), no error or warning from splunkd; systemd thinks the shutdown was intentional (active (exited)).
We are running Splunk 6.3.3.