Systemd unit with pid tracking for Splunk

mwirth
Explorer

With a simple systemd unit file you can tell systemd how to start and stop a Splunk instance. However, if the Splunk instance is restarted outside of systemd (due to a cluster bundle push or a simple /opt/splunk/bin/splunk restart, for example), it falls out of systemd's management: systemctl status splunk no longer returns up-to-date information on the process.

This can lead to issues with management software like the inbuilt systemd watchdog or chef/puppet falsely believing the core splunkd process is down.
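
One way to observe the mismatch is to look at what systemd has recorded for the unit and compare it with the process that is actually running. The following is a minimal sketch, assuming the unit is called splunk.service as in the question. show_prop is a hypothetical helper name, not part of systemd or Splunk; it only parses the KEY=VALUE lines that systemctl show prints.

```shell
#!/bin/sh
# show_prop: hypothetical helper (not part of systemd or Splunk).
# Reads KEY=VALUE lines on stdin, the format printed by `systemctl show`,
# and prints the value of the requested key. Returns 1 if the key is absent.
show_prop() {
    key="$1"
    while IFS= read -r line; do
        case "$line" in
            "$key="*)
                # Strip the "KEY=" prefix and print the rest of the line.
                printf '%s\n' "${line#"$key="}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example use on a host with systemd (assumes the unit name splunk.service):
#   systemctl show splunk.service -p MainPID -p ActiveState -p SubState | show_prop SubState
```

If the SubState or MainPID reported here disagrees with what ps shows for splunkd, the unit has most likely dropped out of systemd's tracking.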

dimrirahul
Explorer

Instead of having systemd monitor Splunk's PID file in forking mode, you can launch Splunk directly under systemd in simple mode, which runs Splunk as a service directly under systemd. You must be a bit careful, though: once you do this, you should only start and stop Splunk via systemctl commands. Otherwise, if there is a restart from the UI, for example, Splunk will restart itself while systemd also tries to start it again, and you may run into a race condition.

However, I would recommend reconfiguring boot-start. That way Splunk knows it is running under systemd: if you issue splunk restart, Splunk will detect that it is running under systemd and internally call systemctl start.

[Service]
Type=simple
Restart=always
ExecStart=/opt/splunk/bin/splunk _internal_launch_under_systemd
LimitNOFILE=65536
SuccessExitStatus=51 52
RestartPreventExitStatus=51
RestartForceExitStatus=52
User=
Delegate=true
MemoryLimit=100G
CPUShares=1024
PermissionsStartOnly=true
ExecStartPost=/bin/bash -c "chown -R : /sys/fs/cgroup/cpu/system.slice/%n"
ExecStartPost=/bin/bash -c "chown -R : /sys/fs/cgroup/memory/system.slice/%n"

rabbidroid
Path Finder

I was struggling with the same issue you have. We use Puppet, and systemctl is-active splunk.service returned active when RemainAfterExit=yes was set, even if Splunk had crashed, so Puppet did not restart it.

The solution seems to be as follows. Do not set RemainAfterExit=yes, so that the unit actually becomes inactive when Splunk restarts after a rolling restart. To prevent Puppet from messing things up when it then tries to restart the service, add PIDFile=/opt/splunk/var/run/splunk/splunkd.pid to the unit file. When Puppet issues systemctl restart splunk.service, this causes systemd to start tracking the newly created PID of the already running process from the splunkd.pid file, without actually trying to restart it.

The other thing I added is Restart=on-failure. This causes systemd to start Splunk again when the tracked PID exits with a non-zero exit status (e.g. after pkill splunk or a crash).

Here is my Unit file:

[Unit]
Description=Splunk indexer service
Wants=network.target
After=network.target
Requires=thp-disable.service

[Service]
Type=forking

Restart=on-failure
ExecStart=/opt/splunk/bin/splunk start
ExecStop=/opt/splunk/bin/splunk stop
ExecReload=/opt/splunk/bin/splunk restart
StandardOutput=syslog
LimitNOFILE=65535
LimitNPROC=16384
TimeoutSec=300
PIDFile=/opt/splunk/var/run/splunk/splunkd.pid

[Install]
WantedBy=multi-user.target
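
For a configuration management tool, a liveness check that does not depend on systemd's view of the unit can be a useful cross-check against systemctl is-active. The following is a minimal sketch, not part of the unit file above. pidfile_alive is a hypothetical helper name, and it assumes that the first line of the pid file holds the main splunkd PID.

```shell
#!/bin/sh
# pidfile_alive: hypothetical helper (not part of systemd or Splunk).
# Succeeds if the first PID listed in the given pid file belongs to a
# process that exists and that the calling user is allowed to signal.
pidfile_alive() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1

    # Assumption: the first line of the file holds the main splunkd PID.
    IFS= read -r pid < "$pidfile" || [ -n "$pid" ] || return 1

    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty, or not a plain decimal number
    esac

    # kill -0 delivers no signal; it only reports whether signalling would work.
    # Run this as root or as the Splunk user, otherwise it can fail with EPERM.
    kill -0 "$pid" 2>/dev/null
}

# Example use with the pid file path from the unit file above:
#   pidfile_alive /opt/splunk/var/run/splunk/splunkd.pid || echo "splunkd is not running"
```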

woodcock
Esteemed Legend

You should click Accept on this answer to close the question.

mwirth
Explorer

[Unit]
Description=Splunk
After=network.service
Wants=network.service

[Service]
Type=forking
User=splunk
Group=splunk
TimeoutSec=200
RemainAfterExit=yes
PIDFile=/opt/splunk/var/run/splunk/conf-mutator.pid
ExecStart=/opt/splunk/bin/splunk start --answer-yes --no-prompt --accept-license
ExecStop=/opt/splunk/bin/splunk stop
ExecReload=/opt/splunk/bin/splunk restart
StandardOutput=null
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

EDIT:
At first I thought this unit file, with RemainAfterExit and PIDFile populated, resolved the problem. However, after further testing and studying the systemd documentation, I've found it to be ineffective.
Due to the way systemd handles process execution (systemctl->cgroup->process), restarting the Splunk service without using systemctl commands will drop the process out of management whether or not you set the PID file.

Right now I only see two options when running Splunk through systemd unit files:

1) Run the unit file with RemainAfterExit=yes. This forces systemd to mark the process as active even after the tracked splunkd PID has exited. Unfortunately, this also means that if Splunk crashes the process is still marked as healthy.

2) Run the unit file without RemainAfterExit=yes (defaults to no). This means that if systemd sees the root splunkd process exit (even if it soon after restarts) it marks the service as down. This of course doesn't play nice with watchdog/puppet/chef etc.

To my understanding, for this to be resolved either systemd or Splunk would have to make significant codebase changes.

Even using the sysvinit compat layer (the default on RHEL7 installs where splunk enable boot-start is run) causes the same issue: when the splunkd process restarts, stops, or crashes, systemd loses track of the process state and marks it as "active (exited)" (it seems to be using RemainAfterExit=yes, like my unit file). I'm stumped.
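
The failure modes in options 1 and 2 above can be told apart by combining systemd's view of the unit with an independent check of whether splunkd is actually running. The following is a rough sketch of that decision table, not something systemd or Splunk provides. classify_state and its output labels are hypothetical names; the first argument is the unit's SubState as systemd reports it for a service (running, exited, dead, or failed).

```shell
#!/bin/sh
# classify_state: hypothetical helper (not part of systemd or Splunk).
#   $1 = SubState of the unit as reported by systemd
#        (running, exited, dead or failed)
#   $2 = "alive" or "dead", from an independent check of the splunkd process
classify_state() {
    case "$1:$2" in
        running:alive)           echo "in-sync" ;;
        exited:dead)             echo "masked-crash" ;;  # option 1: unit still shows active (exited)
        exited:alive)            echo "untracked" ;;     # splunkd was restarted outside systemctl
        dead:alive|failed:alive) echo "untracked" ;;     # option 2: unit marked down, Splunk is up
        dead:dead|failed:dead)   echo "down" ;;
        *)                       echo "unknown" ;;       # e.g. a transient state during a restart
    esac
}

# Example use:
#   classify_state exited dead
```

Only the "masked-crash" and "down" results call for a restart; "untracked" means Splunk is running but systemd is no longer following it.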

dschregenberger
Engager

In a cluster setup it is actually even worse: without RemainAfterExit=yes, the servers will just shut down and never come up again when you trigger a rolling restart (manually or as part of a bundle push). splunkd exits with a clean exit code (0) and no error or warning. Systemd thinks the shutdown was intentional (active, exited).

We are running Splunk 6.3.3.
