When splunk does not shut down gracefully (system crash, application crash, etc.), a PID file is left behind. The presence of such a PID file prevents splunk from starting; to get splunk to start again, you have to delete the PID file manually.
What is the benefit of this PID file?
Are there any best practices/manual checks that should be performed prior to deleting the PID file and restarting splunk? How do best practices differ among splunk instances having universal forwarder vs. receiver vs. indexer vs. search head roles?
Is there any reason why splunk itself shouldn't be enhanced to check for the presence of such PID files at startup and delete them if present?
In the meantime, I figure I'll integrate a cleanup function into host startup scripts (if splunk is enabled and not started and a pid file is present, then delete the pid file and restart the agent, else noop).
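To make that concrete, here is a minimal sketch of the cleanup I have in mind for a Linux host; the SPLUNK_HOME path and pid-file location are assumptions from a default forwarder install, and the "splunk enabled" check is omitted for brevity:

#!/bin/sh
# Hypothetical startup-script cleanup (sketch): if splunkd is not running but a
# pid file was left behind, treat it as stale, remove it, and start the agent.
SPLUNK_HOME=/opt/splunkforwarder                      # assumed install path
PID_FILE="$SPLUNK_HOME/var/run/splunk/splunkd.pid"    # typical pid-file location

if [ -f "$PID_FILE" ] && ! pgrep -x splunkd >/dev/null 2>&1; then
    rm -f "$PID_FILE"
    "$SPLUNK_HOME/bin/splunk" start
fi
# otherwise: noop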
This is a follow-on thought from recent experience and a related question/answer.
To answer your first question first, the pid files act to avoid data loss/corruption in the event of multiple programs believing they are in charge of aspects of the install at any given time. Bad scenarios would be things like two copies of splunk modifying the file-tracking database, or the command line editing a conf file at the same time as the splunkd service/server is modifying the same file.
As for why it breaks, the problems occur when the pid file refers to a process ID that is actually running (even if it is not splunk).
There are actually two problems.
A problem can occur with conf-mutator.pid
which attempts to provide exclusion between conf modification by the command line (splunk.exe) and conf ownership by the running services (splunkd). When this file is left stale (crash, power loss, etc.) and collides with a running program, it will cause the splunkd launch to fail early in its setup/bringup. This can occur on both UNIX and Windows, and it's far more common on Windows due to its frequent PID recycling. The Windows manifestation of this problem was fixed in 6.1.3 by checking the name of the program (if the program is not 'splunk.exe' or 'splunkd.exe', we now consider the pid-lock to be invalid and delete it). The UNIX manifestation of this problem was not fixed until 6.1.4 and the not-yet-released 6.2.
A problem can also occur with splunkd.pid
which serves a few purposes in tracking splunkd and all its subprocesses. One of its purposes is to avoid running multiple copies of splunkd against the same SPLUNK_HOME, which could be disastrous. If this file is left stale and pointing at a currently running PID, it will also prevent startup, although the test is done in the launcher, splunk.exe, before splunkd is ever started. This problem can only occur on UNIX. (On Windows this is handled equivalently by the service manager, and there is no splunkd.pid.) In 6.1.4 and the not-yet-released 6.2 we again perform some checks to see if it is really a valid splunkd. (It took a lot more effort than you might expect.)
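To illustrate the kind of check described above (not Splunk's actual implementation), a staleness test on Linux boils down to: is the PID named in the file alive, and if so, is it really a splunkd? The pid-file path and the use of /proc are assumptions for a default Linux install:

# Hypothetical staleness check (sketch), Linux only.
PID_FILE=/opt/splunk/var/run/splunk/splunkd.pid
if [ -f "$PID_FILE" ]; then
    pid=$(head -n 1 "$PID_FILE")                 # assume first line is the main splunkd PID
    name=$(cat /proc/"$pid"/comm 2>/dev/null)    # empty if no such process
    if [ -z "$name" ]; then
        echo "stale: PID $pid is not running; safe to remove the pid file"
    elif [ "$name" != "splunkd" ]; then
        echo "stale: PID $pid was recycled by '$name'; the pid-lock is invalid"
    else
        echo "valid: a real splunkd ($pid) is running; do not remove the pid file"
    fi
fi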
What's the possible cause of this happening on a clustered indexer in a Linux deployment?
Outstanding! Thank you for the thorough response and for the good news, jrodman.
As I mentioned in an earlier comment, this behavior is most problematic for us among Windows hosts running Universal Forwarder v6.1.2 and below. I look forward to increased resilience in 6.1.4 and beyond.
... and there it is in the 6.1.4 release notes:
Startup script should handle stale PID files gracefully after server crashes. (SPL-36597)
I have just upgraded to 6.1.4. I use splunk stop before doing a restart or for any kind of configuration change. From v6 onwards I get a message that the service didn't respond in a timely manner and will be stopped forcibly. Is this a known issue, or is it going to be addressed? If a user doesn't have access to the servers and the splunk restart doesn't occur gracefully, the services stay stopped until someone intervenes manually. Any comment?
This problem sounds unrelated.
I imagine the root cause is different but the outcome is similar in that your more frequent stops of the service and resultant forcible stops by splunk itself amplify the PID problem. Have you experienced the PID problem since your recent upgrade to 6.1.4?
I am glad splunk forcibly stops itself as a workaround. Before splunk managed this situation, we had to customize our software installation packages for splunk upgrades to include stop, watch, and kill-after-a-specified-time logic so that the upgrade package could complete an upgrade from old splunk to new splunk.
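For anyone curious, that stop/watch/kill logic was roughly of the following shape (a simplified sketch; the 120-second timeout and install path are assumptions):

# Hypothetical upgrade-package helper (sketch): ask splunk to stop, give it a
# bounded amount of time, then force-kill any leftover splunkd so the package
# upgrade can proceed.
SPLUNK_HOME=/opt/splunk
timeout 120 "$SPLUNK_HOME/bin/splunk" stop
if pgrep -x splunkd >/dev/null 2>&1; then
    pkill -9 -x splunkd
fi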
That said, please do create a new thread so that I can close this one and track progress on your question separately!
Do I start a new thread?
I ran a quick test in my environment (6.1.0). I did the following:
pkill splunk
service splunk start
Here is some of the output:
Starting Splunk...
splunkd 19554 was not running.
Stopping splunk helpers...
Done.
Stopped helpers.
Removing stale pid file... done.
<continues starting>
So it will remove the stale pid file. What is your version and OS?
Hi people.
I'm using Splunk version 6.2.1 in a Linux virtual machine (Elementary OS 0.2.1, 64-bit).
I solved this problem just by restarting the virtual machine.
Below is my log after restarting the Splunk service:
root@vm-cpm-splunk-elementary:/opt/splunk/bin# ./splunk restart
splunkd 27299 was not running.
Stopping splunk helpers...
Done.
Stopped helpers.
Removing stale pid file... done.
splunkd is not running.
Splunk> Take the sh out of IT.
Checking prerequisites...
Checking http port [8000]: open
Checking mgmt port [8089]: open
Checking appserver port [127.0.0.1:8065]: open
Checking kvstore port [8191]: open
Checking configuration... Done.
Checking critical directories... Done
Checking indexes...
Validated: _audit _blocksignature _internal _introspection _thefishbucket history main summary
Done
Checking filesystem compatibility... Done
Checking conf files for problems...
Done
All preliminary checks passed.
Starting splunk server daemon (splunkd)...
Done
Waiting for web server at http://127.0.0.1:8000 to be available.............. Done
If you get stuck, we're here to help.
Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cpm-splunk-elementary:8000
The misbehavior of 6.1.x is not believed to exist in 6.2.1. In 6.2.1 on Linux, splunk should only refuse to start up due to a pid file if the pid file actually does point to a real splunk process. This would mean that starting splunk is not needed, because it is already running, or alternatively that a splunk shutdown never completed somehow (in which case, kill might be appropriate).
It seems to be back in Splunk 6.5.x if one is running splunkd as a limited user. I just had this happen on an indexer in prod that was restarted via VMware.
I was able to recreate it in my QA environment. All one needs to do is mock up a pid file with IDs from other processes that belong to another user, like root.
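For completeness, the repro presumably amounts to something like the following (a sketch; it assumes a default /opt/splunk install and a limited 'splunk' account, and should only be tried in a test environment):

# Hypothetical repro (sketch): put a live PID owned by root (PID 1 here) into
# splunkd.pid, then try to start splunk as the limited user.
echo 1 > /opt/splunk/var/run/splunk/splunkd.pid
chown splunk:splunk /opt/splunk/var/run/splunk/splunkd.pid
su - splunk -c "/opt/splunk/bin/splunk start"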
Thanks alacercogitatus!
For me, this was last observed on Splunk Universal Forwarder (UF) version 6.1.2 on Windows 7 desktop.
I ran a remote health check across our UF install base and found that about 10% of agents were dormant due to the presence of PID files.