Occasionally, our Windows terminal servers kill the UF service during shutdown, leaving a stale .pid file behind. As a result, Splunk does not start up again without manual intervention. With a large number of Windows machines, that's not an option, so I'm looking for a workaround. Splunk Support currently doesn't have a bug-fix schedule for me.
I see two ways: either clean up the .pid file through a custom script or similar, or make the UF shut down more quickly so that it finishes before Windows kills the service during shutdown.
How do you handle this issue?
http://docs.splunk.com/Documentation/Splunk/latest/ReleaseNotes/6.1.4
Resolved : Startup script should handle stale PID files gracefully after server crashes. (SPL-36597)
This should work for both *ix and Windows, I believe.
Awesome, thanks David!
I don't have the means to test right now. Does that fix apply to both full and UF installs on both Windows and Unix platforms?
I have the same issue.
I just opened a support case to see if they have any plan to find a viable solution ([141636])
Kind regards,
I have now tested this with 6.0, and the issue exists there as well.
The key problem here is that during startup the forwarder (on both Windows and Linux) only checks whether its old PID exists as a process, but does not check whether that process actually is a Splunk process. As a result, the forwarder believes it is already running if a different process happens to have the old Splunk PID.
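For anyone going the custom-script route, here is a minimal sketch of the missing check: treat the .pid file as stale unless one of the PIDs in it belongs to a process whose name starts with "splunk". This is not what Splunk itself does. The function names, the assumption that the file holds whitespace-separated PIDs, and the name prefix are all mine, and the path to the .pid file has to be supplied by the caller.

```python
import csv
import io
import os
import subprocess
import sys
from typing import Callable, Optional


def process_name(pid: int) -> Optional[str]:
    """Return the executable name for pid, or None if it cannot be found."""
    if sys.platform == "win32":
        # tasklist prints one CSV row per match, or an info message if none.
        out = subprocess.run(
            ["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=False,
        ).stdout
        for row in csv.reader(io.StringIO(out)):
            if len(row) > 1 and row[1] == str(pid):
                return row[0]
        return None
    try:
        # Linux only: /proc/<pid>/comm holds the process name.
        with open(f"/proc/{pid}/comm") as fh:
            return fh.read().strip()
    except OSError:
        return None


def pid_file_is_stale(pid_file: str,
                      lookup: Callable[[int], Optional[str]] = process_name) -> bool:
    """True if pid_file exists and none of its PIDs is a Splunk process."""
    if not os.path.exists(pid_file):
        return False  # nothing to clean up
    with open(pid_file) as fh:
        # Assumption: the file contains whitespace-separated PIDs.
        pids = [int(tok) for tok in fh.read().split() if tok.isdigit()]
    for pid in pids:
        name = lookup(pid)
        # Assumption: Splunk executables have names starting with "splunk".
        if name and name.lower().startswith("splunk"):
            return False
    # No PID maps to a Splunk process (or the file held no PIDs at all).
    return True


def remove_if_stale(pid_file: str,
                    lookup: Callable[[int], Optional[str]] = process_name) -> bool:
    """Delete pid_file if it is stale; return True if it was removed."""
    if pid_file_is_stale(pid_file, lookup):
        os.remove(pid_file)
        return True
    return False
```

Running `remove_if_stale(...)` from a startup task before the forwarder service starts would be one way to use it. The `lookup` parameter only exists so the logic can be tried out without a real Splunk process.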
I've done some more testing, and 5.0.5 does not fix this issue. Additionally, I've now found precise steps to reproduce, tested under 5.0.1 and 5.0.5:
In the release notes for 5.0.5 I see this entry:
• Splunk on Windows does not start/restart properly with deployment server, fails with FATAL loader - Timed out waiting for config lock; see splunkd_stderr.log for details. Exiting. (SPL-70075)
This feels similar to my issue: the logged events in case of a start failure are the same. Can anyone confirm this?
Hi martin_mueller
you forgot the third option: increasing the WaitToKillServiceTimeout registry entry in Windows. I have not tested it, but it may be a way to go for you.
In HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control you will find the string value WaitToKillServiceTimeout.
Double-click it and, in the Edit String window, change the Value data from the default of 12000 (12 seconds; the value is in milliseconds) to a larger value, then click OK to save the change.
hope this helps...
cheers, MuS
Thanks for your input. That should work on its own. I'll have to see how feasible it is around here to change registry settings on thousands of machines.
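For rolling the registry change out to many machines, one option is to generate a .reg file once and import it through whatever software distribution is already in place. The sketch below only builds the file content. The function name and the example value of 30000 ms are my own assumptions, not a tested recommendation.

```python
def build_reg_file(timeout_ms: int) -> str:
    """Return .reg file text that sets WaitToKillServiceTimeout (milliseconds)."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be a positive number of milliseconds")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control]",
        # The value is stored as a string (REG_SZ), hence the quoted number.
        f'"WaitToKillServiceTimeout"="{timeout_ms}"',
        "",
    ]
    return "\r\n".join(lines)


# Example (hypothetical file name and value):
# with open("waittokill.reg", "w", encoding="utf-16", newline="") as fh:
#     fh.write(build_reg_file(30000))
```

The resulting file can be imported with `reg import waittokill.reg` from an elevated prompt. The new timeout is generally only picked up after a reboot.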