Solved: Looking for a workaround for Windows UFs not start...

martin_mueller · ‎09-17-2013

Occasionally, our Windows terminal servers kill the UF service during shutdown, leaving a stale .pid file behind. This results in Splunk not starting up, requiring manual intervention. With a large number of Windows machines that's not an option, so I'm looking for a workaround. Splunk Support currently doesn't have a bug-fix schedule for me.

I see two ways: either clean up the .pid file through a custom script or similar, or make the UF shut down more quickly so the service isn't killed during Windows shutdown.
How do you handle this issue?

dshakespeare_sp · ‎10-01-2014

http://docs.splunk.com/Documentation/Splunk/latest/ReleaseNotes/6.1.4

Resolved: Startup script should handle stale PID files gracefully after server crashes. (SPL-36597)

dshakespeare_sp · ‎10-02-2014

This should work for both *nix and Windows, I believe.

martin_mueller · ‎10-01-2014

Awesome, thanks David!

I don't have the means to test right now. Does that fix apply to both full and UF installs on both Windows and Unix platforms?

letienne · ‎10-31-2013

I have the same issue.

I just opened a support case (141636) to see if they have any plans for a viable solution.

Kind regards,

martin_mueller · ‎10-01-2013

I have now tested this with 6.0, and the issue exists there as well.

martin_mueller · ‎09-30-2013

The key problem here is that during startup the forwarder (on both Windows and Linux) only checks whether its old PID exists as a process, but does not check whether that process actually is a Splunk process. As a result, the forwarder believes it is already running if a different process happens to have the old Splunk PID.
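
To illustrate the missing check, here is a rough Python sketch of what a cleanup script could do before starting the service. It is not how Splunk itself handles this. The function names, the injected `process_name_of` lookup, the assumption that the daemon shows up as `splunkd` / `splunkd.exe`, and the assumption that the first token in the .pid file is the PID are all mine, so verify them against your install before using anything like this.

```python
from pathlib import Path
from typing import Callable, Optional

# Assumption: the Splunk daemon is listed as "splunkd" ("splunkd.exe" on Windows).
SPLUNK_PROCESS_NAMES = {"splunkd", "splunkd.exe"}

# Hypothetical lookup type: returns the process name for a PID, or None if no such process.
NameLookup = Callable[[int], Optional[str]]


def pid_is_live_splunk(pid: int, process_name_of: NameLookup) -> bool:
    """True only if the PID exists AND belongs to a Splunk process."""
    name = process_name_of(pid)
    return name is not None and name.lower() in SPLUNK_PROCESS_NAMES


def remove_if_stale(pid_file: Path, process_name_of: NameLookup) -> bool:
    """Delete pid_file if its PID is not a running Splunk process.

    Returns True if the file was removed.
    """
    try:
        # Assumption: the first whitespace-separated token in the file is the PID.
        pid = int(pid_file.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return False  # missing or unparsable file: leave it alone
    if pid_is_live_splunk(pid, process_name_of):
        return False  # Splunk really is running, so the file is not stale
    pid_file.unlink()
    return True
```

The lookup is passed in so the logic can be tried without touching real processes. On a real machine it could be backed by something like the third-party psutil package (`psutil.Process(pid).name()`), or by parsing `tasklist` output.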

martin_mueller · ‎09-30-2013

I've done some more testing, and 5.0.5 does not fix this issue. Additionally, I've now found precise steps to reproduce, tested under 5.0.1 and 5.0.5:

1. Start a UF as a Windows service.
2. Kill the process.
3. Edit the conf-mutator.pid file in /var/run/splunk to change the PID to the PID of an existing process. This simulates a different process having been assigned the UF's old PID while the UF was down. During system startup the chances of this are considerable.
4. Attempt to start the UF service. This will fail with logged events like this: FATAL loader - Timed out waiting for config lock

martin_mueller · ‎09-26-2013

In the release notes for 5.0.5 I see this entry:

• Splunk on Windows does not start/restart properly with deployment server, fails with FATAL loader - Timed out waiting for config lock; see splunkd_stderr.log for details. Exiting. (SPL-70075)

This feels similar to my issue; the logged events in case of a start failure are the same. Can anyone confirm that this is the same problem?

MuS · ‎09-17-2013

Hi martin_mueller

You forgot a third option: increasing the WaitToKillServiceTimeout registry entry in Windows. I have not tested it, but it might be a way to go for you.

In HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control you will find the string value WaitToKillServiceTimeout.

Double-click it and, in the Edit String window, change the Value data from the default of 12000 (12 seconds) to a higher value, then click OK to save the change.
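
For more than a handful of machines, clicking through regedit won't scale, so here is an untested sketch that builds the text of a .reg file for the same change, which could then be pushed out with whatever deployment tooling you already have. The function name and the 30000 ms value are only illustrative assumptions. It also assumes the entry is a plain string value, as the Edit String window suggests.

```python
REG_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
VALUE_NAME = "WaitToKillServiceTimeout"


def build_reg_file(timeout_ms: int) -> str:
    """Return .reg file text that sets WaitToKillServiceTimeout (in milliseconds)."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be a positive number of milliseconds")
    lines = [
        "Windows Registry Editor Version 5.00",  # standard .reg file header
        "",
        f"[{REG_KEY}]",
        f'"{VALUE_NAME}"="{timeout_ms}"',  # string value, written in quotes
        "",
    ]
    return "\r\n".join(lines)  # Windows line endings


if __name__ == "__main__":
    # 30000 ms (30 seconds) is just an example, not a recommendation.
    print(build_reg_file(30000))
```

The resulting file can be imported on a test machine first (for example with `reg import`) to check that the service actually gets enough time to shut down cleanly.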

hope this helps...

cheers, MuS

martin_mueller · ‎09-17-2013

Thanks for your input. That should work on its own, but I'll have to see how feasible it is around here to change registry settings on thousands of machines.

Looking for a workaround for Windows UFs not starting up after an improper shutdown (SPL-36597)
