Sometimes Monit fails to read Splunk's pid file and decides too quickly that Splunk is down.
There are several common scenarios:
- Splunk is restarted for no reason (the pid file was being updated by a search process or child process, and Monit gave up too quickly)
- Splunk is started twice during a restart (Splunk deletes the pid file when shutting down, and Monit can restart it too quickly, ending up with two Splunk processes and a port conflict)
Please tune your Monit logic to retry or wait more cycles before jumping the gun.
Here are some pointers for a "real world" Splunk process monitor in Monit that will start Splunk when 'splunk status' reports it down.
First off, we want better downtime detection. We want to do away with pid checking and port checking, which often lead to confusion because pids can be somewhat fluid with Splunk. We also know that "normal" operation of Splunk can involve a restart (a rolling restart, a GUI-invoked administrative restart after installing an app, etc). It is best to use Splunk's own "splunk status" command and exploit the fact that its exit status carries meaning (0 means it's running; any other status means it is not running or there was an issue determining the state).
Secondly, Monit tends to want to shut off the service before restarting it. This can get ugly if Splunk was actually running. So rather than using restart logic, just use 'splunk start' to get it going again ('splunk start' is effectively a no-op if Splunk is already running, as opposed to a stop-start).
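As a rough illustration of the start-only idea, here is a minimal bash sketch. The function name start_if_down is hypothetical and not part of Splunk or Monit. The status and start commands are passed in as arguments so the logic can be tried without Splunk installed; on a real host they would be something like "/opt/splunk/bin/splunk status" and the service start command.

```shell
#!/bin/bash
# Hypothetical helper: run the start command only when the status command
# exits non-zero. Nothing is ever stopped first.
# Usage: start_if_down "<status command>" "<start command>"
start_if_down() {
    local status_cmd=$1
    local start_cmd=$2
    # Left unquoted on purpose so "cmd arg" strings split into words.
    if $status_cmd >/dev/null 2>&1; then
        echo "running, nothing to do"
    else
        echo "down, starting"
        $start_cmd
    fi
}

# Stand-ins for the real commands: `true` exits 0, `false` exits 1.
start_if_down true true
start_if_down false true
```

Monit applies the same decision itself once the check is configured, so this is only meant to show why a plain start is safer than a stop-start.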
Note: this does no alerting. It merely starts Splunk when it is detected down in two consecutive checks 5 minutes apart (you might have to tweak the settings if your global Monit polling frequency is different).
Assuming you have the following setting in /etc/monitrc:
# Polling frequency
set daemon 20
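In /etc/monit/splunk_health.sh (new file):
#!/bin/bash
# Capture the output of 'splunk status' and remember its exit code
TEXT=$(/opt/splunk/bin/splunk status 2>&1)
STATUS=$?
# Print the captured output on stderr, then exit with Splunk's own status
>&2 echo "$TEXT"
exit $STATUS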
In /etc/monit/conf.d/splunk.monitrc (probably a new file):
check program splunkd with path "/etc/monit/splunk_health.sh" every 15 cycles
    start program = "/usr/sbin/service splunk start"
    stop program = "/usr/sbin/service splunk stop"
    if status != 0 for 2 cycles then start
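To adapt the timing to a different polling frequency, the arithmetic can be sketched in bash as follows. The function names are hypothetical, and the worst-case figure assumes, as described in this post, that the two failures are counted over checks that are actually run (one every 15 cycles).

```shell
#!/bin/bash
# Values taken from the configuration in this post.
DAEMON_INTERVAL=20   # seconds, from "set daemon 20"
EVERY_CYCLES=15      # from "every 15 cycles"
FAILED_CHECKS=2      # from "for 2 cycles"

# Seconds between two runs of the health script.
check_interval() {
    echo $(( $1 * $2 ))
}

# Assumed worst case: Splunk goes down right after a passing check, so the
# required number of failing checks takes this many seconds to accumulate.
worst_case_delay() {
    echo $(( $1 * $2 * $3 ))
}

echo "check interval: $(check_interval "$DAEMON_INTERVAL" "$EVERY_CYCLES") s"
echo "worst-case delay before start: $(worst_case_delay "$DAEMON_INTERVAL" "$EVERY_CYCLES" "$FAILED_CHECKS") s"
```

With the values above the health script runs every 300 seconds (5 minutes). If your "set daemon" value differs, adjust "every N cycles" so that the product stays near the interval you want.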