Solved: Why is monit sometimes restarting Splunk?

mataharry · ‎02-24-2016

I have Linux servers running Splunk, and I use monit to check my processes.

But sometimes I see an issue where monit restarts Splunk unexpectedly.

yannK · ‎02-24-2016

Sometimes monit fails to read Splunk's pid file and decides too quickly that Splunk is down.

There are several common scenarios:
- Splunk restarted for no reason (the pid file was being updated by a search process or child processes, and monit gave up too quickly).
- Splunk started twice during a restart (Splunk deletes the pid file when shutting down, and monit can start it again too quickly, ending up with two Splunk processes and a port conflict).

Please tune your monit logic to retry / wait more cycles before jumping the gun.
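
To illustrate the "wait more cycles" idea, here is a minimal shell sketch of the logic. It is not how monit implements it; in monit this tolerance is normally expressed on the test line itself (for example with a "for N cycles" clause). The function name down_after_retries and its arguments are hypothetical, chosen only for this example.

```shell
#!/bin/bash
# Hypothetical helper: report "down" only if a check command fails
# on every one of several attempts, pausing between attempts.
# Usage: down_after_retries ATTEMPTS PAUSE_SECONDS COMMAND [ARGS...]
# Returns 0 if the service looks down (all attempts failed), 1 otherwise.
down_after_retries() {
    local attempts=$1 pause=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # A single successful check is enough to call the service up.
        if "$@" >/dev/null 2>&1; then
            return 1
        fi
        # Do not sleep after the last attempt.
        if ((i < attempts)); then
            sleep "$pause"
        fi
    done
    return 0
}
```

For example, `down_after_retries 3 10 /opt/splunk/bin/splunk status && echo "splunk looks down"` would only print the message after three failed status checks about 10 seconds apart, assuming Splunk is installed under /opt/splunk.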

awyszkowski · ‎02-24-2016

Here are some pointers for a "real world" Splunk process monitor in Monit that starts Splunk again when 'splunk status' reports it down.

First off, we want better downtime detection. We want to do away with pid checking and port checking, which often leads to confusion because pids can be somewhat fluid with Splunk. We also know that "normal" operation of Splunk can involve a restart (a rolling restart, a GUI-invoked administrative restart after installing an app, etc). It is best to use Splunk's own "splunk status" command and exploit the fact that its exit status carries meaning: 0 means Splunk is running, and any other status means it is not running or its state could not be determined.
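
As a rough sketch of relying on the exit status alone, the function below maps the result of "splunk status" to a one-line verdict. The default path matches the one used in the health script in this post; the SPLUNK_BIN variable and the splunk_state function are hypothetical names for this example.

```shell
#!/bin/bash
# Default path assumes Splunk is installed under /opt/splunk;
# SPLUNK_BIN is a hypothetical override used only in this sketch.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"

# Print a one-line verdict based only on the exit status of "splunk status".
splunk_state() {
    local rc=0
    "$SPLUNK_BIN" status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "running"
    else
        # Non-zero covers both "stopped" and "could not determine state".
        echo "not running or state unknown (exit status $rc)"
    fi
}
```

Calling `splunk_state` on a host never touches pid files or ports, so it is not confused by child processes or by a restart that is still in progress.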

Secondly, monit tends to want to shut off the service prior to restarting it. This can lead to ugliness if Splunk was actually running. So rather than using restart logic, just use a 'splunk start' to get it going again ('splunk start' is effectively a no-op if Splunk is already running, as opposed to a stop-start).

Note - this does no alerting, and merely starts Splunk when it is detected down on two consecutive checks 5 minutes apart (you might have to tweak your settings if your global monit polling frequency is different).
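
The 5-minute figure follows from the settings in this post (a 20-second polling cycle and a check run every 15 cycles). A small worked calculation, assuming, as the post describes, that the failure counter only advances on cycles where the check actually runs:

```shell
#!/bin/bash
# Values from the post: "set daemon 20", "every 15 cycles", "for 2 cycles".
daemon_interval=20   # seconds between monit polling cycles
every_cycles=15      # the health check runs once per this many cycles
failed_checks=2      # consecutive failed checks required before "start"

# Time between two runs of the health check.
check_period=$((daemon_interval * every_cycles))
# Time from the first failed check to the one that confirms it.
confirm_span=$(((failed_checks - 1) * check_period))

echo "health check runs every ${check_period}s"
echo "first failure is confirmed ${confirm_span}s later"
```

With a different global polling interval, adjust the cycle count so the product stays near the interval you want; for instance, a 60-second daemon interval with a check every 5 cycles gives the same 300 seconds.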

-

Assuming you have the following setting in /etc/monitrc

# Polling frequency
set daemon 20

In /etc/monit/splunk_health.sh (new file)

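#!/bin/bash
# Capture the output of "splunk status" (stdout and stderr) and its exit status.
TEXT=$(/opt/splunk/bin/splunk status 2>&1)
STATUS=$?
# Echo the captured output on stderr, then pass the exit status through.
echo "$TEXT" >&2
exit "$STATUS"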

In /etc/monit/conf.d/splunk.monitrc (probably a new file)

check program splunkd with path "/etc/monit/splunk_health.sh" every 15 cycles
    start program = "/usr/sbin/service splunk start"
    stop program = "/usr/sbin/service splunk stop"
    if status != 0 for 2 cycles then start
