Getting Data In

Why is monit sometimes restarting Splunk?

mataharry
Communicator

I have Linux servers running Splunk, and I use monit to check my processes.

But sometimes I see an issue where monit restarts Splunk unexpectedly.


yannK
Splunk Employee

Sometimes monit may fail to read Splunk's pid file and decide too quickly that Splunk is down.

There are several common scenarios:
- Splunk restarted for no reason (the pid file was being updated by a search process or child processes, and monit gave up too quickly)
- Splunk started twice during a restart (Splunk deletes the pid file when shutting down, and monit can restart it too quickly, ending up with 2 Splunk processes and a port conflict)

Please tune your monit logic to retry / wait more cycles before jumping the gun.
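
To illustrate the "retry before jumping the gun" idea outside of monit, here is a minimal shell sketch. The helper name confirm_down and its arguments are hypothetical and are not part of monit or Splunk. It treats a service as down only when every one of several spaced-out checks fails, which is the same reasoning behind asking monit to wait more cycles.

```shell
#!/bin/bash
# Illustration only: report "down" only after every one of N checks fails.
# confirm_down is a hypothetical helper, not part of monit or Splunk.
# Usage: confirm_down ATTEMPTS DELAY_SECONDS CHECK_COMMAND [ARGS...]
confirm_down() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # Any single success means the service is up; stop retrying.
        if "$@" >/dev/null 2>&1; then
            return 1
        fi
        # Wait before the next attempt, except after the last one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    # Every attempt failed, so the service is considered down.
    return 0
}
```

For example, `confirm_down 3 20 /opt/splunk/bin/splunk status && echo "Splunk looks down"` would only print the message after three failed checks 20 seconds apart (the path assumes a default /opt/splunk install).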


awyszkowski
Splunk Employee

Here are some pointers for a "real world" Splunk process monitor in Monit that will restart Splunk when 'splunk status' detects that it is down.

First off, we want better downtime detection. We want to do away with pid checking and port checking, which often lead to confusion because pids can be somewhat fluid with Splunk. We also know that "normal" operation of Splunk can involve a restart (a rolling restart, an administrative restart invoked from the GUI after installing an app, etc.). It is best to use Splunk's own "splunk status" command and exploit the fact that its exit status carries meaning (0 means Splunk is running; any other status means it is not running or there was an issue determining its state).

Secondly, monit tends to want to stop the service before restarting it. This can get ugly if Splunk was actually running. So rather than using restart logic, just use 'splunk start' to get it going again ('splunk start' is effectively a no-op if Splunk is already running, as opposed to a stop followed by a start).
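
The two points above can be sketched as a small shell function. This is only an illustration under assumptions: the function name ensure_running and the SPLUNK_BIN variable are hypothetical, and the default path assumes Splunk is installed under /opt/splunk. It relies on the exit status of 'splunk status' and issues a plain 'splunk start' instead of a stop followed by a start.

```shell
#!/bin/bash
# Assumed install path; override SPLUNK_BIN if Splunk lives elsewhere.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"

# Hypothetical helper: start Splunk only when 'splunk status' exits non-zero.
ensure_running() {
    if "$SPLUNK_BIN" status >/dev/null 2>&1; then
        # Exit status 0: Splunk reports that it is running, so do nothing.
        echo "running"
    else
        # Non-zero: not running, or the state could not be determined.
        echo "down, starting"
        "$SPLUNK_BIN" start
    fi
}
```

Calling `ensure_running` by hand is a quick way to see what 'splunk status' reports before wiring the same logic into monit.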

Note: this does no alerting. It merely starts Splunk when it is detected down for two consecutive checks 5 minutes apart (you might have to tweak your settings if your global monit polling frequency is different).


Assuming you have the following setting in /etc/monitrc

# Polling frequency
set daemon 20

In /etc/monit/splunk_health.sh (new file)

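#!/bin/bash
# Run 'splunk status' and capture its output (stdout and stderr).
TEXT=$(/opt/splunk/bin/splunk status 2>&1)
# Save the exit status: 0 means Splunk is running.
STATUS=$?
# Print the captured output on stderr.
>&2 echo "$TEXT"
# Pass the exit status through to monit.
exit $STATUS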

In /etc/monit/conf.d/splunk.monitrc (probably a new file)

check program splunkd with path "/etc/monit/splunk_health.sh" every 15 cycles
    start program = "/usr/sbin/service splunk start"
    stop program = "/usr/sbin/service splunk stop"
    if status != 0 for 2 cycles then start
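
As a quick sanity check on the 5-minute figure, the interval between two runs of the health script is the global polling frequency multiplied by the "every N cycles" value. The variable names below are just for illustration; the numbers come from the settings shown in this post.

```shell
#!/bin/bash
# Values taken from the configuration in this post.
daemon_seconds=20   # 'set daemon 20' in /etc/monitrc
every_cycles=15     # 'every 15 cycles' in splunk.monitrc

# Seconds between two runs of splunk_health.sh.
check_interval=$((daemon_seconds * every_cycles))
echo "health check runs every ${check_interval} seconds ($((check_interval / 60)) minutes)"
```

If your 'set daemon' value differs, adjust 'every N cycles' so that the product still gives the interval you want.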
