Goal
Is there a better, cleaner, best-practice way to implement my current approach (a homegrown script) to manage dispatch cleanup? The script runs on each search head member (3 of them).
Bonus Goal
Currently I email a simple text report. Instead, I would like to create a Dashboard graphing the trend of script results.
If I stay with my current approach, how can I extract the following data points from my report, read them into Splunk, and then graph them on a dashboard?
The timestamp (TS)
The dispatched job file quantity (CURR_COUNT)
Please see the sample Cleanup and Nothing to Cleanup reports below.
Alternatively, if there is a better approach via searches, reports, graphs, etc., then no issue.
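For the bonus goal, one common pattern is to have the script append a single key=value line per run to a log file, and point a Splunk monitor input at that file. The sketch below is illustrative only: the metrics file path, the `ts`/`curr_count` field names, and the defaults are my assumptions, not part of the existing script.

```shell
#!/bin/bash
# Sketch only: METRICS_LOG path and field names are assumptions for illustration.
DISPATCH_DIR="${DISPATCH_DIR:-/opt/splunk/var/run/splunk/dispatch}"
METRICS_LOG="${METRICS_LOG:-/tmp/dispatch_metrics.log}"

TS="$(date +"%m-%d-%Y_%H-%M-%S")"
# Count dispatch job dirs; 2>/dev/null tolerates a missing dir on a test box,
# and tr strips the padding some wc builds add.
CURR_COUNT="$(ls -1 "${DISPATCH_DIR}" 2>/dev/null | wc -l | tr -d ' ')"

# One key=value event per run; Splunk's automatic KV extraction then exposes
# ts, host, and curr_count as searchable fields with no props/transforms work.
echo "ts=${TS} host=${HOSTNAME} curr_count=${CURR_COUNT}" >> "${METRICS_LOG}"
```

With a monitor input on that file, a dashboard panel could then be driven by a search along the lines of `sourcetype=dispatch_cleanup | timechart span=1h max(curr_count) by host` (the sourcetype name here is also an assumption).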
Script Effectiveness to date
Unless I am nuking files that I should not, or there are other logic holes, this script has been effective in keeping customer saved searches running consistently, versus before, when there were lots of complaints.
I have also increased resource values for both Splunk and the RHEL OS to help throughput.
Cron entry
*/60 * * * * su - splunk -c /opt/splunk/scripts/cleanup_dispatched_jobs.sh &> /tmp/dispatch.log
Code
#!/bin/bash
##############################################################################
# Damon Manni
# Runs in cronjob every hour every day - until better fix is in place
# must run as splunk user, not root
###############################################################################
# VARs
GO_BACK="-30m"
CEILING="1000"
# My arbitrary threshold to trigger on
SPLUNK_HOME="/opt/splunk"
SCRIPT_ROOT="${SPLUNK_HOME}/scripts"
DISPATCH_DIR="${SPLUNK_HOME}/var/run/splunk/dispatch"
HOLDING_DIR="${SPLUNK_HOME}/old-dispatch-jobs"
CLEANUP_CMD="${SPLUNK_HOME}/bin/splunk cmd splunkd clean-dispatch"
OUTPUT="${SCRIPT_ROOT}/dispatch.out"
# Functions
reset() {
# setup for a clean run. temp/log/report files
rm -rf "${SCRIPT_ROOT}"/*.out
}
check_quota() {
# simple check to see if the dispatched job file volume surpasses my arbitrary ceiling to cleanup or wait until next script run
CURR_COUNT="$(ls -1 "${DISPATCH_DIR}" | wc -l)" # I want to graph CURR_COUNT in a Dashboard. How?
# if/else instead of && || so a failed cleanup cannot fall through to bow_out
if [ "${CURR_COUNT}" -gt "${CEILING}" ]; then
cleanup
else
bow_out
fi
}
bow_out() {
# All good
echo "${TS}"
echo "Current count = ${CURR_COUNT} - no need to cleanup. Waiting until next job run"
mail -s "Cleanup Dispatch-${HOSTNAME}: ${CURR_COUNT}" jojo@thedolphin.com < /tmp/dispatch.log
exit 0
}
cleanup() {
# Triggered: high volume that can impact the job/parsing queues, etc.
temp_dir
# Redirect stdout first, then dup stderr into it; the old "2>&1 > file" order
# sent stderr to the terminal/cron log instead of the output file.
${CLEANUP_CMD} "${HOLDING_DIR}/${TS}" ${GO_BACK} > "${OUTPUT}" 2>&1
}
gen_ts() {
TS="$(date +"%m-%d-%Y_%H-%M-%S")"
}
temp_dir() {
# Create a unique dir to receive the snapshot before cleanup/delete
[ ! -d ${HOLDING_DIR}/${TS} ] && { echo "Creating holding dir..."; mkdir -p ${HOLDING_DIR}/${TS}; }
}
# Data points to help debugging/status
report() {
echo "${TS}" # I want to graph TS in a Dashboard. How?
echo "Ceiling = ${CEILING}"
echo "Current count = ${CURR_COUNT}" # I want to graph CURR_COUNT in a Dashboard. How?
echo "Holding dir = ${HOLDING_DIR}/${TS}"
echo
cat "${OUTPUT}"
echo "Tarball = ${HOLDING_DIR}/${TS}.tar.z"
cat "${OUTPUT}" /tmp/dispatch.log | mail -s "Cleanup Dispatch-${HOSTNAME}: ${CURR_COUNT}" jojo@thedolphin.com
}
squeeze() {
# compress backup to run lean
echo "Making tarball for backup..."
if tar zcvf "${HOLDING_DIR}/${TS}.tar.z" "${HOLDING_DIR}/${TS}"; then
echo "Done."; rm -rf "${HOLDING_DIR}/${TS}"
else
echo "Failed."; exit 1
fi
}
# Main
echo "${HOSTNAME}" # unquoted HOSTNAME variable, not the literal string
reset
gen_ts
check_quota
report
squeeze
exit 0
Cleanup Report
Search-Head-member-hostname
Creating holding dir
Using logging configuration at /opt/splunk/etc/log-cmdline.cfg.
03-29-2018_12-00-05
Ceiling = 1000
Current count = 1153
Holding dir = /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
dispatch dir: /opt/splunk/var/run/splunk/dispatch
destination dir: /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
earliest mod time: 2018-03-29T11:30:05.000-04:00
total: 1153, moved: 823, failed: 0, remaining: 330 job directories from /opt/splunk/var/run/splunk/dispatch to /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
Tarball = /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05.tar.z
Nothing to Cleanup Report
Search-Head-member-hostname
03-29-2018_15-00-02
Current count = 927 - no need to cleanup. Waiting until next job run
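If the existing report format stays as-is, the two data points can also be scraped back out of a saved report with grep/awk. The sketch below runs against an inlined copy of the Nothing to Cleanup sample above; the regex assumes the script's `%m-%d-%Y_%H-%M-%S` timestamp format, and in practice you would point it at the real report file instead of the here-doc.

```shell
#!/bin/bash
# Sketch: extract TS and CURR_COUNT from an existing report.
# The here-doc mirrors the "Nothing to Cleanup" sample for a self-contained demo.
REPORT="$(mktemp)"
cat > "${REPORT}" <<'EOF'
Search-Head-member-hostname
03-29-2018_15-00-02
Current count = 927 - no need to cleanup. Waiting until next job run
EOF

# TS matches the script's date format: MM-DD-YYYY_HH-MM-SS
TS="$(grep -oE '[0-9]{2}-[0-9]{2}-[0-9]{4}_[0-9]{2}-[0-9]{2}-[0-9]{2}' "${REPORT}" | head -1)"
# CURR_COUNT is the number after "Current count ="; $2+0 drops any trailing text
CURR_COUNT="$(awk -F'= ' '/^Current count/ {print $2+0; exit}' "${REPORT}")"

echo "ts=${TS} curr_count=${CURR_COUNT}"
```

The same extraction could be done at search time in Splunk with a `rex` over the indexed report text, but emitting the values as key=value pairs at write time (as the script already has them in variables) avoids the parsing step entirely.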
All help much appreciated as always.
cheers,
Damon