Dashboards & Visualizations

Can Splunk help me manage a large quantity of dispatched jobs and queue delays?

damonmanni
Path Finder

Goal

Is there a better/cleaner/best practice way to implement my current approach (which is using a homegrown script) to manage dispatch cleanup? My script runs on each Search head member (3 of them).

Bonus Goal
Currently I email a simple text report. Instead, I would like to create a Dashboard graphing the trend of script results.

If I stay with my current approach, how can I extract the following data points out of my report, read them into Splunk, and then graph them on a dashboard?
The timestamp (TS)
The dispatched job file quantity (CURR_COUNT)
Please see sample Cleanup & Nothing to cleanup Reports below
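One common way to get those two data points into Splunk is to have the script append them as a key=value line to a side log that a monitor input can index. A minimal sketch, assuming a hypothetical metrics file (`/tmp/dispatch_metrics.log`) and field names (`ts`, `curr_count`) that are not part of the original script:

```shell
#!/bin/bash
# Hypothetical sketch (not part of the original script): append each run's
# data points as a single key=value line to a side log. The file path and
# field names here are assumptions.
METRICS_LOG="${METRICS_LOG:-/tmp/dispatch_metrics.log}"
TS="$(date +"%m-%d-%Y_%H-%M-%S")"
CURR_COUNT=927   # in the real script this would come from check_quota

# key=value pairs are picked up by Splunk's automatic field extraction
echo "ts=${TS} curr_count=${CURR_COUNT} host=$(hostname)" >> "${METRICS_LOG}"
```

With a `[monitor://...]` input on that file, a simple `timechart` over `curr_count` would give the trend graph, with no report parsing needed.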

Alternatively, if there is a better approach via queries, reports, graphs, etc., that's no issue either.

Script Effectiveness to date
Unless I am nuking files that I should not (or there are other logic holes), this script has been effective at keeping customer savedsearches running consistently, versus the frequent complaints before.
I have also increased resource values for both Splunk and the RHEL OS to help throughput.

Cron entry

*/60 * * * * su - splunk -c /opt/splunk/scripts/cleanup_dispatched_jobs.sh &> /tmp/dispatch.log

Code

#!/bin/bash
###############################################################################
# Damon Manni
# Runs from cron every hour, every day - until a better fix is in place.
# Must run as the splunk user, not root.
###############################################################################

# VARs
GO_BACK="-30m"
CEILING="1000"                                   # My arbitrary threshold to trigger on
SPLUNK_HOME="/opt/splunk"
SCRIPT_ROOT="${SPLUNK_HOME}/scripts"
DISPATCH_DIR="${SPLUNK_HOME}/var/run/splunk/dispatch"
HOLDING_DIR="${SPLUNK_HOME}/old-dispatch-jobs"
CLEANUP_CMD="${SPLUNK_HOME}/bin/splunk cmd splunkd clean-dispatch"
OUTPUT="${SCRIPT_ROOT}/dispatch.out"

# Functions
reset() {
  # Set up for a clean run: temp/log/report files
  rm -rf ${SCRIPT_ROOT}/*.out
}

check_quota() {
  # Simple check: if the dispatched job file volume surpasses the arbitrary
  # ceiling, clean up; otherwise wait until the next script run.
  CURR_COUNT="$(ls -1 ${DISPATCH_DIR} | wc -l)"  # I want to graph CURR_COUNT in a Dashboard. How?
  [ ${CURR_COUNT} -gt ${CEILING} ] && cleanup || bow_out
}

bow_out() {
  # All good
  echo "${TS}"
  echo "Current count = ${CURR_COUNT} - no need to cleanup.  Waiting until next job run"
  cat /tmp/dispatch.log | mail -s "Cleanup Dispatch-${HOSTNAME}: ${CURR_COUNT}" jojo@thedolphin.com
  exit 0
}

cleanup() {
  # Triggered: high volume that can impact job/parsing queues, etc.
  temp_dir
  ${CLEANUP_CMD} ${HOLDING_DIR}/${TS} ${GO_BACK} > ${OUTPUT} 2>&1
}

gen_ts() {
  TS="$(date +"%m-%d-%Y_%H-%M-%S")"
}

temp_dir() {
  # Create a unique dir to receive a snapshot before cleanup/delete
  [ ! -d ${HOLDING_DIR}/${TS} ] && { echo "Creating holding dir..."; mkdir -p ${HOLDING_DIR}/${TS}; }
}

report() {
  # Data points to help debugging/status
  echo "${TS}"                                   # I want to graph TS in a Dashboard. How?
  echo "Ceiling = ${CEILING}"
  echo "Current count = ${CURR_COUNT}"           # I want to graph CURR_COUNT in a Dashboard. How?
  echo "Holding dir = ${HOLDING_DIR}/${TS}"
  echo
  cat "${OUTPUT}"
  echo "Tarball = ${HOLDING_DIR}/${TS}.tar.z"
  cat "${OUTPUT}" /tmp/dispatch.log | mail -s "Cleanup Dispatch-${HOSTNAME}: ${CURR_COUNT}" jojo@thedolphin.com
}

squeeze() {
  # Compress the backup to run lean
  echo "Making tarball for backup..."
  tar zcvf ${HOLDING_DIR}/${TS}.tar.z ${HOLDING_DIR}/${TS} \
    && { echo "Done."; rm -rf ${HOLDING_DIR}/${TS}; } \
    || { echo "Failed."; exit 1; }
}

# Main
echo "${HOSTNAME}"
reset
gen_ts
check_quota
report
squeeze
exit 0
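One small hardening suggestion for check_quota: `ls -1 | wc -l` counts every entry, including any stray files, and can misbehave on odd names. A sketch of a directory-only count with `find` (using a temporary directory to stand in for `$SPLUNK_HOME/var/run/splunk/dispatch`):

```shell
#!/bin/bash
# Sketch: count only per-job directories in the dispatch dir, ignoring
# stray files. The temp dir here is just a stand-in for demonstration.
DISPATCH_DIR="$(mktemp -d)"
mkdir -p "${DISPATCH_DIR}/job_1" "${DISPATCH_DIR}/job_2"
touch "${DISPATCH_DIR}/stray-file"

# -mindepth/-maxdepth 1 restricts to immediate children; -type d to dirs only
CURR_COUNT="$(find "${DISPATCH_DIR}" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')"
echo "curr_count=${CURR_COUNT}"
```

This would drop into check_quota as-is, with the real `DISPATCH_DIR`.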

Cleanup Report

Search-Head-member-hostname
Creating holding dir
Using logging configuration at /opt/splunk/etc/log-cmdline.cfg.
03-29-2018_12-00-05
Ceiling = 1000
Current count = 1153
Holding dir = /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
dispatch dir:      /opt/splunk/var/run/splunk/dispatch
destination dir:   /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
earliest mod time: 2018-03-29T11:30:05.000-04:00
total: 1153, moved: 823, failed: 0, remaining: 330 job directories from /opt/splunk/var/run/splunk/dispatch to /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05
Tarball = /opt/splunk/old-dispatch-jobs/03-29-2018_12-00-05.tar.z

Nothing to cleanup report

Search-Head-member-hostname
03-29-2018_15-00-02
Current count = 927 - no need to cleanup.  Waiting until next job run
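If the existing report text gets indexed as-is, both data points can be pulled out with regexes; the same patterns could back a Splunk `rex`-style field extraction. A shell sketch against the sample report above:

```shell
#!/bin/bash
# Sketch: extract TS and CURR_COUNT from a "nothing to cleanup" report.
# The report text is copied from the sample above.
REPORT='Search-Head-member-hostname
03-29-2018_15-00-02
Current count = 927 - no need to cleanup.  Waiting until next job run'

# Timestamp matches the script's date format: MM-DD-YYYY_HH-MM-SS
TS="$(echo "${REPORT}" | grep -Eo '[0-9]{2}-[0-9]{2}-[0-9]{4}_[0-9]{2}-[0-9]{2}-[0-9]{2}')"
# Count is the number following "Current count = "
CURR_COUNT="$(echo "${REPORT}" | sed -n 's/.*Current count = \([0-9]*\).*/\1/p')"
echo "ts=${TS} curr_count=${CURR_COUNT}"
```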

All help much appreciated as always.
cheers,
Damon
