I've begun seeing this message on a regular basis on my search head (SH). I've seen links on how to clean it up, but no real statement on 1) what causes the problem and 2) whether Splunk should be handling this on its own.
Can someone help me out? (And yes, we are experiencing a growth in users and searches right now).
You have answered your own question: there is growth in users and searches. That growth is the reason you are getting this error, because of the limits on disk space and performance (running scheduled searches).
The best solution to this is:
1. Go to the dispatch directory (/opt/splunk/var/run/splunk/dispatch).
2. Delete the old searches (delete from the bottom of the list, i.e. the oldest ones).
3. Once you do this and restart your search head, the error will disappear and you are good to continue your activities in Splunk.
Please accept and upvote the answer if it helps!
Also, Splunk should handle this itself, but in our environment we have noticed that it does not do so reliably. We have to take care of this activity ourselves, either by creating a script or by deleting the directories manually.
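As a minimal sketch of such a cleanup script, assuming a standard /opt/splunk layout: the function name, its arguments, and the idea of moving old job directories aside (rather than deleting them outright) are my own choices, not anything Splunk ships. Splunk's own "splunk clean-dispatch" command does the same job and is preferable where available.

```shell
# clean_dispatch_sketch: move per-job dispatch directories older than
# a given number of days into an archive directory, so they can be
# reviewed (and later deleted) instead of being removed immediately.
# All paths and names here are hypothetical examples.
clean_dispatch_sketch() {
    dispatch_dir="$1"   # e.g. /opt/splunk/var/run/splunk/dispatch
    archive_dir="$2"    # e.g. /opt/splunk/old-dispatch-jobs
    age_days="$3"       # move anything older than this many days
    mkdir -p "$archive_dir"
    # -mindepth/-maxdepth 1: act on the per-job directories themselves,
    # not on the individual files inside them
    find "$dispatch_dir" -mindepth 1 -maxdepth 1 -type d \
        -mtime "+$age_days" -exec mv {} "$archive_dir"/ \;
}
```

You could run this from cron, but note it pays no attention to whether a job is still running, so keep the age threshold comfortably above your longest search TTL.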
So, is that a bug that Splunk doesn't handle it? What am I deleting - output from "old" searches?
If one directory is created for each search, how can we identify the directories belonging to searches that have completed successfully? Could you please point me to any tutorials for writing a script to automate the cleaning process?
You should use "splunk clean-dispatch" to get rid of these. Usage:

    splunk clean-dispatch <destination directory where to move jobs> <latest job mod time>

For example, to move all jobs older than three days out of the dispatch directory:

    splunk clean-dispatch /opt/splunk/old-dispatch-jobs/ -3d
It seems that no one has addressed the first question from the OP: "What causes the problem?"
The answer first begins with understanding what the dispatch directory is. The artifacts of a search (read: the logs, results, sometimes intermediate results, status) land in the dispatch directory. If it helps to think of it as the "working directory", go with that. In addition, the dispatch directories can be re-read to load up the results of a search that has already run for you. This is why scheduled searches (say, to drive a dashboard) can load so quickly: the results are already there in the dispatch directory.
These directories have a lifespan directly commensurate with what they're doing. Dispatch directories created for an ad-hoc search (think of a user sitting at the search bar, typing in search terms) will persist for 10 minutes. Any scheduled search has a default TTL (time to live) of "2p", where p represents the "period", i.e. the time range of the search.
If you schedule a search to run over a 24h period (say, rolling up stats for "yesterday"), then that dispatch directory will persist for 48h. The idea here is that there should be overlap between the runtimes of these scheduled searches, so that if a run of the search is skipped for some reason, there's still a cached (albeit old) set of results.
If the search triggers some sort of action, such as sending you an email, or alerting in the Splunk GUI, the TTL will be adjusted to a certain minimum value. The idea here is that if the search triggered some kind of alert, we want to make sure we cache those results long enough for someone to come back and have a look.
Consider a monitoring style search running over a 5 minute span; this would typically live only 10 minutes before being reaped. However, if that search finds something worth alerting on, (say a host over 95% of its CPU) and thereby triggers an email action, it's going to live for 24h so that somebody can come back and review the results.
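To illustrate where these TTL knobs live, here is a hedged savedsearches.conf sketch. The stanza name is invented, and you should verify the exact setting names (dispatch.ttl in savedsearches.conf, and the per-action ttl in alert_actions.conf) against the documentation for your Splunk version:

```
# savedsearches.conf -- hypothetical example stanza
[cpu_over_95_alert]
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m
dispatch.latest_time = now
# Keep artifacts for 2 periods (the default), i.e. 10 minutes here.
# A plain number of seconds also works, e.g. dispatch.ttl = 600
dispatch.ttl = 2p
```

When an action such as email fires, that action's own ttl (set in alert_actions.conf) can override this, which is exactly how a 5-minute search ends up with artifacts that live for 24 hours.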
In environments making use of this kind of alerting mechanism, it's not uncommon to see dispatch directories pile up before their TTL is reached. Yes, it's true that you can clean the dispatch jobs, but I find that it's better to understand why they're piling up and address that fact first, else you'll just have to do the cleaning activity again and again.
There are several reasons why you may have many jobs in the dispatch directory. For example:
- an alert with 24-hour retention that runs every 5 minutes generates one job artifact in dispatch per run -> 288 jobs in the dispatch directory at any given time
- a real-time alert that triggers on average every 5 seconds, with tracking enabled, keeps each artifact for 24h -> 17,280 jobs accumulated in a single day
To avoid this, reduce how often those searches fire, or shorten their TTL.
Real-time alerts spammed our dispatch folder and ended up breaking the entire Splunk interface. We cleared /var/run/splunk/dispatch and modified the real-time alerts, and that fixed it.
For anyone not familiar with cron schedules: setting the alert's schedule to "* * * * *" should fix this problem, since the alert then runs every minute instead of in real time.
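In savedsearches.conf terms, the change described above looks roughly like this. The stanza name is invented and the exact attribute names should be checked against your version's documentation:

```
# savedsearches.conf -- hypothetical alert, before and after
# Before: real-time alert (an rt- time window keeps a long-running
# search open and adds one tracked artifact per trigger)
# [my_alert]
# dispatch.earliest_time = rt-5m
# dispatch.latest_time = rt

# After: scheduled alert that runs once a minute over the last 5 minutes
[my_alert]
cron_schedule = * * * * *
dispatch.earliest_time = -5m
dispatch.latest_time = now
```

A once-a-minute scheduled search produces far fewer, shorter-lived dispatch artifacts than a continuously triggering real-time alert.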