We have a continual issue in our environment with the $SPLUNK_HOME/var/run/dispatch directory growing out of control – constantly above 2000 directories and decreasing system performance.
There are two use cases that seem to cause the biggest issue:
1. Real-time searches that alert frequently. In this case I see that a new result (and directory) is created every 1-2 minutes. This can create hundreds of directories within a few hours. Most of these real-time alerts are already restricted to a 24-hour retention, but that doesn’t help if alerts are triggered all night: there are easily 500+ directories by the morning for just one search...
Between these two use cases, Splunk quite frequently exceeds 3000 directories.
I’m curious how other people are managing this?
In some circumstances, such as a daily search, it makes sense to retain results for 30 days.
It also makes sense for critical monitoring to have frequent alerts. However, a combination of both creates too many directories in dispatch for Splunk to operate efficiently.
Is there a mechanism to enforce job retention for a particular user role, e.g. 24 hours only?
Is there any mechanism to alter how the dispatch directory operates? Even sub folders per app or per user would really help in this case…
You should simply change the retention periods of your saved searches. They are controlled by the timeout (ttl) parameter, though depending on how the search is scheduled, there are many places the value may be set or overridden. See the savedsearches.conf (dispatch.ttl) and alert_actions.conf (ttl) files.
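As a rough sketch of what that could look like: the stanza name and values below are made up for illustration, and which ttl actually wins depends on whether an alert action fires, so check the spec files for your version before relying on this.

```ini
# savedsearches.conf (in the app that owns the search)
# "My frequent realtime alert" is a hypothetical search name.
[My frequent realtime alert]
# Lifetime, in seconds, of the search artifacts when no action is triggered.
# A trailing "p" would mean a multiple of the scheduled period instead.
dispatch.ttl = 3600

# alert_actions.conf
# When an action such as email is triggered, that action's ttl applies instead.
[email]
# Illustrative value: keep artifacts of emailed alerts for one hour.
ttl = 3600
```

Lowering the action ttl is usually what matters for alerts that fire every minute or two, since each triggered alert otherwise keeps its dispatch directory for the action's full ttl.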
As for users, you can use roles to limit the amount of space a user uses, which indirectly should limit the number of jobs they keep around.
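A minimal sketch of the role-based limit, assuming the setting meant here is srchDiskQuota in authorize.conf (the setting name is my assumption, and alert_users is a hypothetical role):

```ini
# authorize.conf
# "alert_users" is a hypothetical role name used only for illustration.
[role_alert_users]
# Maximum disk space, in MB, that search jobs of a user with this role may use.
# Illustrative value; pick one that fits your environment.
srchDiskQuota = 50
```

Note that this caps disk usage rather than retention time, so it only indirectly limits how many jobs stay in dispatch.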
Go into the app that runs the most searches or is the least useful. In its local directory, create a limits.conf and update the ttl value.
* The time to live (ttl), in seconds, of the cache for the results of a given subsearch.
* Do not set this below 120 seconds.
* See the definition in the [search] stanza under the “TTL” section for more
details on how the ttl is computed.
* Default: 300 (5 minutes)
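For illustration, the file might look something like the sketch below. The app name is hypothetical, the values are examples only, and the quoted description appears to match the ttl in the [subsearch] stanza (that stanza name is my assumption), so the [search] ttl it refers to is shown as well.

```ini
# $SPLUNK_HOME/etc/apps/<your_app>/local/limits.conf
# <your_app> is a placeholder; values below are illustrative, not recommendations.

[search]
# Time to live, in seconds, for search job artifacts (see the "TTL" section
# of the [search] stanza in the limits.conf spec for how it is computed).
ttl = 300

[subsearch]
# Time to live, in seconds, of the cached subsearch results.
# Per the spec text quoted above, do not set this below 120 seconds.
ttl = 300
```

A restart is typically needed for limits.conf changes to take effect, and it is worth confirming the merged result with btool before and after.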