We have Splunk 6.4 and are using Hunk + Hive. Our jobs produce 100,000+ files in dispatch.
What is the expected behavior of removal of files in dispatch?
I have seen older files in dispatch get removed when I run a new job (yeah!), but not always. More often than not the files stay around. I have to schedule a script to remove them, and sometimes I cannot even keep one day's worth of files.
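For anyone in the same spot, a minimal cleanup sketch along those lines, assuming the HDFS/MapR dispatch area is reachable through a local mount (e.g. the MapR NFS gateway); the mount point and the one-day age threshold are placeholders, not anything Splunk-specific:

```shell
#!/bin/sh
# Remove Hunk dispatch directories older than one day.
# DISPATCH_ROOT is a placeholder -- point it at wherever the
# HDFS/MapR dispatch area is mounted on this host.
DISPATCH_ROOT="${DISPATCH_ROOT:-/mapr/cluster/user/splunk/dispatch}"

# -mindepth/-maxdepth keep us at the per-search directory level;
# -mtime +1 matches directories not modified in the last 24 hours.
find "$DISPATCH_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +1 \
  -exec rm -rf {} +
```

Running it from cron once a day would roughly match what the post describes; on raw HDFS without a mount you'd use `hdfs dfs -ls` / `hdfs dfs -rm -r` instead.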
I assume you are referring to dispatch dirs on HDFS? If so, some of the files in the dispatch dir are deleted when the search completes, but some stay in place, so that the search head can re-read them if necessary.
Once the corresponding dispatch dir on the search head is no longer present, the dispatch dir on HDFS is eligible to be deleted. As you noted, this happens when a new search is run: a "reaper" daemon thread is launched, which crawls the HDFS dispatch area looking for dirs that no longer correspond to searches the SH is managing, and deletes them.
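The reaper's core decision boils down to a set difference between the dispatch dirs present on HDFS and the search IDs the SH is still managing. A hypothetical sketch of that logic (the function and variable names are illustrative, not Splunk's actual internals):

```python
def expired_dispatch_dirs(hdfs_dispatch_dirs, active_search_ids):
    """Return HDFS dispatch dirs whose search ID is no longer managed
    by the search head; these are eligible for deletion by the reaper."""
    return sorted(d for d in hdfs_dispatch_dirs
                  if d not in active_search_ids)

# Example: two finished searches whose SH-side dispatch dirs are gone.
on_hdfs = {"sid_001", "sid_002", "sid_003"}
on_search_head = {"sid_003"}  # only this search is still live on the SH
print(expired_dispatch_dirs(on_hdfs, on_search_head))
# -> ['sid_001', 'sid_002']
```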
Your dispatch directories could be persisting for a couple of reasons:
1) The dispatch dir on the SH still exists. The TTL for a search varies depending on different properties of the search. This blog post has some more info: http://blogs.splunk.com/2012/09/12/how-long-does-my-search-live-default-search-ttl/
2) The reaper thread is a daemon, so it will not outlive the search it is associated with. A short search may not give the reaper enough time to completely delete all expired dispatch dirs.
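If reason 1 is the culprit, the TTL can be shortened per saved search via `dispatch.ttl` in savedsearches.conf; the stanza name and value below are just placeholders:

```
[my_hive_report]
# Keep this search's dispatch artifacts for 10 minutes (600 seconds)
# instead of the default TTL.
dispatch.ttl = 600
```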
Thanks Keith. I was referring to files in HDFS.
Since there are so many files, each consuming an inode, the ones that stay around are especially noticeable.
Thanks for this information.
We saw this issue consistently with "older" versions of Hunk, and we ended up sizing the dedicated MapR volume at 3/4 of a terabyte. With 6.3.3 the dispatch directory on HDFS stays tiny regardless of query volume. Maybe something is off with 6.4...