Splunk 6.4: Hunk + Hive: Inconsistent removal of files in dispatch

burwell
SplunkTrust

We have Splunk 6.4 and are using Hunk + Hive. Our jobs produce 100,000+ files in dispatch.

What is the expected behavior of removal of files in dispatch?

I have seen older files in dispatch get removed when I run a new job (yeah!), but not always. More often than not, the files stay around. I have to schedule a script to remove them, and sometimes I cannot keep even one day's worth of files.

Thanks.


kschon_splunk
Splunk Employee

I assume you are referring to dispatch dirs on HDFS? If so, some of the files in the dispatch dir are deleted when the search completes, but some stay in place, so that the search head can re-read them if necessary.

Once the corresponding dispatch dir on the search head is no longer present, the dispatch dir on HDFS is eligible to be deleted. As you noted, this happens when a new search is run. A "reaper" daemon thread will be launched, which crawls the HDFS dispatch area, looking for dirs that no longer correspond to searches the SH is managing, and deleting them.

Your dispatch directories could be persisting for a couple of reasons:
1) The dispatch dir on the SH still exists. The TTL for a search varies depending on different properties of the search. This blog post has some more info: http://blogs.splunk.com/2012/09/12/how-long-does-my-search-live-default-search-ttl/

2) The reaper thread is a daemon, so it will not outlive the search it is associated with. A short search may not give the reaper enough time to completely delete all expired dispatch dirs.
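To make the two points above concrete, here is a minimal Python sketch of the behavior as described in this post. It is not Hunk's actual implementation: the function names, the example paths, and the assumption that each HDFS dispatch dir is named after its search ID are all hypothetical. The time budget stands in for the remaining lifetime of the search that launched the reaper.

```python
import time
from typing import Callable, Iterable, List, Set


def find_orphaned_dirs(hdfs_dirs: Iterable[str], live_sids: Set[str]) -> List[str]:
    """Return HDFS dispatch dirs whose search is no longer present on the SH."""
    # Assumption: the last path component of each dir is the search ID (sid).
    return sorted(
        d for d in hdfs_dirs
        if d.rstrip("/").split("/")[-1] not in live_sids
    )


def reap(orphans: Iterable[str],
         delete: Callable[[str], None],
         budget_seconds: float) -> List[str]:
    """Delete orphaned dirs until the time budget runs out.

    The budget models the parent search's remaining run time: a daemon
    thread stops when its process exits, so a short search leaves work undone.
    """
    deadline = time.monotonic() + budget_seconds
    deleted = []
    for path in orphans:
        if time.monotonic() >= deadline:
            break  # the search ended; remaining dirs wait for the next reaper
        delete(path)  # hypothetical callback, e.g. a wrapper around an HDFS delete
        deleted.append(path)
    return deleted
```

With a generous budget every orphaned dir is removed; with a budget of zero (a very short search) none are, which matches the "not always" cleanup described in the question.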

ddrillic
Ultra Champion

We saw this issue consistently with "older" versions of Hunk, and we ended up setting the dedicated MapR volume to 3/4 of a terabyte. With 6.3.3, the dispatch directory on HDFS stays tiny regardless of the query volume. Maybe something is off with 6.4...

burwell
SplunkTrust

Thanks Keith. I was referring to files in HDFS.

Since there are so many files, using up many inodes, I was more aware of the files staying around.

Thanks for this information.
