Splunk 6.4: Hunk + Hive: Inconsistent removal of files in dispatch

burwell
SplunkTrust

We have Splunk 6.4 and are using Hunk + Hive. Our jobs produce 100,000+ files in dispatch.

What is the expected behavior of removal of files in dispatch?

I have seen older files in dispatch get removed when I run a new job (yeah!), but not always. More often than not, the files stay around. I have to schedule a script to remove the files, and sometimes I cannot keep even one day's worth of files.

Thanks.

kschon_splunk
Splunk Employee

I assume you are referring to dispatch dirs on HDFS? If so, some of the files in the dispatch dir are deleted when the search completes, but some stay in place, so that the search head can re-read them if necessary.

Once the corresponding dispatch dir on the search head is no longer present, the dispatch dir on HDFS is eligible to be deleted. As you noted, this happens when a new search is run. A "reaper" daemon thread will be launched, which crawls the HDFS dispatch area, looking for dirs that no longer correspond to searches the SH is managing, and deleting them.
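
To illustrate the eligibility check described above, here is a minimal sketch in Python. It is not Hunk's actual code: the function name, the plain lists standing in for directory listings, and the example search IDs are all hypothetical. It only shows the comparison, where an HDFS dispatch dir with no matching dir on the search head counts as deletable.

```python
from typing import Iterable, List


def find_orphaned_dispatch_dirs(
    hdfs_dirs: Iterable[str], search_head_sids: Iterable[str]
) -> List[str]:
    """Return HDFS dispatch dir names that have no matching dir on the search head.

    Both arguments are plain name lists here; a real reaper would obtain
    them by listing HDFS and the search head's dispatch directory.
    """
    live = set(search_head_sids)
    return sorted(d for d in hdfs_dirs if d not in live)


# Hypothetical search IDs, for illustration only.
hdfs_dirs = ["1462300000.12", "1462300500.13", "1462301000.14"]
search_head_sids = ["1462301000.14"]  # only this search is still on the SH

print(find_orphaned_dispatch_dirs(hdfs_dirs, search_head_sids))
```

In this toy listing, the first two dirs would be eligible for deletion, and the third is kept because the search head still holds its dispatch dir.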

Your dispatch directories could be persisting for a couple of reasons:
1) The dispatch dir on the SH still exists. The TTL for a search varies depending on different properties of the search. This blog post has some more info: http://blogs.splunk.com/2012/09/12/how-long-does-my-search-live-default-search-ttl/

2) The reaper thread is a daemon, so it will not outlive the search it is associated with. A short search may not give the reaper enough time to completely delete all expired dispatch dirs.
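
As a rough model of reason 2, the Python sketch below shows why a short search can leave expired dirs behind. Everything in it is an assumption made for illustration: the `reap` function, the `search_done` event (which stands in for the process exiting and taking its daemon thread with it), and the list append that stands in for an HDFS delete. It does not reproduce Hunk's implementation.

```python
import threading
from typing import List


def reap(expired: List[str], deleted: List[str], search_done: threading.Event) -> None:
    """Delete expired dirs one by one, stopping as soon as the search has ended."""
    for d in expired:
        if search_done.is_set():
            return  # the daemon does not outlive its search
        deleted.append(d)  # stands in for the real HDFS delete


def run_reaper(expired: List[str], search_already_done: bool) -> List[str]:
    """Run the reaper as a daemon thread and report which dirs it removed."""
    search_done = threading.Event()
    if search_already_done:
        search_done.set()
    deleted: List[str] = []
    worker = threading.Thread(
        target=reap, args=(expired, deleted, search_done), daemon=True
    )
    worker.start()
    worker.join()
    return deleted


expired_dirs = ["dir_0", "dir_1", "dir_2"]  # hypothetical names
print(run_reaper(expired_dirs, search_already_done=False))  # long search: all removed
print(run_reaper(expired_dirs, search_already_done=True))   # search over at once: none removed
```

The real case sits between these two extremes: with 100,000+ files per job, a short search may end after only part of the backlog has been deleted, and the rest waits for the next search to launch a new reaper.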

ddrillic
Ultra Champion

We saw this issue consistently with "older" versions of Hunk and we ended up setting the dedicated MapR volume to 3/4 of a terabyte. With 6.3.3 the dispatch directory on the HDFS is being kept tiny regardless of the query volume. Maybe something is off with 6.4...

burwell
SplunkTrust

Thanks Keith. I was referring to files in HDFS.

Since there are so many files, using up many inodes, I was more aware of the files staying around.

Thanks for this information.
