
Splunk archive app: need advice on script to cleanup Hadoop data

Path Finder

We are now using Splunk archiving. I understand that there is no mechanism to delete Splunk data that has been archived to Hadoop. I would like to write a general script that deletes archived data based on date (e.g. data more than 60 days old).

Below is a sample archived directory. I need to identify and delete the directories that are older than n days, and the directory names contain timestamps. Should I recurse down to the directory that holds journal.gz, e.g. 1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz, check the timestamps 1440973083 and 1439867820, and, if they are older than n days, delete the directory db_1440973083_1439867820_1 along with its files? Or is there a better approach? Please advise.

    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1 
    -rw------- 3 splunk splunk 117 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.content-md5_9ff9fb525c137adf5aac9184b62a22f2.receipt 
    -rw------- 3 splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.valid 
    -rw------- 3 splunk splunk 14002 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/bucket-metadata.seq 
    -rw------- 3 splunk splunk 2507794 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz

Splunk Employee

Yes, that sounds basically correct. The timestamps in the directory name are [latest time]_[earliest time], so you only need to check the first one. Note that these times refer to the events in that bucket. The date the bucket was archived might have been significantly later. Also, you may want to delete any higher-level directories that are empty after you delete the buckets, both to conserve HDFS inodes, and to make Hunk split-generation marginally faster.
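
The check described above could be sketched in Python roughly as follows. This is only an illustration, not an official Splunk tool: the function names `paths_from_listing` and `expired_buckets` are made up for this example, the input is assumed to be the text output of `hdfs dfs -ls -R` with no spaces in the paths, and only bucket directories named like `db_<latest>_<earliest>_<id>` (as in the listing in the question) are recognized.

```python
import re
import time

# Bucket directory names look like db_<latest epoch>_<earliest epoch>_<id>.
# Group 1 is the bucket directory itself, even when the path points at a file inside it.
BUCKET_RE = re.compile(r"^(.*/db_(\d+)_(\d+)_\d+)(?:/.*)?$")


def paths_from_listing(listing):
    """Take the last column (the path) of each non-empty `hdfs dfs -ls -R` line.

    Assumption: paths contain no whitespace.
    """
    return [line.split()[-1] for line in listing.splitlines() if line.strip()]


def expired_buckets(paths, max_age_days, now=None):
    """Return the sorted bucket directories whose latest event is older than max_age_days.

    Only the first timestamp (latest event time) is compared, since the
    earliest event time is by definition older still.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = set()
    for path in paths:
        match = BUCKET_RE.match(path.strip())
        if match and int(match.group(2)) < cutoff:
            expired.add(match.group(1))
    return sorted(expired)
```

One cautious way to use this is to print an `hdfs dfs -rm -r <bucket directory>` command for each returned path and review the list before running anything. Parent directories left empty afterwards would still need a separate cleanup pass.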
