
Splunk archive app: need advice on script to cleanup Hadoop data

tsunamii
Path Finder

We are now using Splunk archiving. I understand that there is no built-in mechanism to delete Splunk data that has been archived to Hadoop. I would like to write a general script that deletes archived data based on date (e.g. we might want to delete data older than 60 days).

Below is a sample archived directory listing. The directory names contain timestamps, and I need to identify the directories older than n days. Would I recurse down to the directory that contains the journal.gz (e.g. 1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz), check the timestamps 1440973083 and 1439867820, and, if those are older than n days ago, delete the directory db_1440973083_1439867820_1 and its files? Please advise.

    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1 
    -rw------- 3 splunk splunk 117 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.content-md5_9ff9fb525c137adf5aac9184b62a22f2.receipt 
    -rw------- 3 splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.valid 
    -rw------- 3 splunk splunk 14002 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/bucket-metadata.seq 
    -rw------- 3 splunk splunk 2507794 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz

kschon_splunk
Splunk Employee

Yes, that sounds basically correct. The timestamps in the bucket directory name are [latest time]_[earliest time], so you only need to check the first one. Note that these times refer to the events in the bucket; the date the bucket was archived might be significantly later. Also, you may want to delete any higher-level directories that become empty after you delete the buckets, both to conserve HDFS inodes and to make Hunk split generation marginally faster.
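As a rough illustration of that check, here is a minimal Python sketch. It assumes the db_[latest]_[earliest]_[id] naming convention shown in the listing above and compares only the first (latest) epoch against the cutoff, as described; the helper name `bucket_older_than` and the deletion command at the end are my own choices, not part of any Splunk tooling.

```python
import re
import subprocess
import time

# Bucket directories are named db_[latest event time]_[earliest event time]_[id].
# Since the first number is the latest event in the bucket, a bucket is safe to
# delete once that single timestamp falls outside the retention window.
BUCKET_RE = re.compile(r"^db_(\d+)_(\d+)_\d+$")

def bucket_older_than(dirname, max_age_days, now=None):
    """Return True if every event in the bucket is older than max_age_days."""
    m = BUCKET_RE.match(dirname)
    if not m:
        return False  # not a bucket directory; leave it alone
    latest_event = int(m.group(1))
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 86400  # seconds per day
    return latest_event < cutoff

def delete_bucket(hdfs_path):
    """Remove an expired bucket directory from HDFS (recursive delete)."""
    subprocess.run(["hadoop", "fs", "-rm", "-r", hdfs_path], check=True)
```

You would walk the archive tree (e.g. by parsing `hadoop fs -ls -R` output), apply `bucket_older_than` to each db_* directory name, and call `delete_bucket` on the matches; a follow-up pass can remove any parent directories left empty, per the note above.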
