Splunk archive app: need advice on a script to clean up Hadoop data

tsunamii
Path Finder

We are now using Splunk archiving. I understand that there is no built-in mechanism to delete Splunk data once it has been archived to Hadoop. I would like to write a general script that deletes archived data based on date (e.g. delete data older than 60 days).

Below is a sample archived directory. The directory names contain timestamps, and I want to identify and delete the directories that are older than n days. Should I recurse down to the directory that contains journal.gz, e.g. 1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz, check the timestamps 1440973083 and 1439867820, and, if they are older than n days, delete the directory db_1440973083_1439867820_1 and the files in it? Or is there a better approach? Please advise.

    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1 
    -rw------- 3 splunk splunk 117 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.content-md5_9ff9fb525c137adf5aac9184b62a22f2.receipt 
    -rw------- 3 splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.valid 
    -rw------- 3 splunk splunk 14002 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/bucket-metadata.seq 
    -rw------- 3 splunk splunk 2507794 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz

kschon_splunk
Splunk Employee

Yes, that sounds basically correct. The timestamps in the bucket directory name are [latest time]_[earliest time], so you only need to check the first one: if the latest event is older than your cutoff, the earliest one is too. Note that these times refer to the events in that bucket; the date the bucket was archived might have been significantly later. Also, you may want to delete any higher-level directories that are empty after you delete the buckets, both to conserve HDFS inodes and to make Hunk split generation marginally faster.
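As a rough illustration of the check described above, here is a minimal Python sketch. It is not an official Splunk tool. It assumes the bucket directories are named db_[latest time]_[earliest time]_[id] exactly as in the listing in the question (other naming layouts would need a different pattern), and that the input lines look like that recursive listing. The function names (dirs_from_ls, bucket_latest_time, expired_buckets, rm_commands) are made up for this example. It only builds `hdfs dfs -rm -r` command strings as a dry run and deletes nothing itself.

```python
import re
import time
from typing import Iterable, Iterator, List, Optional

# Assumed bucket directory layout, taken from the listing in the question:
# db_<latest time>_<earliest time>_<id>, e.g. db_1440973083_1439867820_1
BUCKET_RE = re.compile(r"^db_(\d+)_(\d+)_(\d+)$")


def dirs_from_ls(lines: Iterable[str]) -> Iterator[str]:
    """Yield directory paths from a recursive listing like the one in the question.

    Assumes the path is the last whitespace-separated field (no spaces in paths).
    """
    for line in lines:
        fields = line.split()
        if fields and fields[0].startswith("d"):
            yield fields[-1]


def bucket_latest_time(path: str) -> Optional[int]:
    """Return the latest-event epoch time from a bucket path, or None if it is not a bucket."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    match = BUCKET_RE.match(name)
    return int(match.group(1)) if match else None


def expired_buckets(paths: Iterable[str], max_age_days: int,
                    now: Optional[float] = None) -> List[str]:
    """Return bucket paths whose latest event is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = []
    for path in paths:
        latest = bucket_latest_time(path)
        # Only the first (latest) timestamp needs checking.
        if latest is not None and latest < cutoff:
            expired.append(path.rstrip("/"))
    return expired


def rm_commands(paths: Iterable[str]) -> List[str]:
    """Build dry-run delete commands; review them before running anything."""
    return ["hdfs dfs -rm -r " + path for path in paths]
```

One way to use this would be to feed it the lines of a recursive listing of the archive directory, call expired_buckets(dirs_from_ls(lines), 60), and review the strings from rm_commands before running them. Removing parent directories that end up empty would be a separate pass after the buckets are gone.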