Getting Data In

How do I reclaim disk space after a massive delete operation?

Super Champion

How do I reclaim my disk space after deleting a large number of events from an index?

The Remove data from Splunk page says:

Currently, piping to delete does not reclaim disk space, but Splunk will be delivering a utility in a future release that reclaims the disk space--this will go through and permanently remove all the events marked by the delete operator.

Is there any other way of reclaiming this space in the meantime?

1 Solution

Super Champion

It is possible to reclaim disk space in this scenario by re-indexing the affected buckets.

Note: This may also be useful if you've deleted some sensitive information, such as a password, that really needs to be completely purged. This approach would prevent the indexed term from showing up in type-ahead, for example.

There are several steps to this process.

  1. Identify all buckets for each index that were affected by your deletion. (This alone can be a complicated task; also keep in mind that the delete command forces hot buckets to roll.)
  2. For each bucket, do the following:
    1. Export the bucket data to a .csv file.
    2. Import the .csv file into a new, empty bucket (with a temporary name/location).
    3. Optimize the new bucket.
    4. Replace the original bucket with the newly created bucket.
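Step 1 can be partially automated. A hedged sketch, relying on the fact that warm/cold bucket directories are named db_<newestTime>_<oldestTime>_<id> (epoch seconds); the time bounds and index path below are illustrative assumptions, and hot buckets (hot_v1_*) are not covered:

```shell
#!/bin/bash
# Sketch: list bucket directories whose time span overlaps the range of
# the deleted events. DEL_EARLIEST/DEL_LATEST are example epoch bounds.
DEL_EARLIEST=1262304000   # oldest deleted event (epoch seconds)
DEL_LATEST=1265000000     # newest deleted event (epoch seconds)

list_affected_buckets() {
    local indexdir="$1"
    local b name rest newest oldest
    for b in "$indexdir"/db/db_*_*_*; do
        [ -d "$b" ] || continue
        name=${b##*/}             # e.g. db_1264000000_1263000000_1
        rest=${name#db_}          # newest_oldest_id
        newest=${rest%%_*}
        rest=${rest#*_}
        oldest=${rest%%_*}
        # Overlap test: [oldest, newest] intersects [DEL_EARLIEST, DEL_LATEST]
        if [ "$oldest" -le "$DEL_LATEST" ] && [ "$newest" -ge "$DEL_EARLIEST" ]; then
            echo "$b"
        fi
    done
}
```

Usage would be something like `list_affected_buckets "$SPLUNK_HOME/var/lib/splunk/myindex"`.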


For users running on a Unix platform, the following shell script may be of use. (Note that the export and import steps are combined into a single operation using a pipe.)

#!/bin/bash
BUCKET=$1

# Be sure to compare the imported/exported event count.  They should be the same.
exporttool ${BUCKET} /dev/stdout -csv meta::all | importtool ${BUCKET}.new /dev/stdin

# Make sure that bucket .tsidx files are optimized (and merged_lexicon.lex is up to date)
splunk-optimize ${BUCKET}.new
splunk-optimize-lex ${BUCKET}.new

# Compress all rawdata files that were not gzipped by importtool
find ${BUCKET}.new/rawdata -name '[0-9]*[0-9]' -size +1k -print0 | xargs -0 -r gzip -v9

# Swap buckets
mv ${BUCKET} ${BUCKET}.old
mv ${BUCKET}.new ${BUCKET}

# Uncomment next line if you really want to remove the original bucket automatically
# rm -rf ${BUCKET}.old

Note: If you plan on using this script, please be sure to add return-code checking. You wouldn't want to remove the original bucket if the export/import failed to complete, for example.
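For example, a minimal sketch of the same steps wrapped in a function that stops at the first failed step (tool names are taken from the script above and are assumed to be on the PATH, e.g. via `splunk cmd`; verifying the exported/imported event counts is left as a manual step):

```shell
#!/bin/bash
# Sketch: rebuild one bucket, bailing out on the first error.
set -o pipefail   # so a failed exporttool also fails the pipeline

rebuild_bucket() {
    local bucket="$1"
    [ -d "$bucket" ] || { echo "no such bucket: $bucket" >&2; return 1; }

    exporttool "$bucket" /dev/stdout -csv meta::all \
        | importtool "$bucket.new" /dev/stdin || return 1

    splunk-optimize "$bucket.new"     || return 1
    splunk-optimize-lex "$bucket.new" || return 1

    find "$bucket.new/rawdata" -name '[0-9]*[0-9]' -size +1k -print0 \
        | xargs -0 -r gzip -v9        || return 1

    mv "$bucket" "$bucket.old" && mv "$bucket.new" "$bucket" || return 1
    # Keep ${bucket}.old until the imported event count has been verified.
}
```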


Other considerations:

  • Keep in mind that importtool does not respect your segmentation settings; the default segmentation is used for all imported events. For many setups this will not matter, but it is something to be aware of.
  • It's possible to lose data using this approach. This is a use-at-your-own-risk kind of operation, and you may not even reclaim all that much disk space.


Contributor

The post dates on this are 2010 -- does anyone know if that utility was ever delivered? (to reclaim space...)

I have 4.4 billion events -- an export/clean/import would be way ugly... 😉

Contributor

As of December 2011, Splunk 4.2.5 still does not provide this functionality. The docs still say, "Note: Piping to delete does not reclaim disk space." I've heard this is still on the roadmap, but it's not yet available.


Path Finder

I'm not sure exactly what you want to do, but if you're deleting most of an index and the original log files are still around, you'd probably be better off deleting the index and re-indexing just the events you want to keep:

$SPLUNK_HOME/bin/splunk stop

$SPLUNK_HOME/bin/splunk clean eventdata -index myindex

$SPLUNK_HOME/bin/splunk start

Super Champion

Yes, the link to the docs in the question mentions that option too. If you want to delete almost everything in an index, this would work, but it is NOT something you would want to do after running Splunk for any considerable length of time. Also remember that re-indexing the log files counts against your license usage, and you have to use tricks to get Splunk to re-read the log files you want to keep.
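One such trick, sketched here as a dry run: `splunk add oneshot` indexes a file unconditionally rather than consulting the fishbucket, so it can re-read files Splunk has already seen. The directory and index name below are illustrative assumptions:

```shell
#!/bin/bash
# Dry-run sketch: print the "splunk add oneshot" commands that would
# re-index the kept log files after "splunk clean eventdata".
print_reindex_cmds() {
    local dir="$1" index="$2" f
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue   # glob may match nothing
        echo "${SPLUNK_HOME:-/opt/splunk}/bin/splunk add oneshot \"$f\" -index $index"
    done
}
```

Once the printed commands look right, they could be piped through `sh` to run for real.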


