Splunk Enterprise

What files from an index bucket does Hunk archive to HDFS or S3?

Jeremiah
Motivator

When Hunk archives data from a Splunk bucket to HDFS or S3, what exactly is it archiving? The entire bucket? Or just the rawdata file? Is there a formula we can use to calculate the amount of storage we would need in HDFS/S3 based on our bucket sizes and retention periods?

1 Solution

cvervais
Path Finder

Just the raw data file. Sadly, there's not a search you can do to get a picture of how much space the raw data uses. I ended up whipping up a shell script to pull that data off indexers directly.
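For anyone who needs to do the same, a minimal sketch of such a script (assumptions: a default $SPLUNK_DB layout and GNU find; point INDEX_DB at each index's db directory on your indexers):

```shell
#!/bin/sh
# Sketch: total the rawdata (journal.gz) bytes under one index's bucket directory.
# INDEX_DB is an assumed default path -- adjust for your deployment.
INDEX_DB="${1:-/opt/splunk/var/lib/splunk/defaultdb/db}"
find "$INDEX_DB" -type f -name 'journal.gz' -printf '%s\n' \
  | awk '{ total += $1 } END { printf "rawdata total: %d bytes\n", total }'
```

Run it per index (or point it at the parent directory to cover all indexes at once) and sum across indexers.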

I've asked Splunk for an enhancement to make this more visible.


rdagan_splunk
Splunk Employee

I ran a test with a JSON file, Hunkdata.json:

Before Splunk indexing:
671 MB

After Splunk indexing (rawdata + index files):
463 MB, about 70% of the original file

After archiving to HDFS (rawdata + a few metadata files):
157 MB, about 33% of the size on the Splunk indexer
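For the record, those percentages fall straight out of simple division; a quick check of the numbers above:

```python
# Sanity-checking the ratios above (sizes in MB, from this test)
original, indexed, archived = 671, 463, 157
print(f"indexed vs. original: {indexed / original:.0%}")   # 69%, i.e. "about 70%"
print(f"archived vs. indexed: {archived / indexed:.0%}")   # 34%, roughly a third
```

Results will vary with your data, of course; JSON tends to compress well, so sparser or pre-compressed data will land at different ratios.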


rdagan_splunk
Splunk Employee

If you use the Hadoop Connect app, you might be able to get a picture of how much space the raw data uses. Hadoop Connect includes the hdfs search command, so you can use | hdfs lsr to calculate the space files are consuming in HDFS.
The last example in this blog post gives a guideline on how to create such a search: http://blogs.splunk.com/2012/12/20/connecting-splunk-and-hadoop/
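Building on that, a search along these lines could total the archived footprint (a sketch only: the HDFS path is a placeholder, and the size field name is an assumption about what hdfs lsr returns, so verify it against your Hadoop Connect version):

```
| hdfs lsr "hdfs://namenode:8020/splunk_archive"
| stats sum(size) AS total_bytes
| eval total_gb = round(total_bytes / 1024 / 1024 / 1024, 2)
```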


Jeremiah
Motivator

Thanks! Is it stored compressed?


cvervais
Path Finder

Yup, it's stored compressed on the indexer and I'm 99% sure it stays compressed over in HDFS.


csharp_splunk
Splunk Employee

Yes, it's still compressed. Note that journal.gz is not just raw data, it's the journal of what gets written to the bucket, so it also contains metadata (not the lexicon) and is sufficient to rebuild the entire bucket. We also archive some of the .dat files out of the bucket as well. As a rule of thumb, we'll copy about 30-40% of the size of the bucket to HDFS.
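That rule of thumb gives the sizing formula asked about above. A back-of-envelope sketch (not an official Splunk formula; the 30-40% ratio comes from this thread, and the daily volume and retention figures are placeholders for your own numbers):

```python
# Back-of-envelope HDFS/S3 sizing from the ~30-40% rule of thumb above.
def estimate_archive_gb(daily_bucket_gb, retention_days, archive_ratio):
    """Archived footprint ~= total on-disk bucket size * archive ratio."""
    return daily_bucket_gb * retention_days * archive_ratio

daily_bucket_gb = 500   # placeholder: bucket growth per day (one copy per bucket)
retention_days = 90     # placeholder: archive retention period
low = estimate_archive_gb(daily_bucket_gb, retention_days, 0.30)
high = estimate_archive_gb(daily_bucket_gb, retention_days, 0.40)
print(f"estimated archive size: {low:.0f}-{high:.0f} GB")
# -> estimated archive size: 13500-18000 GB
```

Since archiving is cluster aware (see below), you only count one copy per bucket, not one per replica.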


Jeremiah
Motivator

Great, so basically it is the same data that would be archived by a cold2frozen script? Is it cluster aware, i.e., a cluster of indexers won't archive multiple copies of the same buckets to S3?


csharp_splunk
Splunk Employee

Well, the cold2frozen script has the option of doing whatever it wants with the bucket. Yes, it is cluster aware. This is a big portion of the investment in this feature, BTW, this stuff ain't simple :).


Jeremiah
Motivator

Right, I was using the comparison between the two only in terms of the files you would typically archive with the cold2frozen script and the files archived by hunk. Thanks for the info!
