Splunk Enterprise

What files from an index bucket does Hunk archive to HDFS or S3?

Jeremiah
Motivator

When Hunk archives data from a Splunk bucket to HDFS or S3, what exactly is it archiving? The entire bucket? Or just the rawdata file? Is there a formula we can use to calculate the amount of storage we would need in HDFS/S3 based on our bucket sizes and retention periods?

1 Solution

cvervais
Path Finder

Just the raw data file. Sadly, there's not a search you can do to get a picture of how much space the raw data uses. I ended up whipping up a shell script to pull that data off indexers directly.
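For anyone who needs to do the same, a minimal sketch of such a script (assumptions: a default $SPLUNK_DB layout and GNU find; point INDEX_DB at each index's db directory on your indexers):

```shell
#!/bin/sh
# Sketch: total the rawdata (journal.gz) bytes under one index's bucket directory.
# INDEX_DB is an assumed default path -- adjust for your deployment.
INDEX_DB="${1:-/opt/splunk/var/lib/splunk/defaultdb/db}"
find "$INDEX_DB" -type f -name 'journal.gz' -printf '%s\n' \
  | awk '{ total += $1 } END { printf "rawdata total: %d bytes\n", total }'
```

Run it per index (or point it at the parent directory to cover all indexes at once) and sum across indexers.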

I've asked Splunk for an enhancement to make this more visible.


rdagan_splunk
Splunk Employee

I ran a test with a JSON file, Hunkdata.json:

Before Splunk indexing:
671 MB

After Splunk indexing (rawdata + index files):
463 MB, about 70% of the original file

After archiving to HDFS (rawdata + a few metadata files):
157 MB, about 33% of the size on the Splunk indexer
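For the record, those percentages fall straight out of simple division; a quick check of the numbers above:

```python
# Sanity-checking the ratios above (sizes in MB, from this test)
original, indexed, archived = 671, 463, 157
print(f"indexed vs. original: {indexed / original:.0%}")   # 69%, i.e. "about 70%"
print(f"archived vs. indexed: {archived / indexed:.0%}")   # 34%, roughly a third
```

Results will vary with your data, of course; JSON tends to compress well, so sparser or pre-compressed data will land at different ratios.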


rdagan_splunk
Splunk Employee

If you use the Hadoop Connect app, you might be able to get a picture of how much space the raw data uses. Hadoop Connect includes the hdfs search command, so you can use | hdfs lsr to calculate the space files are consuming in HDFS.
The last example in this blog post gives a guideline on how to create such a search: http://blogs.splunk.com/2012/12/20/connecting-splunk-and-hadoop/
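Building on that, a search along these lines could total the archived footprint (a sketch only: the HDFS path is a placeholder, and the size field name is an assumption about what hdfs lsr returns, so verify it against your Hadoop Connect version):

```
| hdfs lsr "hdfs://namenode:8020/splunk_archive"
| stats sum(size) AS total_bytes
| eval total_gb = round(total_bytes / 1024 / 1024 / 1024, 2)
```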


Jeremiah
Motivator

Thanks! Is it stored compressed?


cvervais
Path Finder

Yup, it's stored compressed on the indexer and I'm 99% sure it stays compressed over in HDFS.


csharp_splunk
Splunk Employee

Yes, it's still compressed. Note that journal.gz is not just raw data, it's the journal of what gets written to the bucket, so it also contains metadata (not the lexicon) and is sufficient to rebuild the entire bucket. We also archive some of the .dat files out of the bucket as well. As a rule of thumb, we'll copy about 30-40% of the size of the bucket to HDFS.
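That rule of thumb gives the sizing formula asked about above. A back-of-envelope sketch (not an official Splunk formula; the 30-40% ratio comes from this thread, and the daily volume and retention figures are placeholders for your own numbers):

```python
# Back-of-envelope HDFS/S3 sizing from the ~30-40% rule of thumb above.
def estimate_archive_gb(daily_bucket_gb, retention_days, archive_ratio):
    """Archived footprint ~= total on-disk bucket size * archive ratio."""
    return daily_bucket_gb * retention_days * archive_ratio

daily_bucket_gb = 500   # placeholder: bucket growth per day (one copy per bucket)
retention_days = 90     # placeholder: archive retention period
low = estimate_archive_gb(daily_bucket_gb, retention_days, 0.30)
high = estimate_archive_gb(daily_bucket_gb, retention_days, 0.40)
print(f"estimated archive size: {low:.0f}-{high:.0f} GB")
# -> estimated archive size: 13500-18000 GB
```

Since archiving is cluster aware (see below), you only count one copy per bucket, not one per replica.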


Jeremiah
Motivator

Great, so basically it is the same data that would be archived by a cold2frozen script? Is it cluster aware, i.e., a cluster of indexers won't archive multiple copies of the same buckets to S3?


csharp_splunk
Splunk Employee

Well, the cold2frozen script has the option of doing whatever it wants with the bucket. Yes, it is cluster aware. This is a big portion of the investment in this feature, BTW, this stuff ain't simple :).


Jeremiah
Motivator

Right, I was using the comparison between the two only in terms of the files you would typically archive with the cold2frozen script and the files archived by hunk. Thanks for the info!
