Splunk Enterprise

What is the best way to estimate frozen storage sizing needs?

vanderaj2
Path Finder

Hello All,

I'm trying to assess some offline storage needs for archiving old Splunk data. I'm planning to adjust my retention policy to 90 days for hot-warm-cold (i.e. "online", searchable data) and then have anything older than 90 days sent to NAS as "frozen", to be stored there for 1 year.

My storage guy is asking how much storage I need on the NAS to cover 1 year of frozen data. My understanding is that compressed, raw events are what would be sent to frozen, if you specify a frozen path or a script.

How does one go about estimating the size of the raw, compressed events?

I have an indexer cluster, comprised of 2 indexers. Should I plan to double whatever the storage estimate is, to account for frozen data coming from 2 indexers?

Thank you in advance!

vanderaj2
Path Finder

Thank you both for weighing in! I also have a follow-on question to the Splunk community:

Does anyone know whether during the compression of the raw data, Splunk does any data deduplication to reduce storage overhead? Just curious.....

s2_splunk
Splunk Employee

We don't deduplicate anything. The raw data file (journal.gz) is a gzip archive of individually compressed 128KB data slices.

vanderaj2
Path Finder

Thank you! Appreciate the info on this....

s2_splunk
Splunk Employee

To get started, plug your numbers into a storage sizing calculator; it will give you your estimated storage needs based on "normal" compression assumptions (journal.gz ≈ 15% of raw).
Note that if you are in a cluster, every indexer will freeze its own buckets, so you will have RF * raw on your archive volume. You can create a script that identifies replicated bucket archives and deletes all but one copy to minimize your storage needs.
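That arithmetic can be sketched as a quick back-of-the-envelope calculation. This is illustrative only: the 15% compression ratio is the "normal" assumption from this answer (real ratios vary by data type), and the function name and example figures are made up for the sketch.

```python
# Rough frozen-storage estimate; assumes journal.gz ~= 15% of raw ingest
# and that every cluster peer freezes its own bucket copy, so the
# replication factor multiplies the total on a shared archive volume.
def frozen_storage_gb(daily_ingest_gb, frozen_retention_days,
                      replication_factor=2, compression_ratio=0.15):
    return (daily_ingest_gb * compression_ratio
            * frozen_retention_days * replication_factor)

# Example: 100 GB/day ingested, 1 year of frozen retention, RF=2
print(frozen_storage_gb(100, 365))  # -> 10950.0 GB, i.e. roughly 11 TB
```

With a cleanup script that keeps only one copy per bucket (see below in the thread), you would pass replication_factor=1 instead.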

jkat54
SplunkTrust

By default, Splunk stores both the replicated buckets and the searchable copies if coldToFrozenDir is specified. Therefore you can assume the following equation:

 (daily ingestion volume * 0.35 * search factor) + (daily ingestion volume * 0.15 * replication factor) = total storage needed

Total storage needed / number of peers = storage per peer.

s2_splunk
Splunk Employee

Not quite. Index and metadata files are not frozen; only rawdata (journal.gz) is.
So (daily ingestion volume * 0.15 * replication factor) = total storage needed is the best approximation.
This can be reduced to just ingestion * 0.15 if replicated buckets are deleted after freezing via a customer-provided script.
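Plugging example numbers into that corrected formula (a hedged sketch: 100 GB/day and RF=2 are illustrative values, and 0.15 is the "normal" compression assumption quoted earlier in the thread):

```python
# Corrected estimate: only rawdata (journal.gz, ~15% of raw) is frozen;
# index and metadata files are discarded at freeze time.
# dedup=True models a post-freeze script that keeps one copy per bucket.
def frozen_daily_gb(daily_ingest_gb, replication_factor, dedup=False):
    copies = 1 if dedup else replication_factor
    return daily_ingest_gb * 0.15 * copies

print(frozen_daily_gb(100, 2))              # -> 30.0 GB/day cluster-wide
print(frozen_daily_gb(100, 2, dedup=True))  # -> 15.0 GB/day after dedup
```

Multiply the daily figure by the frozen retention period (365 days here) to answer the NAS sizing question.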

jkat54
SplunkTrust

Oh, my mistake. I was thinking of the rb and db directories, which contain the raw data as you said.

So in a cluster, both the replicated copies and the original copies get frozen.
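A minimal sketch of the kind of cleanup script mentioned above. This is not Splunk-provided tooling, and it assumes frozen buckets land in one shared archive path and keep their db_*/rb_* directory naming with a common `<latest>_<earliest>_<id>` core; verify the naming on your own archive and run with dry_run=True first.

```python
import os
import shutil

# Hypothetical cleanup: drop replicated ("rb_*") frozen buckets whenever an
# original ("db_*") copy of the same bucket span/id already exists.
# Assumed naming: db_<latest>_<earliest>_<id>[_<guid>] and rb_<...> siblings.
def dedupe_frozen(archive_root, dry_run=True):
    originals, replicas = set(), []
    for name in os.listdir(archive_root):
        parts = name.split("_")
        if len(parts) < 4:
            continue                      # not a bucket directory
        key = tuple(parts[1:4])           # (latest, earliest, local id)
        if parts[0] == "db":
            originals.add(key)
        elif parts[0] == "rb":
            replicas.append((key, name))
    for key, name in replicas:
        if key in originals:              # a db_ copy survives, rb_ is redundant
            path = os.path.join(archive_root, name)
            print(("DRY RUN: would remove " if dry_run else "removing ") + path)
            if not dry_run:
                shutil.rmtree(path)
```

Run it against the NAS archive path after each freeze cycle (e.g. from cron) to keep a single copy per bucket.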

lfedak_splunk
Splunk Employee

Hey @vanderaj2, here's some documentation on planning your capacity: http://docs.splunk.com/Documentation/Splunk/6.6.3/Capacity/Estimateyourstoragerequirements. It says that "typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affects this value."
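Those documented ratios translate into a per-ingest range for the searchable (hot/warm/cold) tier roughly like this. A sketch only: it uses the quoted 10% rawdata figure and the 10%-110% index-file range verbatim, and real ratios depend heavily on the data.

```python
# Searchable (hot/warm/cold) size per raw ingest, per the quoted doc:
# rawdata ~10% of raw; index files ~10%-110% of the rawdata file.
def searchable_range_gb(raw_ingest_gb):
    rawdata = raw_ingest_gb * 0.10
    return rawdata + rawdata * 0.10, rawdata + rawdata * 1.10

low, high = searchable_range_gb(100)
print(round(low, 1), round(high, 1))  # -> 11.0 21.0 (GB per 100 GB of raw)
```

Note the contrast with frozen sizing: the index-file portion is dropped at freeze time, so only the rawdata term matters for the NAS estimate.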
