Splunk Enterprise

What is the best way to estimate frozen storage sizing needs?

vanderaj2
Path Finder

Hello All,

I'm trying to assess some offline storage needs for archiving old Splunk data. I'm planning to adjust my retention policy to 90 days for hot-warm-cold (i.e. "online", searchable data) and then have anything older than 90 days sent to NAS as "frozen", to be stored there for 1 year.

My storage guy is asking how much storage I need on the NAS to cover 1 year of frozen data. My understanding is that compressed, raw events are what would be sent to frozen, if you specify a frozen path or a script.

How does one go about estimating the size of the raw, compressed events?

I have an indexer cluster, comprised of 2 indexers. Should I plan to double whatever the storage estimate is, to account for frozen data coming from 2 indexers?

Thank you in advance!

vanderaj2
Path Finder

Thank you both for weighing in! I also have a follow-on question to the Splunk community:

Does anyone know whether during the compression of the raw data, Splunk does any data deduplication to reduce storage overhead? Just curious.....

s2_splunk
Splunk Employee

We don't deduplicate anything. The raw data file (journal.gz) is a gzip archive of individually compressed 128KB data slices.

vanderaj2
Path Finder

Thank you! Appreciate the info on this....

s2_splunk
Splunk Employee

To get started, plug your numbers into a storage sizing calculator; it will give you your estimated storage needs based on "normal" compression assumptions (journal.gz ≈ 15% of raw).
Note that if you are in a cluster, every indexer will freeze its own buckets, so you will have RF * raw on your archive volume. You can create a script that identifies replicated bucket archives and deletes all but one copy to minimize your storage needs.
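That arithmetic can be sketched as a quick back-of-the-envelope calculation. This is illustrative only: the 15% compression ratio is the "normal" assumption from this answer (real ratios vary by data type), and the function name and example figures are made up for the sketch.

```python
# Rough frozen-storage estimate; assumes journal.gz ~= 15% of raw ingest
# and that every cluster peer freezes its own bucket copy, so the
# replication factor multiplies the total on a shared archive volume.
def frozen_storage_gb(daily_ingest_gb, frozen_retention_days,
                      replication_factor=2, compression_ratio=0.15):
    return (daily_ingest_gb * compression_ratio
            * frozen_retention_days * replication_factor)

# Example: 100 GB/day ingested, 1 year of frozen retention, RF=2
print(frozen_storage_gb(100, 365))  # -> 10950.0 GB, i.e. roughly 11 TB
```

With a cleanup script that keeps only one copy per bucket (see below in the thread), you would pass replication_factor=1 instead.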

jkat54
SplunkTrust

By default, Splunk stores both the replicated buckets and the searchable copies if coldToFrozenDir is specified. Therefore you can assume the following equation:

 (daily ingestion volume * 0.35 * search factor) + (daily ingestion volume * 0.15 * replication factor) = total storage needed

Total storage needed / number of peers = storage per peer.

s2_splunk
Splunk Employee

Not quite. Index and metadata files are not frozen; only rawdata (journal.gz) is.
So (daily ingestion volume * 0.15 * replication factor) = total storage needed is the best approximation.
This can be reduced to just ingestion * 0.15 if replicated buckets are deleted after freezing via a customer-provided script.
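Plugging example numbers into that corrected formula (a hedged sketch: 100 GB/day and RF=2 are illustrative values, and 0.15 is the "normal" compression assumption quoted earlier in the thread):

```python
# Corrected estimate: only rawdata (journal.gz, ~15% of raw) is frozen;
# index and metadata files are discarded at freeze time.
# dedup=True models a post-freeze script that keeps one copy per bucket.
def frozen_daily_gb(daily_ingest_gb, replication_factor, dedup=False):
    copies = 1 if dedup else replication_factor
    return daily_ingest_gb * 0.15 * copies

print(frozen_daily_gb(100, 2))              # -> 30.0 GB/day cluster-wide
print(frozen_daily_gb(100, 2, dedup=True))  # -> 15.0 GB/day after dedup
```

Multiply the daily figure by the frozen retention period (365 days here) to answer the NAS sizing question.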

jkat54
SplunkTrust

Oh, my mistake. I was thinking of the rb and db directories, which contain the raw data as you said.

So in a cluster, both the replicated copies and the original copies get frozen.
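A minimal sketch of the kind of cleanup script mentioned above. This is not Splunk-provided tooling, and it assumes frozen buckets land in one shared archive path and keep their db_*/rb_* directory naming with a common `<latest>_<earliest>_<id>` core; verify the naming on your own archive and run with dry_run=True first.

```python
import os
import shutil

# Hypothetical cleanup: drop replicated ("rb_*") frozen buckets whenever an
# original ("db_*") copy of the same bucket span/id already exists.
# Assumed naming: db_<latest>_<earliest>_<id>[_<guid>] and rb_<...> siblings.
def dedupe_frozen(archive_root, dry_run=True):
    originals, replicas = set(), []
    for name in os.listdir(archive_root):
        parts = name.split("_")
        if len(parts) < 4:
            continue                      # not a bucket directory
        key = tuple(parts[1:4])           # (latest, earliest, local id)
        if parts[0] == "db":
            originals.add(key)
        elif parts[0] == "rb":
            replicas.append((key, name))
    for key, name in replicas:
        if key in originals:              # a db_ copy survives, rb_ is redundant
            path = os.path.join(archive_root, name)
            print(("DRY RUN: would remove " if dry_run else "removing ") + path)
            if not dry_run:
                shutil.rmtree(path)
```

Run it against the NAS archive path after each freeze cycle (e.g. from cron) to keep a single copy per bucket.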

lfedak_splunk
Splunk Employee

Hey @vanderaj2, here's some documentation on planning your capacity: http://docs.splunk.com/Documentation/Splunk/6.6.3/Capacity/Estimateyourstoragerequirements. It says that "typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affects this value."
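Those documented ratios translate into a per-ingest range for the searchable (hot/warm/cold) tier roughly like this. A sketch only: it uses the quoted 10% rawdata figure and the 10%-110% index-file range verbatim, and real ratios depend heavily on the data.

```python
# Searchable (hot/warm/cold) size per raw ingest, per the quoted doc:
# rawdata ~10% of raw; index files ~10%-110% of the rawdata file.
def searchable_range_gb(raw_ingest_gb):
    rawdata = raw_ingest_gb * 0.10
    return rawdata + rawdata * 0.10, rawdata + rawdata * 1.10

low, high = searchable_range_gb(100)
print(round(low, 1), round(high, 1))  # -> 11.0 21.0 (GB per 100 GB of raw)
```

Note the contrast with frozen sizing: the index-file portion is dropped at freeze time, so only the rawdata term matters for the NAS estimate.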
