<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the best way to estimate frozen storage sizing needs? in Splunk Enterprise</title>
    <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311898#M7266</link>
    <description>&lt;P&gt;Thank you!  Appreciate the info on this....&lt;/P&gt;</description>
    <pubDate>Wed, 06 Sep 2017 18:08:04 GMT</pubDate>
    <dc:creator>vanderaj2</dc:creator>
    <dc:date>2017-09-06T18:08:04Z</dc:date>
    <item>
      <title>What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311893#M7261</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;

&lt;P&gt;I'm trying to assess some offline storage needs for archiving old Splunk data.  I'm planning to adjust my retention policy to 90 days for hot-warm-cold (i.e. "online", searchable data) and then have anything older than 90 days sent to NAS as "frozen", to be stored there for 1 year.&lt;/P&gt;

&lt;P&gt;My storage guy is asking how much storage I need on the NAS to cover 1 year of frozen data.  My understanding is that compressed, raw events are what would be sent to frozen, if you specify a frozen path or a script.  &lt;/P&gt;

&lt;P&gt;How does one go about estimating the size of the raw, compressed events?&lt;/P&gt;

&lt;P&gt;I have an indexer cluster consisting of 2 indexers. Should I plan to double whatever the storage estimate is, to account for frozen data coming from 2 indexers?&lt;/P&gt;

&lt;P&gt;Thank you in advance! &lt;/P&gt;</description>
      <pubDate>Thu, 31 Aug 2017 19:29:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311893#M7261</guid>
      <dc:creator>vanderaj2</dc:creator>
      <dc:date>2017-08-31T19:29:09Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311894#M7262</link>
      <description>&lt;P&gt;Hey @vanderaj2, Here's some documentation on planning your capacity: &lt;A href="http://docs.splunk.com/Documentation/Splunk/6.6.3/Capacity/Estimateyourstoragerequirements"&gt;http://docs.splunk.com/Documentation/Splunk/6.6.3/Capacity/Estimateyourstoragerequirements&lt;/A&gt;. It says that "typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affect this value. "&lt;/P&gt;</description>
      <pubDate>Thu, 31 Aug 2017 19:44:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311894#M7262</guid>
      <dc:creator>lfedak_splunk</dc:creator>
      <dc:date>2017-08-31T19:44:02Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311895#M7263</link>
      <description>&lt;P&gt;To get started, plug your numbers in &lt;A href="https://splunk-sizing.appspot.com/"&gt;here&lt;/A&gt; and it will give you your estimated storage needs based on "normal" compression assumptions (journal.gz = 15% of raw).&lt;BR /&gt;
Note that if you are in a cluster, every indexer will freeze its own buckets, so you will have RF*raw on your archive volume. You can create a script that identifies replicated bucket archives and deletes all but one copy to minimize your storage need.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Aug 2017 20:54:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311895#M7263</guid>
      <dc:creator>s2_splunk</dc:creator>
      <dc:date>2017-08-31T20:54:44Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311896#M7264</link>
      <description>&lt;P&gt;Thank you both for weighing in!  I also have a follow-on question to the Splunk community:&lt;/P&gt;

&lt;P&gt;Does anyone know whether during the compression of the raw data, Splunk does any data deduplication to reduce storage overhead?  Just curious.....&lt;/P&gt;</description>
      <pubDate>Tue, 05 Sep 2017 18:32:17 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311896#M7264</guid>
      <dc:creator>vanderaj2</dc:creator>
      <dc:date>2017-09-05T18:32:17Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311897#M7265</link>
      <description>&lt;P&gt;We don't deduplicate anything. The raw data file (journal.gz) is a ZIP file of zipped 128kb data slices. &lt;/P&gt;</description>
      <pubDate>Tue, 05 Sep 2017 19:37:34 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311897#M7265</guid>
      <dc:creator>s2_splunk</dc:creator>
      <dc:date>2017-09-05T19:37:34Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311898#M7266</link>
      <description>&lt;P&gt;Thank you!  Appreciate the info on this....&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2017 18:08:04 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311898#M7266</guid>
      <dc:creator>vanderaj2</dc:creator>
      <dc:date>2017-09-06T18:08:04Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311899#M7267</link>
      <description>&lt;P&gt;By default Splunk stores the replicated buckets and the searchable copies if coldToFrozenDir is specified.  Therefore you can assume the following equation:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; (Daily Ingestion Volume * 0.35 * search factor) + (Daily ingestion Volume * 0.15 * replication factor) = total storage needed
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Total storage needed / number of peers = storage per peer.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2017 18:13:12 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311899#M7267</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2017-09-06T18:13:12Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311900#M7268</link>
      <description>&lt;P&gt;Not quite. Index and metadata files are not frozen, only rawdata (journal.gz) is. &lt;BR /&gt;
So &lt;CODE&gt;(Daily ingestion volume * 0.15 * replication factor) = total storage needed&lt;/CODE&gt; is the best approximation.&lt;BR /&gt;
This can be reduced to just ingestion * 0.15 if replicated buckets are deleted after freezing via a customer-provided script.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2017 18:29:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311900#M7268</guid>
      <dc:creator>s2_splunk</dc:creator>
      <dc:date>2017-09-06T18:29:09Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to estimate frozen storage sizing needs?</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311901#M7269</link>
      <description>&lt;P&gt;Oh my mistake I'm thinking of the rb and db files which are the raw data as you said...&lt;/P&gt;

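&lt;P&gt;For illustration, plugging assumed numbers (100 GB/day ingested, replication factor 3) into the rawdata-only estimate above:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;100 GB/day * 0.15 * 3 (RF) = 45 GB/day frozen
45 GB/day * 365 days = 16,425 GB (roughly 16 TB) for 1 year of retention
&lt;/CODE&gt;&lt;/PRE&gt;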
&lt;P&gt;So in a cluster, both the replicated copies and the original copies get copied to frozen storage.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2017 19:01:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/What-is-the-best-way-to-estimate-frozen-storage-sizing-needs/m-p/311901#M7269</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2017-09-06T19:01:51Z</dc:date>
    </item>
  </channel>
</rss>