Splunk Enterprise

What is the best way to estimate frozen storage sizing needs?

vanderaj2
Path Finder

Hello All,

I'm trying to assess some offline storage needs for archiving old Splunk data. I'm planning to adjust my retention policy to 90 days for hot-warm-cold (i.e. "online", searchable data) and then have anything older than 90 days sent to NAS as "frozen", to be stored there for 1 year.

My storage guy is asking how much storage I need on the NAS to cover 1 year of frozen data. My understanding is that compressed, raw events are what would be sent to frozen, if you specify a frozen path or a script.

How does one go about estimating the size of the raw, compressed events?

I have an indexer cluster, comprised of 2 indexers. Should I plan to double whatever the storage estimate is, to account for frozen data coming from 2 indexers?

Thank you in advance!

1 Solution

s2_splunk
Splunk Employee

To get started, plug your numbers in here and it will give you your estimated storage needs based on "normal" compression assumptions (journal.gz = 15% of raw).
Note that if you are in a cluster, every indexer will freeze its own buckets, so you will have RF*raw on your archive volume. You can create a script that identifies replicated bucket archives and deletes all but one copy to minimize your storage need.
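A cleanup script along those lines could be sketched as follows. This is an illustration, not a supported Splunk tool: it assumes frozen buckets land as directories named db_<newest>_<oldest>_<id>[_<guid>] (originals) and rb_<newest>_<oldest>_<id>[_<guid>] (replicated copies), so that everything after the db_/rb_ prefix identifies one logical bucket. Verify the naming your deployment actually produces before deleting anything.

```python
import os
import re
import shutil

# Assumed frozen bucket directory names (verify on your own archive volume):
#   db_<newest>_<oldest>_<id>[_<guid>]  -> original copies
#   rb_<newest>_<oldest>_<id>[_<guid>]  -> replicated copies
# Everything after the db_/rb_ prefix identifies one logical bucket.
BUCKET_RE = re.compile(r"^(db|rb)_(\d+_\d+_\d+.*)$")

def dedupe_frozen(archive_dir):
    """Delete replicated frozen-bucket copies, keeping one copy per bucket."""
    seen = {}      # bucket identity -> path of the copy we kept
    removed = []
    for name in sorted(os.listdir(archive_dir)):   # 'db_' sorts before 'rb_'
        m = BUCKET_RE.match(name)
        if not m:
            continue
        ident = m.group(2)
        path = os.path.join(archive_dir, name)
        if ident in seen:
            removed.append(path)
            shutil.rmtree(path)    # duplicate of a copy we already kept
        else:
            seen[ident] = path
    return removed
```

Do a dry run first (print instead of shutil.rmtree) against your NAS archive directory until you have confirmed the bucket naming convention.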


vanderaj2
Path Finder

Thank you both for weighing in! I also have a follow-on question to the Splunk community:

Does anyone know whether during the compression of the raw data, Splunk does any data deduplication to reduce storage overhead? Just curious.....


s2_splunk
Splunk Employee

We don't deduplicate anything. The raw data file (journal.gz) is a gzip archive of independently compressed 128KB data slices.
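A quick way to see what that means for sizing (a sketch in plain Python, not Splunk's actual on-disk code): since each slice is compressed independently, identical data that arrives twice is stored twice, and gzip members simply concatenate.

```python
import gzip

# Two identical ~130 KB "slices", each compressed independently and then
# concatenated -- analogous in spirit to how journal.gz stores slices.
slice_data = b"Jan 01 00:00:00 host sshd[123]: example event\n" * 3000

one_slice = gzip.compress(slice_data)
two_slices = gzip.compress(slice_data) + gzip.compress(slice_data)

# Concatenated gzip members are still one valid gzip stream...
assert gzip.decompress(two_slices) == slice_data + slice_data
# ...and the duplicate slice costs the full compressed size again: no dedup.
assert len(two_slices) == 2 * len(one_slice)
```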


vanderaj2
Path Finder

Thank you! Appreciate the info on this....



jkat54
SplunkTrust

By default Splunk stores the replicated buckets and the searchable copies if coldToFrozenDir is specified. Therefore you can assume the following equation:

 (daily ingestion volume * 0.35 * search factor) + (daily ingestion volume * 0.15 * replication factor) = total storage needed

Total storage needed / number of peers = storage per peer.


s2_splunk
Splunk Employee

Not quite. Index and metadata files are not frozen, only rawdata (journal.gz) is.
So (daily ingestion volume * 0.15 * replication factor) = total storage needed is the best approximation.
This can be reduced to just ingestion * 0.15 if replicated buckets are deleted after freezing via a customer-provided script.
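Putting numbers to that formula may help when talking to the storage team. The ingestion rate, replication factor, and retention below are made-up example inputs; substitute your own.

```python
# Example inputs -- replace with your own figures.
daily_ingest_gb = 100            # raw data indexed per day, in GB
compression = 0.15               # journal.gz ~= 15% of raw (thread's assumption)
replication_factor = 2           # cluster RF; each peer freezes its own buckets
frozen_retention_days = 365      # how long frozen data stays on the NAS

frozen_gb = (daily_ingest_gb * compression
             * replication_factor * frozen_retention_days)
print(f"keeping replicated copies: {frozen_gb:,.0f} GB")
print(f"after pruning duplicates:  {frozen_gb / replication_factor:,.0f} GB")
```

With these example numbers you would ask for roughly 11 TB, or about 5.5 TB if a cleanup script prunes the replicated copies after freezing.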


jkat54
SplunkTrust

Oh, my mistake; I was thinking of the rb and db directories, which contain the raw data as you said.

So in a cluster, both the replicated copies and the original copies get frozen.


lfedak_splunk
Splunk Employee

Hey @vanderaj2, here's some documentation on planning your capacity: http://docs.splunk.com/Documentation/Splunk/6.6.3/Capacity/Estimateyourstoragerequirements. It says that "typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affect this value."
