Knowledge Management

Calculate disk storage per month

ips_mandar
Builder

I want to calculate the disk storage required for each month, for an indexing rate of 300 GB/day and a retention policy of 12 months.
I have a formula but I'm not sure how to use it:

( Daily average indexing rate ) x ( retention policy ) x 1/2

I have already gone through it: http://docs.splunk.com/Documentation/Splunk/7.0.3/Capacity/HowSplunkcalculatesdiskstorage

Any help on this would be great.
Thanks.

0 Karma
1 Solution

acharlieh
Influencer

OK, so let's start with the fact that you are ingesting 300 GB/day. Because Splunk compresses the raw data that it stores, what that page is saying is that an ingestion of 300 GB/day of logs would land on disk at roughly 150 GB/day. (Note that this compression ratio is a general estimate; how well the data compresses, and how many terms Splunk pulls out into metadata, depends heavily on the type of data you're ingesting.)
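
To make that concrete, here is a quick back-of-the-envelope sketch in Python of the formula you quoted (daily rate x retention x 1/2), assuming the ~50% on-disk estimate holds; your actual ratio will vary with the data:

    # Rough application of: daily rate x retention x 1/2
    daily_ingest_gb = 300          # GB/day of raw logs
    retention_days = 365           # 12-month retention
    compression = 0.5              # ~50% of ingested volume ends up on disk (estimate)

    total_gb = daily_ingest_gb * retention_days * compression
    print(f"{total_gb:,.0f} GB = about {total_gb / 1024:.1f} TB")   # 54,750 GB, about 53.5 TB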

Now let's assume you have a single server and you just want to store these logs (in Splunk) for 365 days. By default that works out to 150 GB/day * 365 days -> 53.5 TB of storage at the end of the first year, at which point the oldest logs start being frozen based on time (by default, deleted) as new ones come in. If you want to see how that ramps up: in the first month you'll need 150 GB/day * 30 days of data -> 4.4 TB; during the second month you'll need a total of 150 GB/day * 60 days -> 8.8 TB (as you're keeping the first month in addition to the second); and so on.
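
A small loop makes that month-by-month ramp-up explicit (again a sketch, assuming the ~50% estimate and 30-day months):

    # Cumulative disk usage as retention fills up (single instance, ~50% estimate).
    daily_on_disk_gb = 300 * 0.5          # ~150 GB/day actually stored on disk

    for month in range(1, 13):
        total_tb = daily_on_disk_gb * month * 30 / 1024
        print(f"Month {month:2d}: ~{total_tb:.1f} TB")
    # Month 1: ~4.4 TB, Month 2: ~8.8 TB, ... Month 12: ~52.7 TB (~53.5 TB at day 365)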

If you're not deleting data at the end of your Splunk retention, then you'll need to figure in the amount of time that you keep the data in frozen storage on disk. Splunk itself doesn't manage the lifecycle of data after it has been frozen.

When you introduce features like Indexer Clustering, this gets more complex: you now store multiple copies of the compressed raw data (the number of copies is the replication factor, or RF; each copy is estimated at 15% of the ingested size) and of the search metadata (the number of copies is the search factor, or SF; each copy is estimated at 35% of the ingested size) across multiple servers, which gives you nice safety guarantees. Let's say you have a cluster with SF=3 and RF=3 and keep the same amount of data: your 53.5 TB has now turned into 160.4 TB of total disk space. If you have a cluster of 7 indexers, that means around 22.9 TB per indexer for the data storage alone.
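
Here is a sketch of that clustered estimate, using the ~15% (raw data) and ~35% (metadata) rules of thumb; treat the ratios as rough assumptions, not guarantees:

    # Clustered storage estimate: RF copies of compressed raw + SF copies of metadata.
    daily_ingest_gb = 300
    retention_days = 365
    raw_ratio, meta_ratio = 0.15, 0.35   # rough rules of thumb; real data varies
    rf, sf = 3, 3                        # replication factor, search factor
    indexers = 7

    per_day_gb = daily_ingest_gb * (rf * raw_ratio + sf * meta_ratio)   # 450 GB/day
    total_tb = per_day_gb * retention_days / 1024                       # ~160.4 TB
    print(f"Cluster total: ~{total_tb:.1f} TB, per indexer: ~{total_tb / indexers:.1f} TB")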

We can add other features to your environment, like Data Model Acceleration, Report Acceleration, and Summary Indexing, where Splunk automatically or on a schedule runs searches and gathers statistics about your data so that your searches can leverage those statistics and perform faster, better, or over longer time periods on demand. The cost is execution time in off hours, plus some additional disk space to store the summary data.

There's also the possibility of reducing the space required on disk through TSIDX Reduction, where after a certain age Splunk throws away portions of the search metadata, saving disk at the cost of possibly having to rebuild that metadata when a search's time range crosses the reduction threshold.

If you want to play with some of the basic storage options, there's a tool available at https://splunk-sizing.appspot.com/
But I don't know of a good way to estimate storage requirements for the other features I've mentioned.


ips_mandar
Builder

Thanks for the detailed explanation... so there are multiple factors we need to consider while calculating disk storage, like Replication Factor and Search Factor in a clustered environment.
The thing I don't understand is why an ingestion of 300 GB/day of logs would be stored at 150 GB/day, i.e. half of the daily indexing rate... I thought the compressed raw data was 15% of the incoming size.

0 Karma

acharlieh
Influencer

Yes, raw data is typically compressed to about 15% of the ingested size for storage (the number of copies of this raw data in a cluster is the replication factor), but the metadata that enables your Splunk searches to perform well is typically about 35% of the ingested size (the number of copies of this data is the search factor)...

On a single instance with "typical" data and no other considerations, you store approximately 15% of ingestion for the raw data + 35% of ingestion for the metadata, so you use about 50% of the ingestion volume in disk space.
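
In other words, the per-day footprint works out roughly like this (illustrative only; the ratios depend on your data):

    # Single-instance rule of thumb: ~15% raw + ~35% metadata ~= 50% of ingested volume.
    daily_ingest_gb = 300
    raw_gb = daily_ingest_gb * 0.15    # ~45 GB/day of compressed raw data
    meta_gb = daily_ingest_gb * 0.35   # ~105 GB/day of index (tsidx) metadata
    print(raw_gb + meta_gb)            # ~150 GB/day on disk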

0 Karma

ips_mandar
Builder

Thanks. One more thing I don't understand: you mentioned "If you have a cluster of 7 indexers this means around 22.9 TB per indexer for the data storage alone." How is the 22.9 TB calculated?

0 Karma

Moreilly97
Path Finder

Indexer clusters are used to replicate the data, so there are multiple copies.
So if you have 160.4 TB of total disk space spread across a cluster of 7 indexers, then 160.4 / 7 = 22.9 TB per indexer.

0 Karma

ips_mandar
Builder

If each indexer needs to keep the raw data, then wouldn't each indexer need 53.5 TB of disk space? Each indexer has to hold primary as well as replicated bucket data...

0 Karma

splunker12er
Motivator

You need to know the expected EPS (events per second) and the size of the logs; based on that you can perform your calculations.

you can try this link : http://www.buzzcircuit.com/simple-log-storage-calculator/
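
The arithmetic behind that sort of calculator is roughly the following (the EPS and event-size numbers below are hypothetical, just to show the shape of the calculation):

    # Rough EPS-based sizing: events/sec x average event size x seconds/day.
    eps = 5000                      # hypothetical events per second
    avg_event_bytes = 700           # hypothetical average raw event size
    daily_gb = eps * avg_event_bytes * 86400 / 1024**3
    print(f"~{daily_gb:.0f} GB/day of raw logs")   # ~282 GB/day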

0 Karma

ips_mandar
Builder

Thanks for the reply. I have gone through the link you provided, but it doesn't seem specific to Splunk... and my log volume is 300 GB daily.

0 Karma