
Splunk Storage Sizing Guidelines and calculations

Path Finder

Hi Team,
I have a doubt about the Splunk Storage Sizing app:
https://splunk-sizing.appspot.com/#ar=0&c=1&cf=0.15&cr=180&hwr=7&i=5&rf=1&sf=1&st=v&v=100

I am keeping it very simple: let's suppose we need to ingest 100 GB/day.
Data retention: 6 months (180 days)
Number of indexers in cluster: 5
Search Factor: 1
Replication Factor: 1

As per the Splunk Storage Sizing app:
Raw Compression Factor - Typically the compressed raw data file is 15% of the incoming, pre-indexed data. The number of unique terms affects this value.
Metadata Size Factor - Typically metadata is 35% of raw data. The type of data and the index files will affect this value.

So as per the above, 15% of 100 GB = 15 GB,
and 35% of 15 GB = 5.25 GB,
which is 20.25 GB/day for 5 servers, i.e. 4.05 GB/day per server.

So, considering a retention period of 180 days: 4.05 × 180 = 729 GB/server for six months, and 3645 GB (~3.6 TB) for all 5 servers.
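To make the arithmetic reproducible, here is a minimal Python sketch of my calculation above (the 15% and 35% factors come from the app's descriptions; applying the 35% to the compressed file rather than to the raw ingestion is my own interpretation):

```python
# My reading of the sizing app's two factors (interpretation, not official).
daily_ingest_gb = 100
raw_factor = 0.15        # compressed rawdata = 15% of pre-indexed data
meta_factor = 0.35       # metadata = 35% of "raw data" -- I am assuming
                         # this means the compressed rawdata file

raw_gb = daily_ingest_gb * raw_factor          # 15 GB/day
meta_gb = raw_gb * meta_factor                 # 5.25 GB/day
daily_total_gb = raw_gb + meta_gb              # 20.25 GB/day for the cluster

indexers, retention_days = 5, 180
per_server_gb = daily_total_gb / indexers      # 4.05 GB/day per indexer
print(per_server_gb * retention_days)          # 729 GB/server over 6 months
print(daily_total_gb * retention_days / 1000)  # ~3.6 TB for all 5 servers
```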

But as per the Splunk Storage Sizing app, you need 1.8 TB/server and 9.1 TB for 5 servers.

My calculation and the Splunk Storage Sizing calculation don't match at all.
The Splunk Storage Sizing calculation works out to 50% of the pre-indexed data, whereas per their own guidelines metadata is 35% of raw data, not of the actual incoming data.

Please let me know what I am missing.


Re: Splunk Storage Sizing Guidelines and calculations

SplunkTrust

The 15% and 35% calculations should both be made against the same raw daily ingestion value. An easier method is to take 50% of the daily ingestion value as the daily storage requirement.
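As a sketch, with the numbers from the original question (the 50% shortcut is just 15% + 35% applied to the same daily ingestion value):

```python
# Both factors apply to the same raw daily ingestion value.
daily_ingest_gb = 100
raw_gb = daily_ingest_gb * 0.15    # 15 GB/day compressed rawdata
meta_gb = daily_ingest_gb * 0.35   # 35 GB/day metadata/index files
daily_total_gb = raw_gb + meta_gb  # 50 GB/day, i.e. 50% of ingestion

indexers, retention_days = 5, 180
total_tb = daily_total_gb * retention_days / 1000  # 9 TB for the cluster
print(total_tb / indexers)                         # 1.8 TB per indexer
```

That reproduces the app's roughly 1.8 TB/server figure.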

---
If this reply helps you, an upvote would be appreciated.

Re: Splunk Storage Sizing Guidelines and calculations

Path Finder

Thank you so much, Rich, for your reply.
But then I just came across this document, which says: "Typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affect this value."

https://docs.splunk.com/Documentation/Splunk/7.2.6/Capacity/Estimateyourstoragerequirements

Can you please explain what exactly the documentation means?
Because as per the documentation, I guess the calculation would be:
Stored data/day = compressed rawdata + index files
= 10 GB (10% of the 100 GB incoming data) + 1 GB to 11 GB (10% to 110% of the rawdata)
= 11 GB to 21 GB/day
≈ 25 GB/day for 5 servers (taking the higher value, rounded up)
= 5 GB/day per server
5 × 180 = 900 GB per server, and 25 × 180 = 4500 GB (4.5 TB) for 5 servers
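Here is the same estimate as a small sketch (the round-up from 21 GB to 25 GB/day is my own):

```python
# Documentation-based estimate: rawdata ~10% of incoming data,
# index files 10% to 110% of the rawdata file.
daily_ingest_gb = 100
raw_gb = daily_ingest_gb * 0.10                      # 10 GB/day
idx_low, idx_high = raw_gb * 0.10, raw_gb * 1.10     # 1 GB to 11 GB

daily_low, daily_high = raw_gb + idx_low, raw_gb + idx_high  # 11 to 21 GB/day
daily_rounded_gb = 25                                # my round-up of the high end

indexers, retention_days = 5, 180
print(daily_rounded_gb / indexers * retention_days)  # 900 GB per server
print(daily_rounded_gb * retention_days / 1000)      # 4.5 TB for 5 servers
```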

I might be wrong, but I just could not reconcile this with the documentation.

Also, as per the Splunk Storage Sizing app, the size of the index files (which I believe contain only pointers to your indexed data) is larger than the size of your actual indexed data (the rawdata).
Doesn't that sound unusual? I would expect the indexed data to be bigger than the index files.
