Deployment Architecture

Splunk Storage Sizing Guidelines and calculations

Ajinkya1992
Path Finder

Hi Team,
I have a doubt about the Splunk Storage Sizing app:
https://splunk-sizing.appspot.com/#ar=0&c=1&cf=0.15&cr=180&hwr=7&i=5&rf=1&sf=1&st=v&v=100

I am keeping it very simple. Let's suppose we need to ingest 100 GB/day:
Data retention: 6 months (180 days)
Number of indexers in the cluster: 5
Search Factor: 1
Replication Factor: 1

As per Splunk Storage Sizing
Raw Compression Factor - Typically the compressed raw data file is 15% of the incoming pre-indexed data. The number of unique terms affects this value.
Metadata Size Factor - Typically metadata is 35% of raw data. The type of data and index files affect this value.

So as per the above calculation, 15% of 100 GB = 15 GB
and 35% of 15 GB = 5.25 GB,
which is 20.25 GB/day across 5 servers, i.e. 4.05 GB/day per server.

So if we consider a retention period of 180 days, then 4.05 * 180 = 729 GB/server for six months, and 3,645 GB (~3.6 TB) for 5 servers.
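
Just to show my working, here is the same arithmetic as a throwaway search (a minimal sketch; the 0.15 and 0.35 factors are my reading of the sizing guidelines quoted above):

| makeresults
| eval daily_ingest_gb=100, retention_days=180, indexers=5
``` compressed rawdata at 15% of ingest, metadata at 35% of that rawdata (my reading) ```
| eval raw_gb=daily_ingest_gb*0.15
| eval meta_gb=raw_gb*0.35
| eval daily_total_gb=raw_gb+meta_gb
| eval per_indexer_gb=round(daily_total_gb*retention_days/indexers, 2)
| table raw_gb meta_gb daily_total_gb per_indexer_gb

That returns the same 20.25 GB/day and 729 GB/server figures as above.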

But as per the Splunk Storage Sizing app,
you need to have 1.8 TB/server and 9.1 TB for 5 servers.

My calculation and the Splunk Storage Sizing calculation don't match at all.
The Splunk Storage Sizing calculation works out to 50% of the pre-indexed data, whereas per their own guidelines metadata is 35% of the raw data, not of the actual incoming data.

Please let me know what I am missing.


edoardo_vicendo
Builder

Things have improved a lot thanks to tsidxWritingLevel enhancements.

If you set tsidxWritingLevel=4 (the maximum available today) and all your buckets have already been written with this level, you can achieve a compression ratio of around 5.35:1.
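
For reference, the setting lives in indexes.conf; a minimal sketch, assuming you apply it to all indexes through the [default] stanza (push it to the indexers, e.g. via the cluster manager, and keep in mind that only buckets written after the change use the new level):

# indexes.conf - only newly written buckets pick up the new tsidxWritingLevel
[default]
tsidxWritingLevel = 4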

This means 55 TB of raw logs will occupy around 10 TB (tsidx + raw) on disk.

At least this is what we have in our deployment.

This number can vary depending on the type of data you are ingesting.

Here is the query I used, run over All Time, starting from the one present in the Monitoring Console >> Indexing >> Index and Volumes >> Index Detail: Instance:

| rest splunk_server=<oneOfYourIndexers> /services/data/indexes datatype=all
  | join type=outer title [
    | rest splunk_server=<oneOfYourIndexers> /services/data/indexes-extended datatype=all
  ]
| `dmc_exclude_indexes`
| eval warm_bucket_size = coalesce('bucket_dirs.home.warm_bucket_size', 'bucket_dirs.home.size')
| eval cold_bucket_size = coalesce('bucket_dirs.cold.bucket_size', 'bucket_dirs.cold.size')
| eval hot_bucket_size = if(isnotnull(cold_bucket_size), total_size - cold_bucket_size - warm_bucket_size, total_size - warm_bucket_size)
| eval thawed_bucket_size = coalesce('bucket_dirs.thawed.bucket_size', 'bucket_dirs.thawed.size')
| eval warm_bucket_size_gb = coalesce(round(warm_bucket_size / 1024, 2), 0.00)
| eval hot_bucket_size_gb = coalesce(round(hot_bucket_size / 1024, 2), 0.00)
| eval cold_bucket_size_gb = coalesce(round(cold_bucket_size / 1024, 2), 0.00)
| eval thawed_bucket_size_gb = coalesce(round(thawed_bucket_size / 1024, 2), 0.00)

| eval warm_bucket_count = coalesce('bucket_dirs.home.warm_bucket_count', 0)
| eval hot_bucket_count = coalesce('bucket_dirs.home.hot_bucket_count', 0)
| eval cold_bucket_count = coalesce('bucket_dirs.cold.bucket_count', 0)
| eval thawed_bucket_count = coalesce('bucket_dirs.thawed.bucket_count', 0)
| eval home_event_count = coalesce('bucket_dirs.home.event_count', 0)
| eval cold_event_count = coalesce('bucket_dirs.cold.event_count', 0)
| eval thawed_event_count = coalesce('bucket_dirs.thawed.event_count', 0)

| eval home_bucket_size_gb = coalesce(round((warm_bucket_size + hot_bucket_size) / 1024, 2), 0.00)
| eval homeBucketMaxSizeGB = coalesce(round('homePath.maxDataSizeMB' / 1024, 2), 0.00)
| eval home_bucket_capacity_gb = if(homeBucketMaxSizeGB > 0, homeBucketMaxSizeGB, "unlimited")
| eval home_bucket_usage_gb = home_bucket_size_gb." / ".home_bucket_capacity_gb
| eval cold_bucket_capacity_gb = coalesce(round('coldPath.maxDataSizeMB' / 1024, 2), 0.00)
| eval cold_bucket_capacity_gb = if(cold_bucket_capacity_gb > 0, cold_bucket_capacity_gb, "unlimited")
| eval cold_bucket_usage_gb = cold_bucket_size_gb." / ".cold_bucket_capacity_gb

| eval currentDBSizeGB = round(currentDBSizeMB / 1024, 2)
| eval maxTotalDataSizeGB = if(maxTotalDataSizeMB > 0, round(maxTotalDataSizeMB / 1024, 2), "unlimited")
| eval disk_usage_gb = currentDBSizeGB." / ".maxTotalDataSizeGB

| eval currentTimePeriodDay = coalesce(round((now() - strptime(minTime,"%Y-%m-%dT%H:%M:%S%z")) / 86400, 0), 0)
| eval frozenTimePeriodDay = coalesce(round(frozenTimePeriodInSecs / 86400, 0), 0)
| eval frozenTimePeriodDay = if(frozenTimePeriodDay > 0, frozenTimePeriodDay, "unlimited")
| eval freeze_period_viz_day = currentTimePeriodDay." / ".frozenTimePeriodDay

| eval total_raw_size_gb = round(total_raw_size / 1024, 2)
| eval avg_bucket_size_gb = round(currentDBSizeGB / total_bucket_count, 2)
| eval compress_ratio = round(total_raw_size_gb / currentDBSizeGB, 2)." : 1"
| eval total_bucket_count = toString(coalesce(total_bucket_count, 0), "commas")
| eval totalEventCount = toString(coalesce(totalEventCount, 0), "commas")

| fields title, datatype
    currentDBSizeGB, totalEventCount, total_bucket_count,  avg_bucket_size_gb,
    total_raw_size_gb, compress_ratio, minTime, maxTime
    freeze_period_viz_day, disk_usage_gb, home_bucket_usage_gb, cold_bucket_usage_gb,
    hot_bucket_size_gb, warm_bucket_size_gb, cold_bucket_size_gb, thawed_bucket_size_gb,
    hot_bucket_count,   warm_bucket_count,   cold_bucket_count,   thawed_bucket_count,
    home_event_count,   cold_event_count,    thawed_event_count,
    homePath, homePath_expanded, coldPath, coldPath_expanded, thawedPath, thawedPath_expanded, summaryHomePath_expanded, tstatsHomePath, tstatsHomePath_expanded,
    maxTotalDataSizeMB, frozenTimePeriodInSecs, homePath.maxDataSizeMB, coldPath.maxDataSizeMB,
    maxDataSize, maxHotBuckets, maxWarmDBCount
| search title=*
| table title currentDBSizeGB total_raw_size_gb compress_ratio
| where isnotnull(total_raw_size_gb)
| where isnotnull(compress_ratio)
| stats sum(currentDBSizeGB) as currentDBSizeGB, sum(total_raw_size_gb) as total_raw_size_gb
| eval compress_ratio = round(total_raw_size_gb / currentDBSizeGB, 2)." : 1"


richgalloway
SplunkTrust

The 15% and 35% calculations should be made on the same raw daily ingestion value. An easier method is to take 50% of the daily ingestion value as the daily storage requirement.
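
Applied to your numbers, that is (a minimal sketch; 15% + 35% of the same 100 GB/day, i.e. 50%):

| makeresults
| eval daily_ingest_gb=100, retention_days=180, indexers=5
``` 15% raw + 35% metadata of the same daily ingestion = 50% per day on disk ```
| eval daily_disk_gb=daily_ingest_gb*(0.15+0.35)
| eval per_indexer_gb=round(daily_disk_gb*retention_days/indexers, 2)
| eval cluster_total_gb=daily_disk_gb*retention_days
| table daily_disk_gb per_indexer_gb cluster_total_gb

That gives 50 GB/day, about 1,800 GB (1.8 TB) per indexer over 180 days, and roughly 9 TB across the cluster, which lines up with the sizing app's 1.8 TB/server and 9.1 TB figures.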

---
If this reply helps you, Karma would be appreciated.

Ajinkya1992
Path Finder

Thank you so much Rich for your reply.
But then again, I just came across this documentation, which says: "Typically, the compressed rawdata file is 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file. The number of unique terms in the data affect this value."

https://docs.splunk.com/Documentation/Splunk/7.2.6/Capacity/Estimateyourstoragerequirements

Can you please explain what exactly the documentation means?
Because as per the documentation, I guess the calculation would be:
Total disk usage = compressed rawdata + index files
For 100 GB ingested: 10 GB (10% of the incoming data) + 1 GB to 11 GB (10% to 110% of that rawdata)
= 11 GB to 21 GB per day
≈ 25 GB/day across 5 servers (taking the higher value and rounding up)
= 5 GB/day per server
5 * 180 = 900 GB per server, and 25 * 180 = 4,500 GB (4.5 TB) for 5 servers
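
The same range expressed as a throwaway search (a minimal sketch; the 10% and 10%-110% factors are the ones quoted from the documentation, without the round-up to 25 GB):

| makeresults
| eval daily_ingest_gb=100, retention_days=180, indexers=5
``` per the docs: compressed rawdata ~10% of ingest, index files 10%-110% of that rawdata ```
| eval raw_gb=daily_ingest_gb*0.10
| eval tsidx_low_gb=raw_gb*0.10, tsidx_high_gb=raw_gb*1.10
| eval daily_low_gb=raw_gb+tsidx_low_gb, daily_high_gb=raw_gb+tsidx_high_gb
| eval per_indexer_180d_gb=round(daily_high_gb*retention_days/indexers, 2)
| table daily_low_gb daily_high_gb per_indexer_180d_gb

Without the round-up to 25 GB/day, the high end lands around 756 GB per server over 180 days.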

I might be wrong, but I just could not reconcile this with the documentation.

Because as per the Splunk Storage Sizing app, the size of the index files (which I believe only hold pointers to your indexed data) is larger than the size of your actual indexed data (rawdata).
Doesn't that sound unusual? I would expect the indexed data to be bigger than the index files.
