
Per Index Configuration

edwardrose
Contributor

Hello All,

I am trying to clean up our indexes and their sizes to ensure that we are keeping the correct amount of data for each index.  I have about 5 to 10 really busy indexes that bring in most of the data.

 

pan_logs ~200GB/day
syslog ~10GB/day
checkpoint (coming soon) ~250GB/day
wineventlog ~650GB/day
network ~180GB/day

So the question is: when I create an index configuration, for example for wineventlog,

 

 

[wineventlog]
homePath = volume:hot/wineventlog/db
homePath.maxDataSizeMB = 19500000
coldPath = volume:cold/wineventlog/colddb
coldPath.maxDataSizeMB = 58500000
thawedPath = /splunk/cold/wineventlog/thaweddb
maxHotBuckets = 10
maxDataSize = auto_high_volume
maxTotalDataSizeMB = 78000000
disabled = 0
repFactor = auto

 

 

So 30 days of hot/warm would be 19.5TB, 90 days of cold data would be 58.5TB, and the total size would be 78TB. Those sizes would then be divided across the total number of indexers we have (20), so each indexer should host about 975GB of hot/warm and 2.925TB of cold data. And Splunk would start to roll data to frozen (/dev/null) when the total (hot/warm + cold) data reached 78TB. Is that correct? And do I need to specify maxTotalDataSizeMB if I am using the homePath and coldPath settings?
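For reference, the arithmetic behind those numbers (decimal units, 1TB = 1,000,000 MB):

hot/warm: 650GB/day x 30 days = 19.5TB  -> homePath.maxDataSizeMB = 19500000
cold:     650GB/day x 90 days = 58.5TB  -> coldPath.maxDataSizeMB = 58500000
total:    19.5TB + 58.5TB     = 78TB    -> maxTotalDataSizeMB    = 78000000
per indexer (20): 19.5TB / 20 = 975GB hot/warm, 58.5TB / 20 = 2.925TB cold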

 

Thanks

ed


andynewsoncap
Engager
This is a problem I have been struggling with for years. I don't understand why the splint platform can't do this itself. It's even more complicated because the tsidx files and the raw data both have compression ratios that are individual to each index, so to do this properly you need to know not only the number of days you wish to keep and the size of that data, but also the compression ratio for each of those indexes.

PickleRick
SplunkTrust
SplunkTrust

Well... apart from the obvious cheap shot at your "splint" (but I suppose it might have been auto-correct), there is an issue of "how would you do it better"? Remember that there are many factors at play here - amount of available space, retention time requirements, different types of storage.

The current bucket management machinery does allow for quite a bit of flexibility but you can't just produce storage out of thin air.


isoutamo
SplunkTrust
SplunkTrust

Hi

This is how I have done it.

You are already using volumes, which is an excellent (IMHO mandatory) practice. You should set the total size of your volumes (maxVolumeDataSizeMB) to 78TB minus some overflow space, because from time to time there will be more data in the indexes than the sizes you defined before Splunk starts to migrate or freeze those buckets. How much overflow space you actually need depends on whether you have a cluster or a single node, how many hot buckets you have configured, and so on. There are also some bugs that make more free space necessary than before to avoid filling the disk.

Then define the index max size (maxTotalDataSizeMB), which sets the total hot/warm + cold size (the default is 500GB). After that, fine-tune it with the separate hot/warm and cold sizes. A rough sketch of how those settings fit together is below.

And the reality is that the size of your index depends on which limit is hit first (volume size, index size, hot/warm vs. cold size, or number of buckets).
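As a purely illustrative sketch (all sizes are per-indexer placeholders, not recommendations for your environment):

[volume:hot]
path = /splunk/hot
# keep this a bit below the physical hot disk size on each indexer, to leave overflow headroom
maxVolumeDataSizeMB = 3200000

[volume:cold]
path = /splunk/cold
maxVolumeDataSizeMB = 10500000

[volume:_splunk_summaries]
path = /splunk/cold/splunk_summaries
# cap the summaries volume too, so it cannot eat the cold disk
maxVolumeDataSizeMB = 100000

[wineventlog]
homePath = volume:hot/wineventlog/db
coldPath = volume:cold/wineventlog/colddb
thawedPath = /splunk/cold/wineventlog/thaweddb
# total hot/warm + cold for this index, per indexer; whichever limit is hit first wins
maxTotalDataSizeMB = 4000000
homePath.maxDataSizeMB = 1000000
coldPath.maxDataSizeMB = 3000000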

r. Ismo


andynewsoncap
Engager

Splunk, yes, sorry, it was 3am (one of those argh moments).

So, as part of this, the automation I need to build has to grow / shrink the disks; that part is key.

In an ideal world Splunk would tell the automation when it needs to grow / shrink the volumes on the cluster nodes.

After the resize, the automation would update the Splunk .conf files to set maxTotalDataSizeMB below the total disk space now available on each cluster node, and then adjust the .conf for each index.

Key to this is scanning all indexes: get the daily compression ratio of the tsidx files, the compression ratio of the raw data, and the daily data throughput per index (see the search sketch below).

For me, I need 90 days of data, so I build a safety factor into this.
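Something along these lines can pull those per-index numbers out of Splunk itself. This is only a rough sketch: it assumes the standard dbinspect command and the default license_usage.log fields (idx, b) are available, so treat it as a starting point rather than a finished report.

Average daily ingest per index over the last 30 days:

index=_internal source=*license_usage.log type=Usage earliest=-30d@d latest=@d
| bin _time span=1d
| stats sum(b) as daily_bytes by idx _time
| stats avg(daily_bytes) as avg_daily_bytes by idx

On-disk size and an approximate compression ratio per index:

| dbinspect index=*
| stats sum(rawSize) as raw_bytes sum(sizeOnDiskMB) as size_on_disk_mb by index
| eval compression = round((size_on_disk_mb * 1024 * 1024) / raw_bytes, 2)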


PickleRick
SplunkTrust
SplunkTrust

OK, so you have a very unusual edge case. Typically, systems need to handle a constant amount of storage, and that's what Splunk does.


andynewsoncap
Engager

I would respectfully disagree. Ours is an observability platform, and to turn on anomaly detection and predictive analytics we need a constant flow of 90 days of metrics data. This is not a static number, as customers add and remove CIs from their platforms. We are also an MSP with hundreds of customers, and for data sovereignty we create many indexes per customer. Typically each of our platforms has between 200 and 300 customer indexes.

I am trying to automate away the toil of constantly having to review and update the config to ensure we keep the right amount of data to feed and water the anomaly detection and predictive analytics, without over-provisioning our storage.


PickleRick
SplunkTrust
SplunkTrust

And you really think this is a common use case? I use Windows at home in a VM with GPU passthrough, but I don't claim that's a typical desktop setup.


edwardrose
Contributor

I did forget to add the following:

 

[default]
frozenTimePeriodInSecs = 10368000
homePath.maxDataSizeMB = 3000000
coldPath.maxDataSizeMB = 10598400

[volume:_splunk_summaries]
path = /splunk/cold/splunk_summaries

[volume:hot]
path = /splunk/hot
maxVolumeDataSizeMB = 3400000

[volume:cold]
path = /splunk/cold
maxVolumeDataSizeMB = 10957620

 

This is where I get confused. We have a total of 68TB of hot storage divided among the 20 indexers, so each indexer has a 3.4TB hot volume, and we have 220TB of cold storage, with each indexer having 11TB. I gave the default homePath value 3TB, leaving 400GB of extra room, and I gave coldPath 10.5TB, leaving about 500GB of extra room.

But if, from my example, 30 days of hot/warm data for wineventlog is 19.5TB, does Splunk automatically divide that between all 20 indexers and apply the homePath limit of 19.5TB to the total amount of data across all 20 indexers?
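If my arithmetic is right (decimal units, 1TB = 1,000,000 MB), the per-indexer volume side looks like this:

hot:  68TB  / 20 indexers = 3.4TB per indexer -> [volume:hot]  maxVolumeDataSizeMB = 3400000
cold: 220TB / 20 indexers = 11TB per indexer  -> [volume:cold] maxVolumeDataSizeMB = 10957620 (about 10.96TB)

and the [default] per-index caps (homePath.maxDataSizeMB = 3000000, coldPath.maxDataSizeMB = 10598400) sit just under those volume limits.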


isoutamo
SplunkTrust
SplunkTrust

First, you should also add maxVolumeDataSizeMB to the _splunk_summaries volume.

Your volume:hot says its max size is 3.2TB per indexer. These values are always per individual indexer, not totals for the cluster. The total size depends on how many indexers you have in your cluster.

In an indexer cluster the total storage used for an index depends on your SF + RF (search and replication factor) and on whether you have a single-site or multisite cluster. But as I said, all of those settings in the CM's indexes.conf apply to each individual host in the cluster, not to the cluster as a whole. So one node could hold the full configured amount in its coldPath, or one node could have e.g. 1TB, a second 1.5TB and another the full amount. That depends on how well your data is distributed over the indexers in the cluster, how those buckets are replicated, and so on.
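To make that concrete with the wineventlog numbers (a purely illustrative calculation; I am assuming RF = SF = 2, so every copy is a full searchable copy, and a perfectly even spread, which real clusters never quite achieve):

hot/warm target for the whole cluster: 19.5TB of indexed data
with RF = SF = 2 every bucket exists twice: 19.5TB x 2 = 39TB on disk
spread over 20 indexers: 39TB / 20 = about 1.95TB of wineventlog hot/warm per indexer

So a per-indexer homePath.maxDataSizeMB for this index would be closer to 1950000 than to the cluster-wide 19500000, and the exact figure moves with RF, SF and how evenly the buckets land.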

I hope that this explains it.

r. Ismo
