Recently, I noticed that the disk on one of my Indexers was nearly full. Currently, all event data is going into the main index and we had all the defaults set for bucket rolling behaviors in the main index. The server has been indexing data for at least two years.
We want to retain searchable event data going back one year and are not concerned with archiving beyond that, so I changed the archive policy to be more restrictive (changed frozenTimePeriodInSecs to 31556952 in $SPLUNK_HOME/etc/system/local/indexes.conf). I expected this to free up a lot of space by rolling data older than one year to frozen and deleting it, but it didn't. I came back on Monday morning after making this change and barely a dent had been made in the amount of free space. There are no cold buckets in my main index's coldPath right now, so my change must have had some effect.
I suspect that this Indexer was incorrectly sized when it was first set up, and that has led to this disk space issue. We ingest ~2.5 GB/day on this Indexer. The disk is 200 GB in total, with 140 GB allocated to the main index (which includes hot/warm/cold buckets).
Do I need to add more drive space and increase the size of my main index in order to fix this problem?
Yes, you probably do need to add more storage. If you are writing ~2.5 GB/day to the default index and you plan to retain data for 365 days, your storage requirement would be:
2.5 GB * 365 * 50% (on-disk size) = 456 GB
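That arithmetic is easy to check with a quick script. Note that the 50% on-disk ratio is only a rule of thumb; actual compression varies with the data, so measure your own environment before committing to a number:

```python
# Rough index sizing: daily ingest * retention days * on-disk ratio.
# The 0.5 ratio (compressed rawdata + index files as a fraction of raw
# ingest) is a common rule of thumb, not a guarantee.
daily_ingest_gb = 2.5
retention_days = 365
on_disk_ratio = 0.5

required_gb = daily_ingest_gb * retention_days * on_disk_ratio
print(f"~{required_gb:.0f} GB needed")  # ~456 GB needed
```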
You can use the Splunk sizing tool to help calculate your disk requirements.
If you are running Splunk v6.3 you can look at the DMC (Distributed Management Console) to get a better idea of how well data is compressing in your index, the time range your index covers, and how often your buckets are freezing.
You can also easily verify how data is expiring by searching for:

index=_internal sourcetype=splunkd bucketmover "will attempt to freeze"
These messages indicate that Splunk is deleting data (moving from cold to frozen). The log message will also indicate the reason why the data is rolling off; either due to aging out (frozenTimePeriodInSecs) or because of the storage limit (maxTotalDataSizeMB).
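If you want to see how often freezes are happening over time, a variation of that search can chart them. This is a sketch: it assumes the events carry the standard component=BucketMover field, as splunkd logs normally do:

```
index=_internal sourcetype=splunkd component=BucketMover "will attempt to freeze"
| timechart count
```

A spike in this chart right after a config change is a quick sanity check that the new retention settings are actually being applied.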
Indeed. To clarify: you've restricted the index to 140 GB. Using the rule-of-thumb 2.5 GB/day * 50%, that's enough for about 112 days of data. Restricting the retention to a year won't change anything, because there is no data that old to delete.
Makes sense! I think it must have been even worse before, because we had the index restricted to 140 GB, but the retention period was the default, which I believe is ~6 years. So we had storage that was sized to retain data for 112 days, but wouldn't delete until the data was 6 years old. I'm surprised that we didn't have this space problem earlier.
The oldest bucket gets frozen as soon as you hit one of the two restrictions, size or age. In your case, Splunk deletes buckets as soon as your 140 GB is full. Whether you theoretically allow one year or six is irrelevant to that; your primary constraint is space.
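As a rough illustration of that rule (a hypothetical simplification, not Splunk's actual implementation): the oldest bucket is frozen whenever either constraint is violated, and this repeats until both are satisfied.

```python
# Hypothetical sketch of the freeze decision: the oldest bucket is frozen
# while EITHER the index exceeds its size cap (maxTotalDataSizeMB) OR the
# bucket's newest event is older than frozenTimePeriodInSecs.

def buckets_to_freeze(buckets, max_total_mb, frozen_secs, now):
    """buckets: list of (latest_event_time, size_mb), oldest first."""
    remaining = list(buckets)
    frozen = []
    while remaining:
        oldest = remaining[0]
        total_mb = sum(size for _, size in remaining)
        too_big = total_mb > max_total_mb
        too_old = (now - oldest[0]) > frozen_secs
        if too_big or too_old:
            frozen.append(remaining.pop(0))
        else:
            break
    return frozen
```

With a 140,000 MB cap and all data younger than a year, only the size test ever fires, which is exactly the situation described above.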
So if the storage needs for my main index are 456 GB, would I also set maxTotalDataSizeMB to 456000? Or do you typically set maxTotalDataSizeMB to a value that is smaller than your total index space needs?
This is the slightly tricky part: you need to set both values. Let's stick with what you have in this example, which is a single indexer. You have a volume of data coming into your indexer. On disk, that data takes up roughly 50% of its raw size, but that value is just an estimate. So you calculate your disk requirement to be about 456 GB, and you could set maxTotalDataSizeMB to 456 GB (466944 MB). But you need to check regularly whether you're approaching that limit, otherwise you risk prematurely deleting data. You also need to make sure you have the storage available (whether that means growing the volume, adding more physical disk, etc.); you don't want to hit that limit and then realize you need more storage. So: set the max age for your data to one year, so that you're not retaining data you don't need, AND set the max size of the db, so that you don't just fill your disk and stop indexing altogether.
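For this scenario, the two settings together would look something like this in indexes.conf (both values come from the numbers discussed above; 466944 MB = 456 GB * 1024):

```
[main]
# Delete data once it is older than ~1 year (365.25 days in seconds)
frozenTimePeriodInSecs = 31556952
# Cap the index at ~456 GB so a full disk never halts indexing
maxTotalDataSizeMB = 466944
```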
In our environment we are very paranoid about expiring data, so we have several monitors in place to let us know when we approach our maxTotalDataSizeMB. First, we use rest API queries to compare the current size of each index with its max size. Secondly, we watch our _internal index for bucket rolls due to any reason other than reaching the frozenTimePeriodInSecs value. Finally, all of our volumes have alerts configured when they start approaching capacity.
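As a sketch of that first check, a REST-based search along these lines compares each index's current size against its cap (currentDBSizeMB and maxTotalDataSizeMB are fields exposed by the data/indexes endpoint; the pct_used field name here is just an illustration):

```
| rest /services/data/indexes
| eval pct_used = round(currentDBSizeMB / maxTotalDataSizeMB * 100, 1)
| table title currentDBSizeMB maxTotalDataSizeMB pct_used
```

Saved as an alert with a threshold on pct_used, this gives early warning well before size-based freezing begins.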
Sure! Here it is:
[default]
minRawFileSyncSecs = disable
throttleCheckPeriod = 15
rotatePeriodInSecs = 60
compressRawdata = true
quarantinePastSecs = 77760000
quarantineFutureSecs = 2592000
maxTotalDataSizeMB = 140000
maxHotIdleSecs = 0
maxMetaEntries = 1000000
serviceMetaPeriod = 25
syncMeta = true
assureUTF8 = false
frozenTimePeriodInSecs = 31556952
blockSignatureDatabase = _blocksignature
maxWarmDBCount = 300
maxConcurrentOptimizes = 3
coldToFrozenDir =
blockSignSize = 0
maxHotBuckets = 3
enableRealtimeSearch = true
maxHotSpanSecs = 7776000
coldToFrozenScript =
memPoolMB = auto
partialServiceMetaPeriod = 0
suppressBannerList =
rawChunkSizeBytes = 131072
sync = 0
maxRunningProcessGroups = 20
defaultDatabase = main
maxDataSize = auto
If there are no buckets in the colddb folder (and no significant decrease in size after you set frozenTimePeriodInSecs to one year), that means all your searchable data is stored in hot/warm buckets. Could you check the number of buckets in the db folder (how many hot and how many warm)?
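A quick way to get those counts is to look at the directory names: warm buckets are conventionally named db_<newestTime>_<oldestTime>_<id> and hot buckets hot_v1_<id>. This is a hypothetical helper, not a Splunk tool; the example path assumes the default main-index location and should be adjusted to your homePath:

```python
# Count hot vs. warm buckets in an index's db directory by naming convention
# (hot buckets start with "hot_", warm buckets with "db_").
from pathlib import Path

def count_buckets(db_path):
    hot = warm = 0
    for p in Path(db_path).iterdir():
        if not p.is_dir():
            continue
        if p.name.startswith("hot_"):
            hot += 1
        elif p.name.startswith("db_"):
            warm += 1
    return hot, warm

# e.g. count_buckets("/opt/splunk/var/lib/splunk/defaultdb/db")
```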