I'm trying to find a way to programmatically get the average size of data flowing into each index on a daily basis so we can set indexes.conf max data retention based on how many days we want to hold for each index. There seem to be a variety of ways to get the size of indexes per day, none of which seem to match. There are a lot of posts on this over the last decade, but reading all of them has just left me more confused.
Ideally I can figure out the average amount of MB required on disk (across all indexers in a site) per index. Then I can define how many days I want to retain on a per index basis and set homePath.maxDataSizeMB to avg_per_day_size_mb * number_of_days_to_retain. Ignore cold/frozen for the moment and assume that all hot/warm is going to the same path (e.g. no smartstore). How can I do this?
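The arithmetic described above can be sketched in a few lines of Python. The index names, daily averages, and retention targets here are invented placeholders; the averages would come from whatever measurement method you settle on:

```python
def max_data_size_mb(avg_mb_per_day: float, retain_days: int) -> int:
    """homePath.maxDataSizeMB = average daily MB on disk * days to retain."""
    return round(avg_mb_per_day * retain_days)

# Hypothetical measured averages (MB/day on disk across all indexers in a site)
# and business-defined retention targets per index.
avg_per_day = {"web": 2048.0, "firewall": 512.0}
retain_days = {"web": 90, "firewall": 365}

for index in avg_per_day:
    print(f"[{index}]")
    print(f"homePath.maxDataSizeMB = {max_data_size_mb(avg_per_day[index], retain_days[index])}")
```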
The basic ways I've seen:
1. Use dbinspect. I can filter by state = "hot" or "warm" as well as filter by bucketId to throw away replicated copies of the bucket (either due to replication factor or site replication factor). Given that dbinspect shows you a snapshot in time of buckets, I don't think it accurately shows me the amount of new data coming into the index per day.
2. Use the license usage log. When I average this metric out to a per-day, per-index value, it feels fairly close to the amount of data I'm storing across Splunk (looking at du values on Linux hosts). However, it doesn't count all indexes (notable events, internal logs, etc.), which I need to account for.
3. Estimate raw data size for messages (e.g. len(_raw)) over a short period of time and then use those values with tstats over a longer period of time to estimate size. This feels like it will give the closest value; however, it also feels like a really dirty way to get the data.
4. Use metrics log to look at throughput values. Reading documentation, this just seems to be plain wrong as it stores only samples and doesn't show size of the message after it comes out of the various parsing queues.
What are folks doing to reliably size their indexes.conf based on # of days they want to retain per index?
I ended up taking this approach and the numbers are validating pretty well. I'm sure they can be joined together in a single Splunk query, but I kept getting wonky results when using join, so I just join the numbers together in the code that generates our indexes.conf.
Get the avg count per day for the last business week by index:
| tstats count where (index=* OR index=_*) AND earliest=-w@w+1d latest=-w@w+6d by index
| eval eventCountPerDay=count/5
| table index, eventCountPerDay
Then get the size per event per index:
| dbinspect index="*" earliest=-w@w+1d latest=-w@w+5d
| where sizeOnDiskMB>1
| search state="hot" OR state="warm"
| eval mbPerEvent=sizeOnDiskMB/eventCount
| stats median(mbPerEvent) as avgMBPerEvent by index
Since we are constantly growing the environment we are monitoring we can run this on a regular basis (monthly) and true-up index sizes and add storage as needed to meet the warm retention requirements.
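For reference, the out-of-Splunk join described above can be sketched like this. The two dicts stand in for the results of the tstats and dbinspect searches (fetched however you run them, e.g. via the REST API); the index names, numbers, and retention targets are made up:

```python
# Results of the two searches above, keyed by index:
# events_per_day from the tstats search, mb_per_event from the dbinspect search.
events_per_day = {"web": 1_500_000.0, "firewall": 4_200_000.0}
mb_per_event = {"web": 0.0009, "firewall": 0.0003}

# Business-defined retention targets, in days.
retain_days = {"web": 90, "firewall": 365}

def stanza(index: str) -> str:
    """Build one indexes.conf stanza from avg daily MB * retention days."""
    avg_mb_per_day = events_per_day[index] * mb_per_event[index]
    max_mb = round(avg_mb_per_day * retain_days[index])
    return f"[{index}]\nhomePath.maxDataSizeMB = {max_mb}\n"

print("\n".join(stanza(i) for i in sorted(events_per_day)))
```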
You have to decide on an architecture first: Google "splunk validated architectures". That is 1/3 of your equation. Then decide on desired retention. That is another 1/3. Then measure data velocity. That is the last 1/3.
This is backwards thinking. What you need to do is use volume-based settings to allow the maximum use of your available disk; in other words, keep as much data as will fit, even if you don't think that you need it and it is much more than your required retention goal. Then monitor bucketmover events in the _internal index when splunk freezes buckets. Each bucket name contains 2 timestamps that specify the time range of the data it holds. So just track these events by subtracting the larger timestamp from now() for each index, and you can continuously calculate your effective retention and plan accordingly as it gradually erodes.
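To make the bucket-name arithmetic concrete, here is a rough Python sketch. It assumes the standard warm/cold bucket naming convention db_&lt;newest-epoch&gt;_&lt;oldest-epoch&gt;_&lt;id&gt;; the bucket name in the example is invented:

```python
def effective_retention_days(bucket_name: str, now: int) -> float:
    """Days between now and the newest event in a bucket being frozen.

    Assumes the db_<newest-epoch>_<oldest-epoch>_<id> naming convention;
    the larger of the two timestamps is the newest event in the bucket.
    """
    _, ts_a, ts_b, _ = bucket_name.split("_")
    newest = max(int(ts_a), int(ts_b))  # take the larger timestamp
    return (now - newest) / 86400.0

# Invented example: a bucket frozen "now" whose newest event is 90 days old.
now = 1_700_000_000
name = f"db_{now - 90 * 86400}_{now - 120 * 86400}_42"
print(effective_retention_days(name, now))  # prints 90.0
```

Run over the stream of bucketmover freeze events, the per-index minimum of this value is your effective retention at any point in time.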
That feels like an unnatural way to think about it. I've got business requirements for keeping x days of each index and I need to acquire as much storage as needed to meet those requirements. How do I know how much storage to present to my indexers if I can't predict the amount of storage needed?