I'm trying to find a way to programmatically get the average size of data flowing into each index on a daily basis so we can set indexes.conf max data retention based on how many days we want to hold for each index. There seem to be a variety of ways to get the size of indexes per day, none of which seem to match. There are a lot of posts on this over the last decade, but reading all of them has just left me more confused.
Ideally I can figure out the average amount of MB required on disk (across all indexers in a site) per index. Then I can define how many days I want to retain on a per index basis and set homePath.maxDataSizeMB to avg_per_day_size_mb * number_of_days_to_retain. Ignore cold/frozen for the moment and assume that all hot/warm is going to the same path (e.g. no smartstore). How can I do this?
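The arithmetic described above can be sketched in a few lines of Python. The index names, daily averages, and retention targets here are invented placeholders; the averages would come from whatever measurement method you settle on:

```python
def max_data_size_mb(avg_mb_per_day: float, retain_days: int) -> int:
    """homePath.maxDataSizeMB = average daily MB on disk * days to retain."""
    return round(avg_mb_per_day * retain_days)

# Hypothetical measured averages (MB/day on disk across all indexers in a site)
# and business-defined retention targets per index.
avg_per_day = {"web": 2048.0, "firewall": 512.0}
retain_days = {"web": 90, "firewall": 365}

for index in avg_per_day:
    print(f"[{index}]")
    print(f"homePath.maxDataSizeMB = {max_data_size_mb(avg_per_day[index], retain_days[index])}")
```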
The basic ways I've seen:
1. Use dbinspect. I can filter by state = "hot" or "warm" as well as filter by bucketId to throw away replicated copies of the bucket (either due to replication factor or site replication factor). Given that dbinspect shows you a snapshot in time of buckets, I don't think it accurately shows me the amount of new data coming into the index per day.
2. Use the license usage log. When I average this metric out to a per-day, per-index value, it feels fairly close to the amount of data I'm storing across Splunk (looking at du values on Linux hosts). However, it doesn't count all indexes (notable events, internal logs, etc.), which I need to account for.
3. Estimate raw data size for messages (e.g. len(_raw)) over a short period of time and then use those values with tstats over a longer period of time to estimate size. This feels like it will give the closest value; however, it also feels like a really dirty way to get the data.
4. Use metrics log to look at throughput values. Reading documentation, this just seems to be plain wrong as it stores only samples and doesn't show size of the message after it comes out of the various parsing queues.
What are folks doing to reliably size their indexes.conf based on # of days they want to retain per index?
I ended up taking this approach and the numbers are validating pretty well. I'm sure they can be joined together in a single Splunk query, but I kept getting wonky results when using join, so I just join the numbers together in the code that generates our indexes.conf.
Get the avg count per day for the last business week by index:
| tstats count where (index=* OR index=_*) AND earliest=-w@w+1d latest=-w@w+6d by index
| eval eventCountPerDay=count/5
| table index, eventCountPerDay
Then get the size per event per index:
| dbinspect index="*" earliest=-w@w+1d latest=-w@w+5d
| where sizeOnDiskMB>1
| search state="hot" OR state="warm"
| eval mbPerEvent=sizeOnDiskMB/eventCount
| stats median(mbPerEvent) as avgMBPerEvent by index
Since we are constantly growing the environment we are monitoring we can run this on a regular basis (monthly) and true-up index sizes and add storage as needed to meet the warm retention requirements.
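For reference, the out-of-Splunk join described above can be sketched like this. The two dicts stand in for the results of the tstats and dbinspect searches (fetched however you run them, e.g. via the REST API); the index names, numbers, and retention targets are made up:

```python
# Results of the two searches above, keyed by index:
# events_per_day from the tstats search, mb_per_event from the dbinspect search.
events_per_day = {"web": 1_500_000.0, "firewall": 4_200_000.0}
mb_per_event = {"web": 0.0009, "firewall": 0.0003}

# Business-defined retention targets, in days.
retain_days = {"web": 90, "firewall": 365}

def stanza(index: str) -> str:
    """Build one indexes.conf stanza from avg daily MB * retention days."""
    avg_mb_per_day = events_per_day[index] * mb_per_event[index]
    max_mb = round(avg_mb_per_day * retain_days[index])
    return f"[{index}]\nhomePath.maxDataSizeMB = {max_mb}\n"

print("\n".join(stanza(i) for i in sorted(events_per_day)))
```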
You have to decide on an architecture first: Google "splunk validated architectures". That is 1/3 of your equation. Then decide on desired retention. That is another 1/3. Then measure data velocity. That is the last 1/3.
This is backwards thinking. What you need to do is use volume-based settings to allow the maximum use of your available disk; in other words, keep as much data as will fit, even if you don't think that you need it and it is much more than your required retention goal. Then monitor bucketmover events in the _internal index when splunk freezes buckets. Each bucket name contains 2 timestamps that specify the time range of the data it holds. So just track these events by subtracting the larger timestamp from now() for each index, and you can continuously calculate your effective retention and plan accordingly as it gradually erodes.
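To make the bucket-name arithmetic concrete, here is a rough Python sketch. It assumes the standard warm/cold bucket naming convention db_&lt;newest-epoch&gt;_&lt;oldest-epoch&gt;_&lt;id&gt;; the bucket name in the example is invented:

```python
def effective_retention_days(bucket_name: str, now: int) -> float:
    """Days between now and the newest event in a bucket being frozen.

    Assumes the db_<newest-epoch>_<oldest-epoch>_<id> naming convention;
    the larger of the two timestamps is the newest event in the bucket.
    """
    _, ts_a, ts_b, _ = bucket_name.split("_")
    newest = max(int(ts_a), int(ts_b))  # take the larger timestamp
    return (now - newest) / 86400.0

# Invented example: a bucket frozen "now" whose newest event is 90 days old.
now = 1_700_000_000
name = f"db_{now - 90 * 86400}_{now - 120 * 86400}_42"
print(effective_retention_days(name, now))  # prints 90.0
```

Run over the stream of bucketmover freeze events, the per-index minimum of this value is your effective retention at any point in time.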
That feels like an unnatural way to think about it. I've got business requirements for keeping x days of each index and I need to acquire as much storage as needed to meet those requirements. How do I know how much storage to present to my indexers if I can't predict the amount of storage needed?