A few months back I was building a dashboard and looking at various disk usage charts, one of which was Overall Disk Usage.
While doing research, I came across several posts that mentioned a divide-by-two rule of thumb. Using this rule, we were able to pull the proper numbers.
We were just discussing this on a conference call, and the divide-by-two rule came up again. I cannot for the life of me google the right phrase to find out where this rule came from, or why it works.
Does anyone have any insight?
When Splunk indexes your data, it compresses the files depending on how many unique key-value pairs you have; more unique key-value pairs mean a larger tsidx file. As a general rule, the tsidx files come to around 35% of the original raw data size, while the journal.gz (your compressed raw data) takes roughly 15%. Add these up and you have your 50%.
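The arithmetic behind the rule can be sketched as a quick estimate. The 35% and 15% ratios are the rough figures from the answer above, and the sizes here are illustrative, not measured values:

```python
# Rough Splunk disk-usage estimate based on the rule of thumb above.
# Actual ratios are data dependent; these are the commonly cited figures.
raw_gb = 100.0         # hypothetical volume of raw data ingested
tsidx_ratio = 0.35     # tsidx (index) files: ~35% of raw size
journal_ratio = 0.15   # journal.gz (compressed raw data): ~15% of raw size

estimated_disk_gb = raw_gb * (tsidx_ratio + journal_ratio)
print(estimated_disk_gb)  # roughly 50 GB, i.e. the divide-by-two rule
```

In other words, the "divide by two" comes from 35% + 15% summing to about half of the raw data size.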
Reducing size by 50% is a good first approximation, but the actual ratio is going to be completely data dependent. Here are some useful references:
1) This answer says "I typically see about 40% to 50% compression"... https://answers.splunk.com/answers/52075/compression-rate-for-indexes-hot-warm-cold-frozen.html
2) That's probably the underlying assumption behind the 1/2 in this answer...
3) One of the answers in this one has some useful breakdown information: "It's usually about half of the original size, so for your question 100GB would need about 50gb, from those around 10gb would be the original logs zipped, and 40gb the indexes."
4) However, other answers indicate that, in specific situations, more disk space is needed to store the index plus data than the original data occupied (fully indexing CSV input, for example).
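Taken together, the references suggest treating the ratio as a parameter rather than a constant. A minimal sketch, assuming a hypothetical `estimated_disk_gb` helper and illustrative ratios (0.5 for the rule of thumb, and a made-up ratio above 1.0 for the fully-indexed CSV case):

```python
def estimated_disk_gb(raw_gb, ratio):
    """Estimated on-disk size: raw data size times an assumed
    combined (tsidx + journal.gz) compression ratio."""
    return raw_gb * ratio

# The divide-by-two rule of thumb:
print(estimated_disk_gb(100, 0.5))   # 50 GB

# Some data (e.g. fully indexed CSV, per reference 4) can exceed raw size:
print(estimated_disk_gb(100, 1.2))   # more disk than the original 100 GB
```

The point is simply that "divide by two" is a starting estimate; measuring your own indexes is the only way to get a reliable number.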