How do I calculate the approximate amount of data that needs to be indexed in order to procure licensing, given that there will be multiple sources?
Thank you Rich
So, the most accurate and probably "best" answer here is to call your Splunk rep. If you contact them, they have teams of people they can rely on for pre-sales work like identifying approximate data volumes and stuff. I say this is probably best because it's a "full featured" solution, in that they'll help you in so many other ways too.
But, for your specifics, there are a couple of ways.
First, count what you have. Like, literally. Let's say you have log files you are rolling once per day (so one file is one day), each file averages about 100 MB, and the maximum file size in the past week was 150 MB. Your first guess might be 150 MB * 1.25, because though Splunk compresses data, it also has indexing overhead. That's roughly 188 MB/day. I'd always leave some room for growth and slack, so call it at least 200 MB/day.
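If you want to script that first rough guess instead of eyeballing directory listings, something like the sketch below works. It just measures your rolled files and applies the same fudge factor; the path, the rotation pattern, and the 1.25 headroom number are all assumptions you'd swap for your own setup.

```python
#!/usr/bin/env python3
"""Rough daily-ingest estimate from rolled log files.

Assumptions (adjust for your environment): logs roll once per day,
one file per day, living under LOG_DIR and matching PATTERN.
"""
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")   # hypothetical path - change to yours
PATTERN = "access.log.*"           # hypothetical rotation pattern
HEADROOM = 1.25                    # growth/overhead fudge factor from above

# Size of each rolled file, in MB
sizes_mb = [f.stat().st_size / 1024 / 1024 for f in LOG_DIR.glob(PATTERN)]

if sizes_mb:
    avg_mb = sum(sizes_mb) / len(sizes_mb)
    max_mb = max(sizes_mb)
    print(f"files examined : {len(sizes_mb)}")
    print(f"average size   : {avg_mb:.1f} MB/day")
    print(f"peak size      : {max_mb:.1f} MB/day")
    print(f"suggested plan : {max_mb * HEADROOM:.0f} MB/day (peak x {HEADROOM})")
else:
    print(f"No files matching {PATTERN} under {LOG_DIR}")
```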
Your next bet is to just stand up a single-box Splunk install - a throwaway VM with reasonable specs (2+ cores, 4+ GB of RAM, and say 30 or 40 GB of disk space) can be used as a non-production box running Splunk Free to ingest a small portion of those files. Then take a look at your index size. For instance, on that particular set of data you may find compression is great, so it's only 150 MB after all is said and done. Or maybe it has lots of indexed fields and it's 300 MB/day. Either way, this method is fairly accurate.
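Once the test box has chewed through the sample, you can read the index size off the web UI, or pull it from the management REST API. Here's a minimal sketch of the latter, assuming the default management port (8089), an index named "main", and placeholder credentials - adjust all of those for your own instance.

```python
#!/usr/bin/env python3
"""Check how big the test index got, via Splunk's management REST API.

A minimal sketch: host, index name, and credentials below are
placeholders for a throwaway local test VM.
"""
import requests

SPLUNK_HOST = "https://localhost:8089"   # default management port
INDEX = "main"                           # index you ingested the sample into
AUTH = ("admin", "changeme")             # placeholder credentials

resp = requests.get(
    f"{SPLUNK_HOST}/services/data/indexes/{INDEX}",
    params={"output_mode": "json"},
    auth=AUTH,
    verify=False,   # self-signed cert on a throwaway test box
)
resp.raise_for_status()

# The endpoint returns one "entry" per index; its "content" dict holds the stats
content = resp.json()["entry"][0]["content"]
print(f"index        : {INDEX}")
print(f"size on disk : {content['currentDBSizeMB']} MB")
print(f"event count  : {content['totalEventCount']}")
```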
Repeat on the different kinds of data you have. For instance, all IIS or Apache logs will be pretty similar to one another, so the same calculations work. Windows Event Logs will be different, though - for those you will pretty much have to pull in a handful of servers to see how much data is really generated.
Either way, take those numbers and multiply them out by how many of those files or servers you have. Repeat until you've accounted for everything you want to pull in.
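To make that roll-up concrete, here's a trivial sketch of the multiplication step. Every number in it is a made-up placeholder, not a measurement - plug in whatever your own per-file and per-server tests came out to.

```python
#!/usr/bin/env python3
"""Roll per-source estimates up into a total daily volume.

All figures below are hypothetical placeholders.
"""

# source name -> (MB/day per instance, number of instances)
estimates = {
    "IIS logs":           (200, 12),
    "Apache logs":        (150, 4),
    "Windows Event Logs": (90, 40),
}

total_mb = 0
for source, (mb_per_day, count) in estimates.items():
    subtotal = mb_per_day * count
    total_mb += subtotal
    print(f"{source:<20} {mb_per_day:>4} MB/day x {count:>3} = {subtotal:>6} MB/day")

print(f"\nTotal: {total_mb} MB/day (~{total_mb / 1024:.1f} GB/day of license)")
```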
Or again, call your rep. 🙂
Happy Splunking!
-Rich