I have to set up an archiving policy and storage requirements in Splunk. Estimated log volume is 100 GB per day. Going by the documentation, Splunk will store about 50 GB of that on disk (a compression rate of roughly 50%). As the data ages, it will move that same 50 GB from hot -> warm -> cold. At that point I will set up an archival policy to S3 (AWS). I want to know: will Splunk archive the whole 50 GB or the original 100 GB to S3, and what amount of data will be indexed back when I restore it? Is it going to be 50 GB?
Normally, on average, Splunk will compress raw data to about half its original size, or thereabouts. So your original 100 GB will end up as roughly 35 GB of index files plus 15 GB of compressed raw data, as a rough estimate.
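As back-of-the-envelope arithmetic (the 15%/35% split is a common rule of thumb, not a guarantee):

```
100 GB raw logs/day
 -> ~15 GB compressed raw data (rawdata/journal.gz)
 -> ~35 GB index files (.tsidx) and metadata
 =  ~50 GB total on disk, i.e. the often-quoted "50%"
```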
When data is frozen (which is what I assume you mean by "archival policy"), only the compressed raw data is kept, and the index files are deleted. So only about 15% of the original raw data size is archived: roughly 15 GB in your case.
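For the S3 leg, note that with no archival setting at all Splunk simply deletes frozen buckets. A minimal sketch of wiring it up via coldToFrozenScript, assuming a hypothetical index name myindex, a hypothetical S3 bucket my-splunk-archive, and the AWS CLI installed on the indexer:

```
# indexes.conf on the indexer
[myindex]
homePath   = $SPLUNK_DB/myindex/db
coldPath   = $SPLUNK_DB/myindex/colddb
thawedPath = $SPLUNK_DB/myindex/thaweddb
# freeze buckets once their newest event is older than 90 days
frozenTimePeriodInSecs = 7776000
# Splunk appends the bucket's path as the last argument when calling this
coldToFrozenScript = "/opt/splunk/bin/scripts/frozen-to-s3.sh"
```

```
#!/bin/sh
# frozen-to-s3.sh -- hypothetical archiving script (a sketch, not production-ready)
set -e
BUCKET="$1"   # path of the bucket about to be frozen
# Copy the bucket to S3; only rawdata/ is strictly needed for a later restore
aws s3 cp --recursive "$BUCKET" "s3://my-splunk-archive/$(basename "$BUCKET")/"
# Exiting 0 tells Splunk it can now delete the local copy
exit 0
```

If the script exits non-zero, Splunk leaves the bucket in place and retries later, so a failed upload doesn't silently lose data.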
When/if you need to restore archived (frozen) data, you will have to rebuild the index files before you can search it again, which brings you back to the full 15 + 35 GB on disk.
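Restoring is a manual process: copy the archived bucket into the index's thaweddb directory and rebuild it. A sketch reusing the hypothetical names from above (the bucket directory name is made up):

```
# Pull an archived bucket back from S3
aws s3 cp --recursive "s3://my-splunk-archive/db_1549827604_1549820000_5" \
    "$SPLUNK_DB/myindex/thaweddb/db_1549827604_1549820000_5"

# Rebuild the .tsidx and metadata files from the raw data -- this is where
# the archived ~15% grows back to the full ~50% on disk
$SPLUNK_HOME/bin/splunk rebuild \
    "$SPLUNK_DB/myindex/thaweddb/db_1549827604_1549820000_5"
```

Depending on your Splunk version, you may need to restart the indexer before the thawed bucket becomes searchable.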
So the "50%" would be the size of the bucket as a whole, compared to the uncompressed .gz found in its rawdata directory.
This can vary from bucket to bucket, and will depend on the compressibility of the log data coming in. Over a diverse set of log sources, the figure "50%" is commonly mentioned as an average compression rate.
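Rather than trusting the rule of thumb, you can measure your actual ratio with the dbinspect search command; treat the eval below as a sketch (rawSize is in bytes, sizeOnDiskMB in megabytes):

```
| dbinspect index=main
| eval pctOfRaw = round((sizeOnDiskMB * 1024 * 1024) / rawSize * 100, 1)
| table bucketId state sizeOnDiskMB rawSize pctOfRaw
```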
By default, the 'main' index (defaultdb) is stored under $SPLUNK_DB/defaultdb (i.e. $SPLUNK_HOME/var/lib/splunk/defaultdb). In its db subdirectory you will find the hot and warm buckets as subdirs, e.g. hot_v1_<id> for hot buckets and db_<newestTime>_<oldestTime>_<id> for warm ones.
Inside a bucket there will be some metadata files and .tsidx files (the indexes used for searching the raw data). Finally, there will be a directory called 'rawdata' that contains the compressed raw data (journal.gz).
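For illustration, a warm bucket typically looks something like this (the bucket name and timestamps are examples; the exact file set varies by Splunk version):

```
db_1549827604_1549820000_5/
    Hosts.data
    Sources.data
    SourceTypes.data
    Strings.data
    bloomfilter
    1549827604-1549820000-123456789.tsidx
    rawdata/
        journal.gz
```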