Hi all,
We are currently estimating our network bandwidth needs and one of the questions we are trying to answer is about compression ratios for index replication.
Let's assume all our data comes from one site, say 100 GB per day. That goes through the following chain:
Collection (Universal Fws) ==(X GB)==> Filtering (Heavy FW) ==(Y GB)==> Storage (Index Site 1) ==(Z GB)==> Replication (Index Site 2)
I am trying to work out what the average values of X, Y and Z (network throughput) would be.
Any help would be much appreciated.
Thanks,
J
SSL compression only applies during transit.
When the data arrives at the indexers the following will be true:
Data Size Against License - 100GB
Data Size Stored in Index - ~50GB
Data Size Replicated Across Network from Indexer 1 to Indexer 2 - ~50GB
The final compression ratio depends heavily on the type of data. For example, binary data can't be compressed at all, while plain text can compress by up to 99%. So we generally assume 50% final compression for capacity planning.
Honestly, it varies more with how much indexing overhead you have than with the type of data: most raw data compresses to about 15% of its original size, and the key-value pairs and other indexing overhead account for roughly another 35% of the final storage requirement. Your replication factor also matters. The link above is your best bet; a rough worked example is below.
Same answers are given here: https://answers.splunk.com/answers/147951/what-is-the-compression-ratio-of-raw-data-in-splunk.html
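To make those rules of thumb concrete, here is a minimal Python sketch of the capacity math. The 15% raw-compression and 35% index-overhead ratios are the planning assumptions from above, and the function name and default factors are purely illustrative, not anything Splunk ships:

    RAW_COMPRESSION = 0.15   # journal.gz is roughly 15% of raw data (planning assumption)
    INDEX_OVERHEAD  = 0.35   # key-value pairs / index files, roughly 35% of raw (planning assumption)

    def daily_storage_gb(raw_gb_per_day, replication_factor=2, search_factor=2):
        """Estimate daily disk usage across an indexer cluster.

        Compressed raw data is kept for every replicated copy (replication factor);
        index files are kept only for searchable copies (search factor).
        """
        raw_copies   = raw_gb_per_day * RAW_COMPRESSION * replication_factor
        index_copies = raw_gb_per_day * INDEX_OVERHEAD * search_factor
        return raw_copies + index_copies

    print(daily_storage_gb(100, 1, 1))   # -> 50.0 GB/day, the ~50% single-copy figure above
    print(daily_storage_gb(100))         # -> 100.0 GB/day with RF=2 / SF=2

With a single copy you land right on the ~50% figure quoted above; with a replication factor and search factor of 2, the cluster-wide footprint roughly doubles.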
Hi,
just to add one important thing (if someone finds this article using a search engine).
In a normal clustered environment, a primary peer (indexer) only streams the compressed raw data (aka journal.gz) plus some metadata to its replication targets; we estimate this volume at roughly 15% of the raw data volume.
http://docs.splunk.com/Documentation/Splunk/latest/Indexer/Bucketsandclusters
The secondary peer then re-indexes the data, depending on the configured search factor.
Just to make sure we are talking about the right thing....
If an indexer is streaming 100 GB/day of raw data (which is roughly 15 GB/day as journal.gz), that works out to about 1/6 of a megabyte per second on the wire... which is roughly 1.5 MEGABIT/sec...
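For anyone who wants to reproduce that arithmetic, here is a minimal sketch, assuming the ~15% journal.gz ratio from above and an even spread over 24 hours (i.e. ignoring bursts):

    raw_gb_per_day = 100
    journal_ratio  = 0.15                          # compressed raw (journal.gz) vs. raw, planning assumption

    gb_on_wire   = raw_gb_per_day * journal_ratio  # ~15 GB/day replicated
    mb_per_sec   = gb_on_wire * 1000 / 86_400      # ~0.17 MB/s, i.e. about 1/6 MB/s
    mbit_per_sec = mb_per_sec * 8                  # ~1.4 Mbit/s

    print(f"{gb_on_wire:.0f} GB/day ~= {mb_per_sec:.2f} MB/s ~= {mbit_per_sec:.1f} Mbit/s")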
I don't want to be rude... but
Guys, really? Bandwidth control in a datacenter? For 1/600th or 1/6000th of your available bandwidth?
HTH,
Holger
It's not a matter of IF but WHEN a client will ask you to calculate how much bandwidth Splunk will consume. The question is valid. So what are you adding to the answer here?