Deployment Architecture

Compression ratio for index replication

javiergn
SplunkTrust

Hi all,

We are currently estimating our network bandwidth needs and one of the questions we are trying to answer is about compression ratios for index replication.

So let's assume all our data comes from one site. This is, let's say, 100GB per day. That goes through the following chain:

Collection (Universal Fws) ==(X GB)==> Filtering (Heavy FW) ==(Y GB)==> Storage (Index Site 1) ==(Z GB)==> Replication (Index Site 2)

I am trying to answer what the values of X, Y and Z (network throughput) would be on average.

  • X: Uncooked data. If we use SSL, compression ratio would be 1:14 on average according to this = 7.14GB
  • Y: Cooked data. SSL. Compression ratio 1:14 but no change with the previous one = 7.14GB
  • Z: Indexed data. Bucket is replicated and includes both RAW and Indexes = ?
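The arithmetic behind X and Y in the list above can be sketched as follows. This is only a rough illustration: it assumes the 1:14 ratio quoted in the question holds for the whole 100GB/day, and the function name `compressed_gb` is hypothetical, not part of any Splunk tooling.

```python
def compressed_gb(raw_gb: float, ratio: float = 14.0) -> float:
    """Estimate on-the-wire volume for raw_gb at a 1:ratio compression ratio.

    The default of 14 is the average ratio quoted in the question; real
    ratios depend heavily on the data being forwarded.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_gb / ratio


daily_raw_gb = 100.0                  # daily volume assumed in the question
x_gb = compressed_gb(daily_raw_gb)    # Universal Forwarders -> Heavy Forwarder
y_gb = compressed_gb(daily_raw_gb)    # Heavy Forwarder -> indexers, same ratio assumed

print(f"X ~ {x_gb:.2f} GB/day, Y ~ {y_gb:.2f} GB/day")
```

Z is left out on purpose, since it depends on what is actually replicated between the index sites.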

Any help would be much appreciated.

Thanks,
J


jkat54
SplunkTrust

SSL compression only applies during transit.

When the data arrives at the indexers the following will be true:

Data Size Against License - 100GB
Data Size Stored in Index - ~50GB
Data Size Replicated Across Network from Indexer 1 to Indexer 2 - ~50GB

The final compression ratio has a lot to do with the type of data. For example, binary data can't be compressed at all, text files compress by up to 99%, etc. So we generally go with 50% final compression for capacity planning.

http://docs.splunk.com/Documentation/Splunk/6.1.2/Indexer/Systemrequirements#Storage_requirement_exa...

Honestly, it will vary based on how much indexing overhead you have more so than the type of data: most raw data compresses to about 15% of its original size, and the key-value pairs and other indexing overhead add about another 35%, which is where the ~50% figure comes from. Your replication factor also matters. The link above is your best bet.
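As a rough sketch of that rule of thumb: the 15% and 35% fractions are the planning figures mentioned above, while the function name `index_storage_gb` and the assumption that every replicated copy is a full, searchable copy are mine and purely illustrative.

```python
def index_storage_gb(
    raw_gb: float,
    rawdata_fraction: float = 0.15,  # compressed raw data, planning figure
    index_fraction: float = 0.35,    # index files and other overhead, planning figure
    copies: int = 1,                 # number of full copies kept (assumption: all searchable)
) -> float:
    """Estimate on-disk storage for raw_gb of licensed data."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_gb * (rawdata_fraction + index_fraction) * copies


single_copy = index_storage_gb(100.0)           # one indexer, one copy
two_copies = index_storage_gb(100.0, copies=2)  # original plus one replica

print(f"one copy: ~{single_copy:.0f} GB, two copies: ~{two_copies:.0f} GB")
```

Measure against your own data before committing to numbers; these fractions are only starting points for capacity planning.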

Same answers are given here: https://answers.splunk.com/answers/147951/what-is-the-compression-ratio-of-raw-data-in-splunk.html


hsesterhenn_spl
Splunk Employee

Hi,
just to add one important thing (if someone finds this article using a search engine).
In a normal clustered environment, a primary peer (indexer) only streams the compressed raw data (aka journal.gz) and some metadata; we estimate this volume at about 15% of the raw data size.
http://docs.splunk.com/Documentation/Splunk/latest/Indexer/Bucketsandclusters
The secondary peer is then indexing the data again, depending on the search factor configured.
Just to make sure we are talking about the right thing....
If an indexer is streaming 100GB/day of raw data (which is approx 15GB/day in journal.gz), this results in about 1/6th of a megabyte/sec on the wire, which is approx 1.4 MEGABIT/sec.
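The conversion above can be checked with a few lines. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB) and traffic spread evenly over 24 hours, so real peaks will be higher; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def gb_per_day_to_mbit_per_s(gb_per_day: float) -> float:
    """Convert a daily volume in GB to an average rate in megabits/sec.

    Assumes decimal units (1 GB = 1000 MB) and a perfectly even load.
    """
    mb_per_s = gb_per_day * 1000 / SECONDS_PER_DAY
    return mb_per_s * 8  # 8 bits per byte


raw_gb_per_day = 100.0
journal_gb_per_day = raw_gb_per_day * 0.15  # ~15% of raw, as estimated above

rate = gb_per_day_to_mbit_per_s(journal_gb_per_day)
print(f"~{journal_gb_per_day:.0f} GB/day is about {rate:.2f} Mbit/s on average")
```

Passing 50 instead of 15 gives the average rate implied by the ~50GB replication estimate in the accepted answer, so both figures can be compared on the same footing.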

I don't want to be rude... but

Guys, really? Bandwidth control in a datacenter? For 1/600th or 1/6000th of your available bandwidth?

HTH,
Holger


jkat54
SplunkTrust

It’s not a matter of IF but WHEN a client will ask you to calculate how much bandwidth Splunk will consume. The question is valid. So what are you adding to the answer here?

