Deployment Architecture

Planning Cluster Total Storage Capacity (when no single peer holds the entire bucket set)

ejpulsar
Path Finder

Hi,

I've read several cluster deployment references but still have no clear answer to one question.

I need to store 50 TB of data in a cluster of 30-50 typical peers, each with 1-2 TB of RAID 1/10 storage. I need to be able to expand this storage simply by adding peers. I do not want to use storage systems with a 25-50 TB pool (they are too expensive). Can Splunk spread buckets across peers, so that the first peer holds only part of the entire bucket set?

According to "Buckets and clusters", the first peer holds all buckets, but when it dies, all buckets are spread across the cluster.

Is there any performance impact on searching? Or must we use storage systems or third-party file system virtualization tools?

1 Solution

kristian_kolb
Ultra Champion

Hm, no. Well.

With a clustered setup, all peers will hold buckets. There is no layered/tiered indexer structure, where there are primary and secondary indexers.

With a normal setup, forwarders will send data to all indexers (loadbalancing between them). Then as part of the index replication functionality, indexers will send data between themselves, in order to have redundant copies of the indexed data. Thus, each indexer will have both primary buckets (containing data that came straight from a forwarder) and replicated buckets (which were copied from another indexer).
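To sketch what that load-balanced setup looks like in practice, here is a minimal forwarder outputs.conf. The group name and host names are hypothetical placeholders for your own peers:

```
[tcpout]
defaultGroup = my_cluster_peers

[tcpout:my_cluster_peers]
# The forwarder automatically load-balances across this list of indexers,
# switching targets periodically.
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
```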

Assuming that you have 3 indexers, and a replication factor of 2, and a search factor of 2, the bucket distribution could look like this.

UPPERCASE = primary buckets
lowercase = replicated buckets

host         indexer1    indexer2    indexer3
Primary      A, D        B, E        C, F
Replicated   e, c        a, f        b, d

With both RF=2 and SF=2, data will take up twice the space. So if your original logs are 50 TB, you can count on an average compression rate of 50% (compressed raw data + indexes for making it searchable), netting 25TB. But since you have index replication your storage needs are doubled (for this scenario), so you're back at needing 50TB of hard drive space.
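The arithmetic above can be written out as a quick sketch. The 50% figure is a rule of thumb for compressed raw data plus index files, not an exact Splunk formula:

```python
# Rough storage sizing for the scenario above (rule-of-thumb sketch).
raw_tb = 50.0            # original log volume, in TB
compressed_ratio = 0.5   # ~50% of raw remains as compressed data + tsidx files
replication_factor = 2   # RF=2 (with SF=2, searchable copies are included here)

disk_needed_tb = raw_tb * compressed_ratio * replication_factor
print(disk_needed_tb)    # 50.0 TB of total cluster disk
```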

Hope this helps,

K


mahamed_splunk
Splunk Employee

Splunk compresses the data before storing it on disk. It also needs to build search files (TSIDX) on top of the raw data to speed up searching.

The following blog post talks about storage requirements in clustering.

http://blogs.splunk.com/2013/01/31/disk-space-estimator-for-index-replication/


kristian_kolb
Ultra Champion

Well, as always, the answer is "It depends". There may be geographical or topological reasons against this. But in theory, yes, spreading the data over more indexers allows for faster search results.

/K

ejpulsar
Path Finder

Hello

Should we point the forwarders at all 50 peers?


gfuente
Motivator

Hello,

So, with the explanation that Kristian gave and your data, you will have:

50 peers with 2 TB each = 100 TB, so you can store up to 50 TB (x2 due to replication)
RF = 2 and SF = 2
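That capacity check can be spelled out in a short sketch (same numbers as above, rounded for illustration):

```python
# Cluster capacity check for the numbers above.
peers = 50
per_peer_tb = 2
total_disk_tb = peers * per_peer_tb      # 100 TB of raw disk in the cluster

replication_factor = 2                   # RF=2 doubles the stored data
searchable_tb = total_disk_tb / replication_factor
print(searchable_tb)                     # 50.0 TB of (post-compression) data
```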

Regards



ejpulsar
Path Finder

Hi Kristian! Thanks for the answer, and sorry for the late accept.
Now I've got it figured out.
