I've read several cluster deployment references but still have no clear answer to one question.
I need to store 50 TB of data in a cluster of 30-50 typical peers, each with 1-2 TB of RAID 1 or RAID 10 storage. I want to expand this storage simply by adding peers. I do not want to use storage systems with a 25-50 TB pool (they are too expensive). Can Splunk spread buckets across other peers, so that the first peer holds only part of the full set of buckets?
According to "Buckets and clusters", the first peer holds all the buckets, but when it dies, the buckets are spread across the whole cluster.
Is there any performance impact on searching? Or must we use storage systems or third-party file system virtualization tools?
Hm, no. Well.
With a clustered setup, all peers will hold buckets. There is no layered/tiered indexer structure, where there are primary and secondary indexers.
With a normal setup, forwarders will send data to all indexers (load balancing between them). Then, as part of the index replication functionality, indexers will send data between themselves in order to have redundant copies of the indexed data. Thus, each indexer will have both primary buckets (containing data that came straight from a forwarder) and replicated buckets (which were copied from another indexer).
Assuming that you have 3 indexers, and a replication factor of 2, and a search factor of 2, the bucket distribution could look like this.
UPPERCASE = primary buckets
lowercase = replicated buckets
host        indexer1  indexer2  indexer3
Primary     A, D      B, E      C, F
Replicated  e, c      a, f      b, d
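To make the layout above more concrete, here is a minimal Python sketch of how such a distribution can arise. It is a toy model, not Splunk's actual placement logic: the function name `distribute`, the round-robin choice of the originating indexer, and the "next peer in the ring" choice of replication target are all assumptions made for illustration. The exact placement therefore differs from the table, but the shape is the same: every indexer ends up with a mix of primary and replicated buckets.

```python
def distribute(buckets, indexers, replication_factor):
    """Toy model: spread buckets over indexers and place RF-1 copies on other peers.

    Assumption: forwarders load-balance round-robin, and each copy goes to the
    next indexer(s) in the list. A real cluster picks targets differently.
    """
    if replication_factor > len(indexers):
        raise ValueError("replication factor cannot exceed the number of indexers")

    layout = {name: {"primary": [], "replicated": []} for name in indexers}
    n = len(indexers)
    for i, bucket in enumerate(buckets):
        origin = i % n  # indexer that received the data from a forwarder
        layout[indexers[origin]]["primary"].append(bucket.upper())
        for k in range(1, replication_factor):
            target = (origin + k) % n  # a different peer holds each extra copy
            layout[indexers[target]]["replicated"].append(bucket.lower())
    return layout


if __name__ == "__main__":
    result = distribute(list("ABCDEF"), ["indexer1", "indexer2", "indexer3"], 2)
    for host, held in result.items():
        print(host, "primary:", held["primary"], "replicated:", held["replicated"])
```

With six buckets, three indexers and RF=2, each indexer holds four bucket copies, so no single peer has to hold everything.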
With both RF=2 and SF=2, indexed data will take up twice the space. If your original logs total 50 TB, you can count on the indexed data averaging about 50% of that size (compressed raw data + indexes for making it searchable), netting 25 TB. But since you have index replication, your storage needs are doubled (for this scenario), so you're back at needing 50 TB of hard drive space.
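The arithmetic above can be written down as a small Python sketch. The function name `required_disk_tb` and its parameters are hypothetical, and it assumes, as in this scenario, that SF equals RF (every copy is fully searchable) and that indexed data averages 50% of the raw log size. Real ratios vary with the data.

```python
def required_disk_tb(raw_tb, replication_factor, compression_ratio=0.5):
    """Estimate total cluster disk space in TB.

    Assumptions: SF == RF, and indexed data (compressed raw data plus index
    files) occupies `compression_ratio` times the original log volume.
    """
    per_copy_tb = raw_tb * compression_ratio   # one full copy of the indexed data
    return per_copy_tb * replication_factor    # every bucket is stored RF times


if __name__ == "__main__":
    print(required_disk_tb(50, replication_factor=2))  # 50 TB raw, RF=2 -> 50.0
```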
Hope this helps,
Hi Kristian! Thanks for the answer and late accept.
Now I have clearly figured this out.
So, with the explanation that Kristian gave and your data, you will have:
50 peers with 2 TB each = 100 TB of disk, so you can store up to 50 TB of indexed data (each bucket is stored twice due to replication)
RF = 2 and SF = 2
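Since the plan is to grow by adding peers, the same calculation can be turned around. The sketch below is illustrative only: the function names are hypothetical, and it ignores free-space headroom and uneven bucket distribution, both of which you would want to allow for in practice.

```python
import math


def indexed_capacity_tb(peers, disk_per_peer_tb, replication_factor):
    """Unique indexed data (TB) the cluster can hold when every bucket is stored RF times."""
    return peers * disk_per_peer_tb / replication_factor


def peers_needed(indexed_tb, disk_per_peer_tb, replication_factor):
    """Smallest number of peers whose combined disk fits RF copies of the indexed data."""
    return math.ceil(indexed_tb * replication_factor / disk_per_peer_tb)


if __name__ == "__main__":
    print(indexed_capacity_tb(50, 2, 2))  # 50 peers x 2 TB, RF=2
    print(peers_needed(50, 1, 2))         # same data on 1 TB peers
```

Under these assumptions, 50 peers with 2 TB each hold 50 TB of indexed data at RF=2, while the same data on 1 TB peers would need 100 peers.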
Well, as always, the answer is "It depends". There may be geographical or topological reasons against this. But in theory, yes, spreading the data over more indexers allows for faster search results.
Splunk compresses the data before storing it on disk. It also needs to build search files (TSIDX) on top of the raw data to speed up searching.
The following blog post talks about storage requirements in clustering.