Solved: How to estimate indexer data replication time from complete failure?

DEAD_BEEF · ‎10-10-2019

I was asked to come up with some rough numbers on how long it would take to rebuild an indexer if one completely died, that is, if I were to remove an existing indexer from my multi-site cluster (2 sites) and put a new one in its place. I know there are a lot of variables, but I am asking for help on how to get some rough numbers. The last time this happened, it took about 12 hours for the cluster to meet RF/SF after replacing a single indexer.

How can I calculate an estimate for this?

To simplify the question, assume the following:

20 indexers (10 at each site)
10TB of data (hot+cold) on each indexer
RF=3, SF=2
Splunk-recommended hardware (800 IOPS)
Minimal WAN latency between the two sites (100-150ms)
Default 5 fix-up tasks per indexer
50,000 buckets per indexer
10 Gigabit circuit

In essence, the cluster would need to reproduce 10TB of data: 5TB would be handled by the indexers in one data center and 5TB by those in the other (assuming a 50% split in workload).

Would this just be 10TB = 80,000 Gb, and 80,000 Gb / 5 Gbps = 16,000 seconds (about 4.4 hours)? That's far shorter than my real-life experience, where it took 12 hours. What am I missing in my calculation?
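For reference, the back-of-the-envelope calculation above can be written out as a small Python sketch. The function names are made up for illustration, and the 5 Gbps figure is simply the rate assumed in the question. The second function turns the question around and shows the average throughput that the observed 12-hour rebuild implies (roughly 1.85 Gbps), which gives a feel for how far the real process runs below link speed.

```python
# Naive model: treat the rebuild as one bulk transfer over the network link.
# Decimal units are assumed throughout: 1 TB = 8,000 gigabits.
TB_TO_GBIT = 8_000


def transfer_hours(data_tb: float, rate_gbps: float) -> float:
    """Hours needed to move data_tb terabytes at a sustained rate_gbps."""
    seconds = data_tb * TB_TO_GBIT / rate_gbps
    return seconds / 3600


def effective_gbps(data_tb: float, observed_hours: float) -> float:
    """Average throughput implied by an observed rebuild duration."""
    return data_tb * TB_TO_GBIT / (observed_hours * 3600)


# Figures taken from the question: 10 TB, 5 Gbps assumed, 12 h observed.
print(f"naive estimate: {transfer_hours(10, 5):.2f} h")
print(f"rate implied by 12 h: {effective_gbps(10, 12):.2f} Gbps")
```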

ololdach · ‎10-10-2019

Hi DEAD_BEEF,
just a couple of clues on where you might be missing some additional time:
- Rebuilding the cluster is not high priority. Depending on the indexing and search load, the indexers will nice the bucket replications.
- The data is not transferred in bulk, but bucket by bucket. Every bucket received at the rebuilt target is reprocessed locally before the next bucket is fetched. Expect some processing overhead on both sides.
- Depending on whether the rebuilt indexes need to be searchable, the target has to rebuild the tsidx files, which easily add up to 33% of the original bucket data.
- The most current buckets need to roll before they can be replicated.
- Depending on the speed of the warm/cold drives (where most of the data goes), the disks will be one of your bottlenecks on both the sending and the receiving indexers. Check the I/O throughput on the warm/cold storage of your production indexers. I would bet that they cannot saturate a 10Gbps link.

In summary, given the question at hand, I suggest that the sustained I/O bandwidth of the warm/cold storage will determine the overall time to restore the indexer and should give you a rough estimate. If you have done a restore before, divide the amount of data by the bandwidth of the drives. The difference between that figure and the observed time is the processing overhead due to the factors above. Express that overhead as a percentage of the data/bandwidth time, apply it to the current data volume, and you should be good to go.
Oliver
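The calibration approach described in the answer could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the 400 MB/s sustained disk throughput is a placeholder assumption, not a measured or recommended value. Substitute the figure you actually measure on your warm/cold storage. The 10 TB and 12 hour values come from the question; the 15 TB target is an arbitrary example.

```python
def baseline_hours(data_tb: float, disk_mb_per_s: float) -> float:
    """Pure data/bandwidth time, ignoring all processing overhead."""
    seconds = data_tb * 1_000_000 / disk_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


def overhead_pct(observed_hours: float, data_tb: float, disk_mb_per_s: float) -> float:
    """Overhead of a past rebuild as a percentage of its baseline time."""
    base = baseline_hours(data_tb, disk_mb_per_s)
    return (observed_hours - base) / base * 100


def estimate_hours(data_tb: float, disk_mb_per_s: float, overhead: float) -> float:
    """Baseline time for a new rebuild, inflated by the calibrated overhead."""
    return baseline_hours(data_tb, disk_mb_per_s) * (1 + overhead / 100)


DISK_MB_PER_S = 400  # ASSUMPTION: placeholder, measure your own storage

# Calibrate on the past rebuild from the question: 10 TB took 12 hours.
pct = overhead_pct(12, 10, DISK_MB_PER_S)
print(f"baseline: {baseline_hours(10, DISK_MB_PER_S):.2f} h, overhead: {pct:.0f}%")

# Apply the same overhead to a hypothetical 15 TB indexer.
print(f"estimate for 15 TB: {estimate_hours(15, DISK_MB_PER_S, pct):.1f} h")
```

If the disk throughput is the same for both rebuilds, this reduces to scaling the observed time linearly with data volume. Keeping the overhead separate is mainly useful when the storage, and therefore the baseline, changes between the calibration run and the rebuild you are estimating.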
