Deployment Architecture

How to improve index replication speed?

murikadan
Path Finder

Dear Splunkers,

I am migrating a multi-site indexer cluster with 2 sites: RF=2, SF=2, with one copy of raw data and tsidx data in each site. There are 40 indexers in total, 20 per site.

The approach is as follows:

  1. Bring up 40 new indexers, 20 in each new site
  2. Put each of the 40 old indexers in detention
  3. Configure forwarders to forward data only to the new indexers
  4. Perform an indexer data rebalance
  5. Offline the old indexers one by one, alternating between sites, with enforce-counts enabled (the indexers still need to support search heads as usual)

I am currently at step 5, and the problem is that offlining each indexer takes a couple of hours. I am aware that many factors, not least the hardware and the amount of data (~900 TB in total), play a significant role here. Nevertheless, I would like to know if there are improvements that can still be made through Splunk configuration changes.
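For reference, this is roughly what I run for each peer (the wait step is just how I am checking progress, not a prescribed procedure):

# on the peer being decommissioned
splunk offline --enforce-counts

# then, on the master node, wait until the fixup activity settles
# before moving on to the next peer
splunk show cluster-status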

Appreciate your thoughts,
Thanks,


beatus
Communicator

Murikadan,
You can adjust the number of buckets a peer works on concurrently via settings on the Master Node. In your configuration you can add the following:
server.conf

[clustering]
max_peer_build_load = <integer>
* This is the maximum number of concurrent tasks to make buckets
  searchable that can be assigned to a peer.
* Defaults to 2.

max_peer_rep_load = <integer>
* This is the maximum number of concurrent non-streaming
  replications that a peer can take part in as a target.
* Defaults to 5.

max_peer_sum_rep_load = <integer>
* This is the maximum number of concurrent summary replications
  that a peer can take part in as either a target or source.
* Defaults to 5.

Provided you have the hardware to handle the additional CPU, memory, and disk load, these values can be safely increased. Not knowing your environment, I'd recommend some caution: increase the settings in small increments while monitoring load on your indexers.

Additionally, these settings can be modified in memory only (that is, at run time, not saved to config) with the following commands, and no restart is required. Perform these on the Master Node:

splunk edit cluster-config -max_peer_build_load 4
splunk edit cluster-config -max_peer_rep_load 10
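If you want the change to persist across restarts, you could instead set the same values in server.conf on the Master Node; a rough example (the numbers are only an illustration, tune them to your hardware):

# $SPLUNK_HOME/etc/system/local/server.conf on the Master Node
[clustering]
max_peer_build_load = 4
max_peer_rep_load = 10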

edoardo_vicendo
Contributor

That's a good suggestion; I just wanted to add what we discovered.

During our data migration we found that even increasing the following setting on the Master Node did not speed up the replication process:

max_peer_build_load

As written here:

https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Takeapeeroffline#:~:text=If%20the%20sear....

This is due to the following reason:

The search factor. This determines how quickly the cluster can convert non-searchable copies to searchable. If the search factor is at least 2, the cluster can convert non-searchable copies to searchable by copying index files to the non-searchable copies from the remaining set of searchable copies. If the search factor is 1, however, the cluster must convert non-searchable copies by rebuilding the index files, a much slower process. (For information on the types of files in a bucket, see Data files.)

The time required to rebuild the index files on a non-searchable bucket copy containing 4GB of rawdata depends on a number of factors such as the size of the resulting index files, but 30 minutes is a reasonable approximation to start with. Rebuilding index files is necessary if the search factor is 1, meaning that there are no copies of the index files available to stream. A non-searchable bucket copy consisting of 4GB rawdata can grow to a size approximating 10GB once the index files have been added. As described earlier, the actual size depends on numerous factors.

Therefore, copying tsidx files over the network is much faster than rebuilding them on the target peer.

That said, increasing max_peer_build_load may be bounded by your network bandwidth, so if you are already using all the available bandwidth (or have intentionally limited it to avoid issues within your network infrastructure) you will not see any benefit.

Conversely, if with the default value (max_peer_build_load = 2) you are not sending data at the maximum available network speed, increasing that value can significantly speed up the replication process.
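One way to tell which of the two situations you are in: while you change the value, keep an eye on the outstanding fixup tasks on the Master Node and the network throughput on a replication target. A rough sketch (credentials and host are placeholders, and I believe the fixup endpoint is available in recent versions):

# pending replication-factor fixup tasks, queried from the Master Node
curl -k -u admin:changeme "https://localhost:8089/services/cluster/master/fixup?level=replication_factor"

# network throughput on a target peer (Linux, sysstat package)
sar -n DEV 5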


gjanders
SplunkTrust

Splunk 7.0 has some new features that might help in this area; however, beyond having faster I/O and/or faster servers, I am not sure there are any tweaks you can make to improve this...
