Deployment Architecture

Is there any advantage to setting RF (Replication Factor) to a different number than SF (Search Factor)?

Motivator

For example, what benefit would I get from using a Replication Factor of 1 and a Search Factor of 2, vs a Replication Factor of 2 and a Search Factor of 2?

Splunk Employee

Rob

You are mostly right. Deciding how far to increase the replication factor and search factor is a trade-off between resource usage, disaster resilience, and high availability.

  • Making a replicated copy costs: network traffic, disk I/O, disk space
  • Making a replicated copy searchable costs: CPU, memory, disk I/O, extra disk space

Rules of thumb, with an example of 4 indexers in a cluster:

  • You always want Replication Factor >= Search Factor (example: 2/2 or 3/2, etc.)

  • Replication Factor: If you want to lose no data when N of your servers are permanently lost, you need Replication Factor >= N+1 (example: I have 4 servers in the cluster and want to lose no data when 2 of them are dead -> I need Replication Factor = 3)

  • Search Factor: Immediately after you lose N servers, all your data is still present.
    Do you want your remaining copies to be immediately searchable?

    • If yes, then you need Replication Factor = Search Factor (example: 3/3, all copies are maintained searchable)
    • If not, then you can have Replication Factor > Search Factor and wait until the copies are made searchable automatically (example: 3/1, only the primary was searchable, so a new copy will be made searchable when needed)
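The rules above can be sketched as a small check. This is only an illustration: `check_factors` and its result keys are hypothetical names, not part of Splunk, and the check that the replication factor must not exceed the number of peers is an assumption added here.

```python
def check_factors(rf: int, sf: int, peers: int) -> dict:
    """Sanity-check a replication/search factor pair (hypothetical helper)."""
    if rf < 1 or sf < 1:
        raise ValueError("both factors must be at least 1")
    if sf > rf:
        raise ValueError("search factor cannot exceed replication factor")
    if rf > peers:
        # Assumption: a cluster needs at least as many peers as copies.
        raise ValueError("replication factor cannot exceed the number of peers")
    return {
        # With rf copies, the data survives the loss of rf - 1 peers.
        "peer_losses_without_data_loss": rf - 1,
        # With sf searchable copies, sf - 1 peers can be lost before any
        # copy has to be made searchable again.
        "peer_losses_still_searchable_immediately": sf - 1,
    }


# The 4-indexer example from the post: survive 2 dead servers with RF 3.
print(check_factors(rf=3, sf=1, peers=4))
print(check_factors(rf=3, sf=3, peers=4))
```

With 3/1 the data survives two lost servers but may need a rebuild before it is searchable again; with 3/3 it stays searchable right away.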

Motivator

I've seen the formulas for calculating the base amount of disk consumed for RF and SF, and those make sense to me. However, I haven't seen any formulas that also calculate the disk space consumed during an outage. My fear is that disk space runs out during a failure of one or more indexers when the SF is lower than the RF.

Splunk Employee

Yes, consider that a searchable copy will take as much disk space as the primary bucket,
while a non-searchable copy may be up to 30% smaller (no tsidx files, no metadata files).

(Compare your own buckets to see your actual ratio; it may vary depending on the events and the optimization of the tsidx files.)

Example: if you have Search Factor = 3 and your index was 500GB, you now need 3 x 500GB. Another way to see it: if you keep the same disk space and increase the Replication/Search Factor from 1 to 3, then 2/3 of your data will be deleted to make room for the copies.
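As a rough sketch of this arithmetic (the function names are hypothetical, and the 30% figure is just the upper bound quoted above, so substitute the ratio you measure on your own buckets):

```python
def cluster_disk_gb(searchable_gb: float, rf: int, sf: int,
                    nonsearchable_saving: float = 0.30) -> float:
    """Estimate total disk used across the cluster for one index.

    searchable_gb: size of one searchable copy of the index.
    nonsearchable_saving: how much smaller a non-searchable copy is
    (assumed 0.30 here, the "up to 30%" figure from the post).
    """
    if not 1 <= sf <= rf:
        raise ValueError("need 1 <= sf <= rf")
    nonsearchable_gb = searchable_gb * (1 - nonsearchable_saving)
    return sf * searchable_gb + (rf - sf) * nonsearchable_gb


def retained_fraction(searchable_gb: float, rf: int, sf: int) -> float:
    """Share of the data that still fits if total disk stays at one copy's size."""
    return searchable_gb / cluster_disk_gb(searchable_gb, rf, sf)


print(cluster_disk_gb(500, rf=3, sf=3))    # the 3 x 500GB example
print(cluster_disk_gb(500, rf=3, sf=1))    # one searchable, two non-searchable
print(retained_fraction(500, rf=3, sf=3))  # about 1/3 kept, 2/3 deleted
```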

Motivator

Thanks, Yann. How do I factor in disk space requirements when the SF is below the RF, i.e. how much disk space do I need to keep in reserve for failures? Is there a formula? Should I allocate the same disk space for RF2/SF1 as for RF2/SF2, knowing that I will likely use more when there is a failure?

Legend

The replication factor (RF) sets the number of copies of the raw data. Setting the RF to a higher number increases your protection against data loss if search peers (indexers) go down.

The search factor (SF) controls "how many copies of the data are searchable." In fact, this is the number of copies of the index files. Setting the SF to a higher number increases the likelihood that users will be able to search without interruption even if search peers go down.

Your search factor can never be higher than your replication factor. [Think about it - if I have only a single copy of my data (RF=1) and then I lost a search peer - then it wouldn't matter if I had many copies of the index files (SF>=2) - the data would not be searchable because it would be lost!]

How many copies of the data (RF) and how many copies of the index files (SF) are decisions that you have to make based on how much resource you want to spend (indexers and disks) vs. uptime and data recovery requirements.

I hope this makes things a little more clear.

Motivator

Thanks, Josh. I'm going to stick with keeping search and replication factors at the same value as a best practice until someone tells me otherwise. While I may be able to save some disk space, it seems I may be introducing some recovery headaches during outages by setting the Search Factor to a lower number than the Replication Factor.

Splunk Employee

Clustering seeks to regain its configured level of redundancy once outages occur, so avoiding additional resource usage during failures is not realistic.
Instead, plan your capacity so that you can handle your indexing AND search workload during an outage, with reduced capacity plus the additional load of re-replicating data. As a ballpark, allow perhaps 20% contention from re-replication. At very small cluster sizes, where the redundancy factors cannot be satisfied during outages, this concern does not apply.
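
To put rough numbers on the disk side of this, here is a minimal sketch. It assumes data is spread evenly across peers and that the cluster fully restores its RF and SF on the surviving peers; `per_peer_gb` and `extra_headroom` are hypothetical names used only for illustration.

```python
def per_peer_gb(total_cluster_gb: float, peers: int, rf: int,
                failed: int = 0) -> float:
    """Average disk per surviving peer once redundancy is restored."""
    remaining = peers - failed
    if remaining < rf:
        # Too few peers left to hold rf copies: the factors cannot be
        # satisfied, so this estimate does not apply.
        raise ValueError("not enough surviving peers to restore redundancy")
    return total_cluster_gb / remaining


def extra_headroom(peers: int, failed: int) -> float:
    """Fractional increase in per-peer disk after `failed` peers are lost."""
    return peers / (peers - failed) - 1


# Example: 1200GB across 4 peers with RF 3, then one peer is lost.
print(per_peer_gb(1200, peers=4, rf=3))            # before the failure
print(per_peer_gb(1200, peers=4, rf=3, failed=1))  # after re-replication
print(extra_headroom(peers=4, failed=1))           # about one third more
```

Under these assumptions the cluster-wide total is the same for a given RF/SF before and after recovery; what changes is that it lands on fewer peers.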

Motivator

Hi, Lisa, thank you for the quick response. Let me clarify my question a little better.

RF = Replication Factor (replicated copies)
SF = Search Factor (online copies)
These are the possible combinations for the discussion.

  • RF2,SF1
  • RF2,SF2
  • RF3,SF1
  • RF3,SF2
  • RF3,SF3

I'm going to make a few assumptions.

  • RF2/SF2 is the most commonly used choice?

  • Raising the RF to RF2 or RF3 gives more protection at the cost of more disk space and possibly a minimal amount of CPU and network bandwidth.

  • Most enterprises are not using anything above RF3/SF3.

  • Many of the examples talk about the disk space required, which implies saving disk space by using an SF that is lower than the RF.

Advantages of having SF lower than RF

  • There is a potential to save on disk space (it appears I can get an initial disk saving of approximately 35% if I use RF2,SF1 vs RF2,SF2)?

  • There is a potential to save on disk space for index backups
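
For reference, the 35% figure can be reproduced with a short sketch. It assumes the commonly quoted sizing rule of thumb that compressed rawdata takes about 15% of the ingested volume and index files about 35%; those two ratios are an assumption here (they are not stated in this thread and vary with the data), and the function names are hypothetical.

```python
RAWDATA_PCT = 15  # assumed: rawdata size as % of ingested volume
INDEX_PCT = 35    # assumed: index (tsidx) files as % of ingested volume


def storage_pct_of_ingest(rf: int, sf: int) -> int:
    """Cluster-wide storage as a percentage of the ingested volume."""
    if not 1 <= sf <= rf:
        raise ValueError("need 1 <= sf <= rf")
    # Every copy holds rawdata; only searchable copies hold index files.
    return rf * RAWDATA_PCT + sf * INDEX_PCT


def saving_pct(rf: int, sf_low: int, sf_high: int) -> float:
    """Percentage saved by using sf_low instead of sf_high at the same rf."""
    low = storage_pct_of_ingest(rf, sf_low)
    high = storage_pct_of_ingest(rf, sf_high)
    return 100 * (high - low) / high


for rf, sf in [(2, 1), (2, 2), (3, 1), (3, 2), (3, 3)]:
    print(f"RF{rf},SF{sf}: {storage_pct_of_ingest(rf, sf)}% of ingested volume")

print(saving_pct(rf=2, sf_low=1, sf_high=2))  # RF2,SF1 vs RF2,SF2
```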

Potential Issues/Concerns

  • I assume that if searchable copies have to be rebuilt during a failure (in the scenario where the SF is lower than the RF), there may be a peak in CPU usage on the indexer or indexers holding the replica, while the index files are rebuilt from the event data.

  • Although I may have saved disk space initially by setting the SF lower than the RF, e.g. RF2,SF1, would not all or some of that disk space be consumed during a failure scenario?

My inclination is that it would be a best practice to keep the RF and SF always at the same value as the risk of spiking CPU and/or consuming additional disk during failures or maintenance might be an issue.

Thanks,

Rob
