I have a theoretical question about multisite indexer clustering.
site_replication_factor is how many copies of the raw (unsearchable) data are kept within the cluster, and site_search_factor is how many copies of the searchable data (which also contains the raw data) are kept. Given that, could I set up an environment with a configuration such as:
[clustering]
mode = master
multisite = true
available_sites = site1,site2,site3,site4
site_replication_factor = origin:1,site4:1,total:2
site_search_factor = origin:1,site4:0,total:3

[clustering]
mode = master
multisite = true
available_sites = site1,site2,site3,site4
site_replication_factor = origin:1,site4:1,total:2
site_search_factor = origin:1,site1:1,site2:1,site3:1,total:3
The objective is to have a designated site that only stores the raw (unsearchable) data and therefore wouldn't be searched or used for anything else, while the three other sites are set up in a more standard configuration, where each has a copy of its own raw data and a distributed copy of the searchable data.
I can't find anywhere in the documentation that says whether you can specify site4:0 to prevent searchable data from being replicated to a specific site.
If the above works, it would minimize the copies of raw (unsearchable) data within the cluster (saving space) while ensuring there is always a site with a full backup of ALL raw data from around the cluster, which could be used to rebuild ALL indexed data in the event of an extensive disaster.
First, you are a bit mixed up in your definition of the search factor. The search factor can never be larger than the replication factor. The search factor defines how many of the replicated buckets will be searchable. The search factor is not "added to" the replication factor. I think of it this way
Now, I think you can do what you want, but the syntax needs to be something like this:
available_sites = site1,site2,site3,site4
site_replication_factor = origin:1,site4:1,total:3
site_search_factor = origin:1,site4:0,total:3
This would force site4 to have only non-searchable buckets - but it would have a copy of all the rawdata in case of a disaster.
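To make the rule that the search factor can never exceed the replication factor concrete, here is a minimal Python sketch that compares the two settings key by key. The helper names parse_factor and search_factor_violations are made up for this illustration and are not part of Splunk, and the check is deliberately simplified: it only compares keys that appear in both settings.

```python
def parse_factor(value):
    """Parse 'origin:1,site4:1,total:3' into {'origin': 1, 'site4': 1, 'total': 3}."""
    result = {}
    for part in value.split(","):
        key, _, count = part.strip().partition(":")
        result[key.strip()] = int(count)
    return result


def search_factor_violations(replication, search):
    """Return the keys whose search factor exceeds the replication factor."""
    rf = parse_factor(replication)
    sf = parse_factor(search)
    # Simplification: keys that appear in only one of the settings are skipped.
    return [key for key in sf if key in rf and sf[key] > rf[key]]


# Values from the question: total search factor 3 exceeds total replication factor 2.
print(search_factor_violations("origin:1,site4:1,total:2",
                               "origin:1,site4:0,total:3"))  # ['total']

# With total:3 in both settings, this simplified check finds nothing to report.
print(search_factor_violations("origin:1,site4:1,total:3",
                               "origin:1,site4:0,total:3"))  # []
```

Run against the values from the original question, the sketch flags total as the offending key; with total:3 in both settings it returns an empty list.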
BTW, I assume that you have no forwarders sending data to site4. If you do, then site4 will be the origin site for some data, and therefore there will be searchable buckets at site4.
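The point about site4 becoming an origin site can also be sketched in Python. This rests on an assumption worth verifying against the Splunk documentation: that a site which is both listed explicitly and acts as the origin for a bucket gets the larger of the two values. The names parse_factor and explicit_copies are hypothetical helpers, and the sketch ignores how the remaining copies up to total are distributed.

```python
def parse_factor(value):
    """Parse 'origin:1,site4:0,total:3' into {'origin': 1, 'site4': 0, 'total': 3}."""
    result = {}
    for part in value.split(","):
        key, _, count = part.strip().partition(":")
        result[key.strip()] = int(count)
    return result


def explicit_copies(factor, origin_site):
    """Copies guaranteed per site for a bucket that originated at origin_site."""
    parsed = parse_factor(factor)
    origin_count = parsed.get("origin", 0)
    copies = {site: n for site, n in parsed.items()
              if site not in ("origin", "total")}
    # Assumption: a site that is both explicit and the origin gets the larger value.
    copies[origin_site] = max(copies.get(origin_site, 0), origin_count)
    return copies


search_factor = "origin:1,site4:0,total:3"

# Data forwarded to site1: site4 holds no searchable copy.
print(explicit_copies(search_factor, "site1"))  # {'site4': 0, 'site1': 1}

# Data forwarded to site4: the origin value wins, so site4 holds a searchable copy.
print(explicit_copies(search_factor, "site4"))  # {'site4': 1}
```

Under that assumption, site4:0 only keeps site4 free of searchable buckets as long as no data originates there.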
I do have to ask though: why have so many sites if you aren't going to have at least one copy of the buckets (searchable or not) at each site? Since sites are purely defined by you, and not actually tied to geography, I would only have 2 sites in your example.
available_sites = site1,site2
site_replication_factor = origin:1,site2:1,total:3
site_search_factor = origin:1,site2:0,total:3
Finally, like you, I did not find anything in the documentation that said whether the search factor could be explicitly set to zero for a site.
Thanks for the clarification. I did understand that, just didn't do a good job of explaining it!
I'll give this a try at some point.
With regard to why so many sites, this was just an example.
Did you get anywhere with this?
I'm also interested in what you are suggesting as we might have legal issues if some data leaves certain countries.
The way we initially planned to approach this was to designate a pair of heavy forwarders (HFs) per country that would perform forwarding, local indexing, and filtering. If we ever needed to search country-sensitive data, we could go to the local HF and use the GUI there, as the data wouldn't be searchable from anywhere else.