Splunk 6.1 multi-site cluster not replicating or working as intended?

1StopBloke
Explorer

Hi,

I moved to a multi-site cluster yesterday and I'm not entirely sure that replication is actually working within the cluster. Either it isn't, or the splunk commands aren't playing nicely with the new multi-site cluster feature.

This is my clustering stanza in server.conf on the master:

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:2, site1:1, site2:1, total:3
site_search_factor = origin:1, site1:1, site2:1, total:2
pass4SymmKey = <REDACTED>
search_factor = 2
replication_factor = 3

I have 2 peers in each of 2 sites, with 1 search head in each site. All of the Splunk servers in the cluster are assigned sites in server.conf. I want to have a full searchable copy in each site for search affinity, thus the site_search_factor above.
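
For readers unfamiliar with the setting, a site assignment of this kind is normally made with the site attribute in the [general] stanza of server.conf. The fragment below is only an illustrative sketch, not the poster's actual file; the site value is a placeholder.

```ini
# server.conf on a peer or search head located in site1 (illustrative only)
[general]
site = site1
```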

I suppose the first thing I should say is that I'm getting my information from the splunk show cluster-status --verbose command or from the cluster settings page on the master.

When all 4 peers are up, my search factor is met, but all indices except 2 only reach 2/3 for replication factor. Each of the others has between 1 and 8 buckets missing a third copy, and it never catches up. If I take down 1 peer in any site, my search factor goes to 1/2 for some portion of the indices and never recovers. The replication factor in this case will either stay at 2/3 or go to 1/3; it varies.
What makes me think the tools may be behaving strangely is that it never recovers, even though there are no replication errors in splunkd.log (although I'm not sure whether fix-up activity produces replication messages at all). If I then bring the downed node back up and take down the other node in that site, I get the same result.

Maintenance mode is off on the master.

If it's any help, when I upgraded to a multi-site cluster I made a mistake and didn't enable maintenance mode on the master before bringing up each peer (I ran the command but didn't notice it asked for a login). I'm not sure if that's broken something.

While I'm at it, does anyone know when the splunk remove excess-buckets command will be enabled for multi-site clusters? I think I've pretty much got a searchable copy on every peer by now.

1 Solution

dxu_splunk
Splunk Employee

This could be from your migrated non-multisite buckets. We try not to replicate the old buckets (from non-multisite) across sites, but instead leave them on the site we think they originated from. They follow the old "replication_factor" + "search_factor" instead of the new "site_replication_factor" + "site_search_factor".

If your replication_factor is 3 or more (the default is 3), then since you only have 2 peers per site, a migrated bucket that stays within its site can never get a third copy. Can you try changing "replication_factor" to 2? This would solve your replication_factor never being met (if this was the issue). The issue of taking down a peer and never recovering search_factor would also be explained by these migrated buckets.
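
A minimal sketch of that change, assuming the rest of the [clustering] stanza on the master stays exactly as posted in the question and only replication_factor differs. Changes to server.conf generally need a restart of the master to take effect.

```ini
[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:2, site1:1, site2:1, total:3
site_search_factor = origin:1, site1:1, site2:1, total:2
pass4SymmKey = <REDACTED>
# Applies to the migrated (pre-multisite) buckets. Lowered from 3 so that
# the 2 peers in a single site can satisfy it.
replication_factor = 2
search_factor = 2
```

The site_replication_factor and site_search_factor lines are untouched, so buckets created after the migration should keep following the multisite policy.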

Please see http://docs.splunk.com/Documentation/Splunk/6.1/Indexer/Migratetomultisite#How_the_cluster_migrates_....

1StopBloke
Explorer

That seems to have done the trick. Funnily enough, it now shows as 3/3 replicas (based on the cluster rep factor, I guess). If I switch back to a group rep factor of 3, the problem comes back. Thanks for your help.
