Assuming a worst-case scenario: almost all my data is gone because a monkey misconfigured the index retention policies and that gets replicated everywhere.
Fine, there is another backup outside of Splunk's hands and we can restore all our warm and cold buckets (not hot buckets, ok) into one of the indexers and then let Splunk handle the replication.
Great, now the problem is, let's assume our warm and cold buckets are 10 or 20 or 100 TB in size. So Splunk is going to start replicating all of that across the network in order to meet our replication factor:
Any other comments would be great.
I am pretty sure that this will work - mainly because I ran it by my Splunk instructor who also does professional services. 🙂
If you restore warm or cold buckets they will replicate, but thawed data does NOT replicate. Buckets don't really change during the warm to cold to frozen journey - they are just moved from one directory to another. (The .tsidx files notwithstanding.)
If you restore them into thawed instead of warm or cold then they will not replicate at all after you thaw them. Now, of course your retention scheme is out the window and you will have to manage that manually, but your inter-site links will not be overwhelmed. If you really want the buckets at a second site, just repeat the process there.
http://docs.splunk.com/Documentation/Splunk/6.3.2/Indexer/Restorearchiveddata : "Data does not get replicated from the thawed directory. So, if you thaw just a single copy of some bucket, instead of all the copies, only that single copy will reside in the cluster, in the thawed directory of the peer node where you placed it."
Thank you so much for asking.
The only limitation with that approach is the number of backups you need to maintain, which is basically one per indexer, increasing the storage costs drastically in a highly distributed environment. I guess that's still better than having network link issues, but I was still hoping to be able to apply a bit more control from Splunk itself on the way buckets are replicated.
For instance, if you have 20 indexers across 4 sites with a search replication factor of 1,1,1,1 (which means 1 fully searchable copy of your data per site):
In summary, good if there's no other way but expensive.
I don't know about within Splunk, but QoS on the network could throttle it. (I'm not a network engineer, but have had them do similar things before.)
Another option might be to restore them into thawed at each indexer and script a thawing on each site....but that's just a wild hunch.
I keep coming back to this question....it is a very good question!
I should have asked to clarify, but are you multisite or single-site?
I can't find anything specifically on throttling replication from within Splunk, but maybe you could trick it depending on your settings.
Let's pretend you are multi-site and these are your factors:
site_replication_factor = origin:1,site1:1,site2:1,site3:1,total:3
site_search_factor = origin:1,site1:1,site2:1,site3:1,total:3
If you restored to site1 then you could change the factors to:
site_replication_factor = origin:1,site1:1,site2:1,site3:0,total:2
site_search_factor = origin:1,site1:1,site2:1,site3:0,total:2
After site2 was replicated you could change site3 back to 1.
A little awkward perhaps, but it would let you pick and choose where to replicate if your original factors were high enough. It would be a little trickier on a single site, but you could lower both replication_factor and search_factor if they were originally high.
Any chance you could just back up each indexer separately to avoid such replication hazards?
Hi, thanks for your help.
We'll be running a multisite cluster: America, EMEA, APAC.
Your suggestion about the site replication factors is good in principle, but I don't think it is going to help here. It would force us to stop replication from one of the sites for days or even weeks until all the data has been replicated, and then, when that site is back online, it would start replicating in both directions (new data from site 1 plus whatever was restored from the backup on site 2).
It also doesn't prevent us from killing any single link. I guess we wouldn't be killing two links at the same time, but if there's no throttling we could still kill each link independently while replication is enabled to that site.
I couldn't find anything in the docs either so I was wondering if anyone from Splunk might know something else?
J, what do you think about this in server.conf?
max_peer_rep_load = <integer>
* This is the maximum number of concurrent non-streaming replications that a peer can take part in as a target.
* Defaults to 5.
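For anyone who wants to experiment with this, here is a minimal sketch of how the setting might be written. It assumes the setting belongs in the [clustering] stanza of server.conf on the cluster master, and the value 1 is purely illustrative, not a tested recommendation. Check the server.conf spec for your Splunk version before relying on either assumption.

```ini
# server.conf (sketch; stanza placement is an assumption, verify against the spec)
[clustering]
# Default is 5. A lower value reduces how many non-streaming replications
# a peer accepts as a target at once. 1 is an illustrative value only.
max_peer_rep_load = 1
```

Note that this caps the number of concurrent replications per target peer, not the bandwidth each one uses, so a single replication could presumably still fill a slow link.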
There is also this in limits.conf, but it seems much less applicable.
maxKBps = <integer>
* If specified and not zero, this limits the speed through the thruput processor to the specified rate in kilobytes per second.
* To control the CPU load while indexing, use this to throttle the number of events this indexer processes to the rate (in KBps) you specify.
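As a rough sketch, the limits.conf setting would typically be written as below. The [thruput] stanza is where this setting normally lives, and 1024 is an illustrative value only. Nothing in this thread establishes that the limit affects bucket replication traffic, so treat that part as unverified.

```ini
# limits.conf (sketch; the value is illustrative only)
[thruput]
# Rate limit in kilobytes per second for the thruput processor.
# 0 means no limit.
maxKBps = 1024
```

This throttles data passing through the indexing pipeline on the instance where it is set, which is why it is commonly used on forwarders.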
Hi, I use maxKBps a lot when restricting traffic through heavy and universal forwarders, but I don't know if this setting applies to bucket replication at all.
I've never used max_peer_rep_load before so I can't really comment on it. I guess if I limit that to 1 it might alleviate the problem temporarily, but I don't know what the impact is going to be, to be honest. I'll do some research about it, so thanks for that.
I wish I could give you another vote for all your effort but I can't 🙂
Hi, thanks for that. Our network engineers should be able to apply QoS but I would still like to know whether this is possible from Splunk itself as it would give us a lot more control overall.