
Bucket recovery from external backup and network bandwidth

SplunkTrust

Hi all,

Assuming a worst-case scenario: almost all my data is gone because a monkey misconfigured the index retention policies and that gets replicated everywhere.

Fine, there is another backup outside of Splunk's hands and we can restore all our warm and cold buckets (not hot buckets, ok) into one of the indexers and then let Splunk handle the replication.

Great, now the problem is, let's assume our warm and cold buckets are 10 or 20 or 100 TB in size. So Splunk is going to start replicating all of that across the network in order to meet our replication factor:

  • Is it possible to define some throttling for that replication so that it doesn't kill the network link? I know it might then take days to fully replicate, but that's better than impacting other production services.
  • Will Splunk create associated indexes for the RAW data first and then replicate that over the network? Will it just replicate the RAW as I would expect it to do in a normal scenario, and then each indexer would build its local indexes?

Any other comments would be great.

Thanks,
J

Motivator

J,
I am pretty sure that this will work - mainly because I ran it by my Splunk instructor who also does professional services. 🙂

If you restore warm or cold buckets they will replicate, but thawed data does NOT replicate. Buckets don't really change during the warm to cold to frozen journey - they are just moved from one directory to another. (The .tsidx files notwithstanding.)

If you restore them into thawed instead of warm or cold then they will not replicate at all after you thaw them. Now, of course your retention scheme is out the window and you will have to manage that manually, but your inter-site links will not be overwhelmed. If you really want the buckets at a second site, just repeat the process there.

http://docs.splunk.com/Documentation/Splunk/6.3.2/Indexer/Restorearchiveddata : "Data does not get replicated from the thawed directory. So, if you thaw just a single copy of some bucket, instead of all the copies, only that single copy will reside in the cluster, in the thawed directory of the peer node where you placed it."

SplunkTrust

Hi,

Thank you so much for asking.

The only limitation with that approach is the number of backups you need to maintain, which is basically one per indexer, increasing the storage costs drastically in a highly distributed environment. I guess that's still better than having network link issues, but I was still hoping to be able to apply a bit more control from Splunk itself on the way buckets are replicated.

For instance, if you have 20 indexers across 4 sites with a search replication factor of 1,1,1,1 (which means 1 fully searchable copy of your data per site):

  • I just need to back up 5 indexers (25% of total) from 1 site in order to protect my infrastructure against human error.
  • If I skip replication and use the thawed directory, I then need to maintain backups on every single indexer across my 4 sites and then restore each and every one of them individually.

In summary, good if there's no other way but expensive.


SplunkTrust

Hi team,

Would anyone from Splunk be kind enough to take a look at this?
I was unable to find any official statement or documentation about it.

Thanks,
Javier


Motivator

I don't know about within Splunk, but QoS on the network could throttle it. (I'm not a network engineer, but have had them do similar things before.)

Another option might be to restore them into thawed at each indexer and script a thawing on each site....but that's just a wild hunch.

SplunkTrust

Apparently my networking team cannot apply QoS, so any thoughts on Splunk-native throttling?

Thanks,
J


Motivator

I keep coming back to this question....it is a very good question!

I should have asked to clarify, but are you multi or single site?

I can't find anything specifically on throttling replication from within Splunk, but maybe you could trick it depending on your settings.

Let's pretend you are multi-site and these are your factors:

site_replication_factor = origin:1,site1:1,site2:1,site3:1,total:3
site_search_factor = origin:1,site1:1,site2:1,site3:1,total:3

If you restored to site1 then you could change the factors to:

site_replication_factor = origin:1,site1:1,site2:1,site3:0,total:2
site_search_factor = origin:1,site1:1,site2:1,site3:0,total:2

After site2 was replicated you could change site3 back to 1.
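
For context, the temporary factors above might sit in server.conf on the cluster master roughly as follows. This is only a sketch under assumptions: the stanza layout and the mode, multisite, and available_sites lines are not from this thread, and it has not been verified here that a value of 0 is accepted for a site.

```ini
# server.conf on the cluster master (sketch; layout is an assumption)
[clustering]
mode = master
multisite = true
available_sites = site1,site2,site3
# Temporary factors copied from the post above: site3 is left out
# while site2 catches up on the restored buckets.
site_replication_factor = origin:1,site1:1,site2:1,site3:0,total:2
site_search_factor = origin:1,site1:1,site2:1,site3:0,total:2
# Once site2 has its copies, set site3 back to 1 and total back to 3.
```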

A little awkward perhaps, but it would let you pick and choose where to replicate if your original factors were high enough. It would be a little trickier on a single site, but you could lower both replication_factor and search_factor if they were originally high.

Any chance you could just back up each indexer separately to avoid such replication hazards?


SplunkTrust

Hi, thanks for your help.

We'll be running a multisite cluster: America, EMEA, APAC.

Your suggestion about the site replication factors is good in principle, but I don't think it is going to help here. It would force us to stop replication from one of the sites for days or even weeks until all the data has been replicated, and when that site is back online it will start replicating in both directions (new data from site 1 plus whatever was restored from the backup on site 2).

It also doesn't prevent us from killing any single link. I guess we wouldn't be killing two links at the same time, but with no throttling we could still kill each link independently while replication to that site is enabled.

I couldn't find anything in the docs either so I was wondering if anyone from Splunk might know something else?

Thanks,
J


Motivator

J, what do you think about this in server.conf?

max_peer_rep_load = <integer>
* This is the maximum number of concurrent non-streaming
  replications that a peer can take part in as a target.
* Defaults to 5.
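
If that setting does turn out to help, a minimal sketch of lowering it might look like the following, assuming it belongs in the [clustering] stanza of server.conf on the cluster master. The value 1 is purely illustrative. Note that it caps the number of concurrent replications per target peer, so it is not a direct bandwidth limit.

```ini
# server.conf (assumed location: cluster master, [clustering] stanza)
[clustering]
# The spec quoted above gives a default of 5; 1 is an illustrative value.
# This limits concurrency per target peer, not KB/s on the link.
max_peer_rep_load = 1
```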

Motivator

There is also this in limits.conf, but it seems much less applicable.

maxKBps = <integer>
* If specified and not zero, this limits the speed through the thruput
  processor to the specified rate in kilobytes per second.
* To control the CPU load while indexing, use this to throttle the number of
  events this indexer processes to the rate (in KBps) you specify.
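
For reference, a sketch of how this setting is usually written, assuming the [thruput] stanza of limits.conf; the rate shown is illustrative. Nothing in this thread confirms that it has any effect on bucket replication between peers.

```ini
# limits.conf (sketch)
[thruput]
# Illustrative cap of 1024 KB/s through the thruput processor.
# Per the spec text above, the limit only applies when the value is not zero.
maxKBps = 1024
```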


SplunkTrust

Hi, I use maxKBps a lot when restricting traffic through heavy and universal forwarders, but I don't know if this setting applies to bucket replication at all.

I've never used max_peer_rep_load before, so I can't really comment on it. I guess if I limit it to 1 it might alleviate the problem temporarily, but to be honest I don't know what the impact would be. I'll do some research on it, so thanks for that.

I wish I could give you another vote for all your effort but I can't 🙂


Motivator

LOL maybe the admins will give me some karma love. LOL

This question is really relevant to what we're doing, too, so if I figure anything out I will let you know.

SplunkTrust

Hi, thanks for that. Our network engineers should be able to apply QoS but I would still like to know whether this is possible from Splunk itself as it would give us a lot more control overall.
