Deployment Architecture

Backing up a Distributed Splunk Installation

otes
Explorer

Hi,

I am working on a distributed Splunk installation:

(1) Cluster Master (+License Master) VM
(1) Deployment Server (+Monitoring Console) VM
(1) Search Head VM
(3) Indexer VMs (in a cluster)

My question concerns backing up the data on each non-indexer VM and restoring it if needed.

I am striving for a high-availability Splunk installation. I understand I cannot run more than one CM or DS, so I am running only one of each. Therefore, if one of these VMs should fail I want it to auto-recover quickly.

(I am aware that I could run a Search Head cluster but the Search Head is not a critical part of the Splunk installation)

My installation is in Amazon Web Services (AWS).

The usual way to provide HA in AWS for a stateful VM that cannot be run in multiplicity for redundancy is to run the single VM in an auto-scaling group of size 1. If the single VM fails, it is re-launched from a VM template. The template's start-up script re-installs the Splunk server (of the appropriate type) automatically, just as it was originally. This mechanism is working fine.

However, over the course of using and administering Splunk, the state (e.g., the .conf files) will change, and an auto-recovery will lose those state changes.

Therefore, I would like to back up all critical state in a way that is easy to auto-recover.

I have found these references:

(1) http://docs.splunk.com/Documentation/Splunk/7.1.0/Admin/Backupconfigurations
(2) http://docs.splunk.com/Documentation/Splunk/7.1.0/Indexer/Handlemasternodefailure
(3) https://docs.splunk.com/Documentation/Splunk/7.1.0/Admin/BackupKVstore

My questions are:

(A) Is the SPLUNK_HOME/etc directory in fact all I need to back up and recover for the CM, LM, DS and MC functions? Reference (1) implies this, but reference (3) implies something else also needs to be backed up, so I'm worried I've missed yet another thing that needs to be backed up.

(B) For recovery, after installing Splunk but before starting it, is it OK to delete the entire SPLUNK_HOME/etc directory and replace it with the backed-up /etc directory contents (and then start Splunk)?
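
For concreteness, the recovery step I have in mind looks roughly like this (bucket, paths, and the service user are placeholders for my environment, assuming SPLUNK_HOME=/opt/splunk):

    # sketch: replace the freshly installed etc with the backed-up copy
    /opt/splunk/bin/splunk stop                 # in case the install's start-up already ran
    rm -rf /opt/splunk/etc
    mkdir -p /opt/splunk/etc
    s3cmd sync s3://my-splunk-backup/cm/etc/ /opt/splunk/etc/
    chown -R splunk:splunk /opt/splunk/etc      # assuming Splunk runs as the "splunk" user
    /opt/splunk/bin/splunk start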

(C) The recovery may be to a new VM - for example, one with a new IP address. Do any of the backed-up contents need to be edited during recovery?

For example:

  • deploymentclient.conf contains a "clientName = " entry that I assume needs to be updated if the machine's hostname changes?

  • server.conf contains the "site" setting. The recovered VM may be in a new site, so I assume the "site" setting needs to be updated? (Both settings are sketched below.)
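
For reference, these are roughly the stanzas I mean (the values shown are placeholders from my install, not Splunk defaults):

    # deploymentclient.conf on a deployment client
    # (clientName would presumably need updating if the hostname changes)
    [deployment-client]
    clientName = splunk-host-01

    # server.conf on a multisite cluster member
    # (site may need updating if the VM comes up in another site)
    [general]
    site = site1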

Note that my inter-Splunk-VM URLs all use dynamic DNS names that will not change in a recovery scenario.

I've only noted above the values I set during install. I can't know everything Splunk may write to these files over months/years of operation, and I cannot realistically go through the 3100+ files in the /etc directory to find every occurrence of a value that may be server/IP specific. Hence my question, even though I found the above two examples myself.

(D) Regarding reference (3): does this need to be done if I am already backing up the /etc directory? If so, does it only need to be done on the Search Head? And if I periodically run the "backup kvstore" command and save the output, can I just use "restore kvstore" on the replacement VM, prior to starting the Splunk service, to restore?
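
In other words, something along these lines (archive name is a placeholder; authentication options are omitted, and I have not verified whether the restore can run before splunkd is fully up):

    # on the Search Head, periodically:
    /opt/splunk/bin/splunk backup kvstore -archiveName kv_backup
    # the archive should land under /opt/splunk/var/lib/splunk/kvstorebackup/ if I read reference (3) correctly

    # on the replacement VM, during recovery:
    /opt/splunk/bin/splunk restore kvstore -archiveName kv_backup.tar.gz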

(E) Reference (2) indicates that on the CM one must preserve the server.conf/sslConfig/sslPassword value during a restore. Does this only apply to the CM? I think every server has this value - it seems to be auto-generated on first run. Why must this value be preserved? I am restoring all of the cert files in /etc, which means none of the ones generated by the new Splunk installation get used (they are replaced with the old certs), so a new sslPassword for a private key file seems useless, and in fact wrong, once the old certs are restored.

Thank you for your help!

1 Solution

adonio
Ultra Champion

hello there,

I will try to answer in more detail later; for now:
(A) Yes, you only need the etc directory; run splunk diag to back up your configurations. It also seems there are new functionalities around this in 7.1. The KV store (if you use it) will be on the SH and the MC, if any.
(B) Yes.
(C) If you can avoid recovering to a new VM it will be best, especially if it's the CM. Otherwise, you will have to modify configurations on the new VM as well as on the machines that "lived" - examples: distsearch.conf, server.conf (for licensing), and more.
(D) Can you elaborate?
(E) Yes, maintain the pass4SymmKey for connecting the recovered CM.

On a general note, I would try to avoid changing the VMs of the CM, DS, and LM (in your case the same machine) and rather recover the bad one if possible, as their "down" state does not affect Splunk's operation:
CM down = no replication and no bundle pushes -> once up, resync and all good
DS down = no app pushes -> once up, all good
LM down = no counts and a warning (you have 72 hours to fix it) -> once up, all good.

Hope it helps for now.

otes
Explorer

Thank you, adonio.

OK, sounds like /etc/* and the KV Store are all that need to be backed up. Can I assume the KV Store is the same on the SH and MC, and so I should only back up one? And would a restore be to just one (and they will sync), or is it more complicated than that?

You mention using "splunk diag" for backup. I prefer to do differential backups (using s3cmd) and so do not plan on using "splunk diag". Do you think this is OK, or did you mention "splunk diag" because it is necessary?
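
For what it's worth, the backup side is just an incremental sync of the configuration directory, something like this (bucket/prefix are placeholders):

    # differential/incremental backup of the configuration state
    s3cmd sync --delete-removed /opt/splunk/etc/ s3://my-splunk-backup/cm/etc/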

If you can avoid recovering to a new VM it will be best

Sure, but it is a scenario that must be planned for. A corrupt machine cannot be recovered as-is, and restoring from a full machine backup is typically a human-speed recovery. I want a machine-speed recovery, which involves automation that replaces the machine with an equivalent. And in the case of a site failure, one cannot use the same machine with the same IP address (different sites of course have different subnet ranges, and even different regions in the case of AWS).

I have a License Master, and all machines simply refer to the LM in their server.conf.

Really, this whole difficulty seems due to Splunk not using a database server for its configuration like other enterprise tools. But that's a different topic. 🙂

(D) Can you elaborate?

This was in regard to the KV Store backup. I didn't know whether it was redundant with respect to backing up the /etc directory, or whether the KV Store is not within /etc and so must also be backed up using the command to export it.

Assuming it must also be backed up, my plan is to have a cron job run the KV Store export command, with the output file being part of my backup. Upon restore, the automation would import the KV Store backup file using the corresponding CLI command before starting the Splunk service on the recovered machine.
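
Roughly like this, for illustration (schedule, paths, and bucket are placeholders, and authentication handling is omitted):

    # crontab entry on the Search Head: nightly KV Store dump, shipped alongside the /etc backup
    # (a previous archive with the same name may need removing first)
    30 2 * * * /opt/splunk/bin/splunk backup kvstore -archiveName kv_nightly && s3cmd put /opt/splunk/var/lib/splunk/kvstorebackup/kv_nightly.tar.gz s3://my-splunk-backup/sh/kvstore/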

(E) Yes, maintain the pass4SymmKey for connecting the recovered CM.

The question was with regard to server.conf/sslConfig/sslPassword and not pass4SymmKey. The backup/restore of /etc would recover the pass4SymmKey value. Reference (2) implies that upon recovery one should not use the backed-up sslPassword value but instead the new value from the freshly installed replacement Splunk installation, which seems wrong to me since it would not match the backed-up/recovered certs.

On a general note, I would try to avoid changing the VMs of the CM, DS, and LM (in your case the same machine) and rather recover the bad one if possible, as their "down" state does not affect Splunk's operation:
CM down = no replication and no bundle pushes -> once up, resync and all good
DS down = no app pushes -> once up, all good
LM down = no counts and a warning (you have 72 hours to fix it) -> once up, all good.

I have to disagree in the general case.

CM Down: Newly launched hosts (running the Splunk Universal Forwarder) cannot start sending data until the CM is recovered, assuming the "indexer discovery" model is used. That is why I am not using that model, but rather dynamic DNS, so that hosts connect directly to the indexers without the CM being required.
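
That is, the forwarders get a plain outputs.conf along these lines (hostnames are placeholders for my dynamic DNS names), rather than an indexer-discovery stanza that depends on the CM:

    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = idx1.example.internal:9997, idx2.example.internal:9997, idx3.example.internal:9997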

DS Down: Newly launched hosts (running the Splunk Universal Forwarder) cannot obtain their apps and outputs.conf files, and so will not start sending the correct data until the DS is recovered. Another comment on this topic implies one can run multiple redundant DSs side by side behind a load balancer, although I've not found a reference to that in the Splunk documentation, and in general the Splunk documentation indicates it is not compatible with redundancy behind a load balancer. And if one has 2 DSs, how does one modify a deployment app or server classes - manually on both? I don't see evidence of them auto-syncing such information.

The example case is a site failure. If the site containing the DS and some hosts fails, any hosts auto-recovering at an alternate site ahead of the DS will not start sending data. I'm sure they buffer, but that information will not be usable in Splunk, and if a host restarts before the DS comes up, my guess is the data is lost. That seems unacceptable in an HA enterprise environment.

If the Splunk DS function were run like a normal web site/service (multiple stateless servers behind an LB sharing their state in a redundant back-end database server) then it would be HA.

Thank you for sharing your knowledge, which has provided me insight and gotten me farther. Any elaboration or clarification on the above would also be appreciated!

Regards,

Ryan

adonio
Ultra Champion

You are most welcome, Ryan, and thanks for the detailed comment.
Apologies for not addressing your comments in the original order, but hopefully this will shed some more light.

In the case of a site failure (where the site, let's say, contains the DS, the CM, and some hosts), the hosts that come back up will have their last configurations and will send data to the indexers as expected.
As for launching new hosts while the site is down, I agree; however, why would you launch new hosts when the DS is down? In any case, if they have the relevant deploymentclient.conf they will attempt to connect and will "grab" the relevant outputs when the DS is back up.
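
Meaning, as long as they have something like this in deploymentclient.conf (hostname is a placeholder), they will keep phoning home and pick up their apps once the DS returns:

    [target-broker:deploymentServer]
    targetUri = deploymentserver.example.internal:8089
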
As for the KV Store, they are not the same; the MC has its own.
I think the main challenge with switching machines is the need to modify configurations in 2 places at the same time - first on the new machine and then, on some occasions, on other machines as well. For example:
CM down -> new CM (not the same machine) -> the SH (for example) has to be directed to the new CM so it can find the indexers to search -> modify server.conf on the SH to point to the new CM
If you recover the old machine, repeat the process.
As for diag, you don't have to go that route; an incremental backup of the /etc/ directory will do. I think, though, that the new feature in 7.1 lets you easily configure the rotation of the diags to save disk space and maintain an updated inventory, etc.
When you say "I want a machine-speed recovery, which involves automation that replaces the machine with an equivalent," my understanding is that you need high availability to a very high degree, almost 0 RTO. You can achieve this with the standby cluster master (you already have the link in your question), and probably very simply with the MC, as you will only need to modify its own files - server.conf, distsearch.conf - and then apply the settings and forwarder settings, which I think you can do via REST.
I think your question raises a broader question regarding the HA/DR strategy of your cloud install. I am not sure which provider you use, but IIRC most of them have backup/replication mechanisms you can pay for, so you can have the exact same machine in a different zone/data center.
It might be worthwhile to check what the cloud provider can offer here.

Regards,

Ari Donio

otes
Explorer

why would you launch new hosts when the DS is down?

In a large-scale system, new VMs are always being launched. Many teams, many applications, many asynchronous deployments/maintenance jobs... I can't "pause" our whole company's operations because a single Splunk server is having a bad day. 🙂

As for the KV Store, they are not the same; the MC has its own.

OK, thank you.

CM down -> new CM (not the same machine) -> the SH (for example) has to be directed to the new CM so it can find the indexers to search -> modify server.conf on the SH to point to the new CM

The SH only knows the CM's URL and pass4SymmKey. The CM's DNS record will auto-update to point to the replacement CM, and the pass4SymmKey stays the same. I don't believe that, upon replacing one node, any of the other nodes' configurations need to be updated. This is due to using DNS (not hard-coding IPs) and updating the DNS records if a machine's IP changes.
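
For example, the SH's clustering stanza only references the CM by DNS name (hostname and key shown here are placeholders), so nothing in it has to change when the CM VM is replaced:

    # server.conf on the Search Head
    [clustering]
    mode = searchhead
    master_uri = https://cluster-master.example.internal:8089
    pass4SymmKey = <the unchanged shared key>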

As for diag, you don't have to go that route; an incremental backup of the /etc/ directory will do.

Super, thanks for confirming.

I think your question raises a broader question regarding the HA/DR strategy of your cloud install. I am not sure which provider you use, but IIRC most of them have backup/replication mechanisms you can pay for, so you can have the exact same machine in a different zone/data center.

I think this is the "golden machine" recovery model: always be able to get the bits back, because they are special and can never be re-created. They have untold history from lots of hands SSH'ing in and doing unique things. I don't follow this model. 🙂 The problem is that some day you have to re-create the machine on a newer OS, or build a second one for testing/staging, and it cannot practically be done.

I believe a more robust model is that a machine can be re-composed from scratch, because its configuration exists in repeatable processes (CloudFormation, Dockerfiles, Chef, Puppet, etc.). Instances are disposable: re-create them if needed. This scales (organizationally) better.

But I digress. You are probably correct that Splunk favors the "golden machine" model, as it isn't stateless with a database server; rather, it squirrels away its ever-changing, critical state in the 3100+ files of its /etc directory and doesn't allow multiple redundant machines behind a load balancer for some of its functions.

I think I have gained a lot, and most of what I was hoping for, from this thread. Thank you.

xpac
SplunkTrust

Hey, just a quick piece of advice:
There can be multiple deployment servers. You would, however, need to put them behind some kind of load balancer, and also make sure to sync their config, but besides that, it's possible.
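
As a rough sketch of the sync part, one (unofficial) way would be to push the server classes and deployment apps from a designated primary DS to its peer (hosts and paths are placeholders):

    # one-way sync from the "primary" DS to a second DS
    rsync -az /opt/splunk/etc/system/local/serverclass.conf ds2.example.internal:/opt/splunk/etc/system/local/
    rsync -az --delete /opt/splunk/etc/deployment-apps/ ds2.example.internal:/opt/splunk/etc/deployment-apps/
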
I don't have the time or enough knowledge to answer all your other questions, but I'll leave you an upvote, because you wrote a very precise and well-done question. Thanks!

otes
Explorer

Thank you, xpac. I didn't realize that one could run multiple Deployment Servers behind a load balancer. It is good to know that is an option.

xpac
SplunkTrust

There are things to consider, but it is possible.
