Deployment Architecture

Splunk Disaster Recovery

gnovak
Builder

I've been researching this topic for a while and am surprised I haven't really found a lot of data on this. I need to come up with a disaster recovery option for when an indexer goes down, due to a hardware failure for instance.

In the case of a failed indexer, questions I have are:

-What to do about forwarders who are forwarding their data to an indexer that goes down? Where does the data go and is there an easy way to tell all these forwarders to go somewhere else?
-Who should pick up the slack if an indexer goes down? How do you sync the databases so that they all have the same data? Can you index the same data to more than one indexer?

Has anyone else given this thought? Comments? Suggestions? I'm aware of the Splunk page regarding backup but didn't really see one for DR.

1 Solution

emiller42
Motivator

There is a reference for High Availability that I think would be a good start here.



bmacias84
Champion

This is from my understanding of Splunk HA, mostly from an app point of view. Sorry if this seems confusing; I might not have all the information in the right order. Also, Splunk 5.x makes this a lot easier.

Q: What to do about forwarders who are forwarding their data to an indexer that goes down? Where does the data go and is there an easy way to tell all these forwarders to go somewhere else?

A: If you only have a single indexer, I would configure two things: index acknowledgement, to prevent in-flight data loss, and a larger maxQueueSize in your outputs.conf. If you are monitoring file or log data, Splunk will continue from its last known point, which is stored in the fishbucket. You will drop streamed TCP events if your queue is not large enough. Queue size can be increased in inputs.conf and outputs.conf.

Index acknowledgement will protect against in-flight data loss when an indexer is in a failed or unusable state. This setting does have performance implications.

Increasing your maxQueueSize will allow your forwarder to hold more events in memory. This can help if you are streaming raw TCP events to a forwarder.
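Putting those two settings together, a minimal outputs.conf sketch on the forwarder might look like the following (the hostname and queue size are placeholder values):

[tcpout]
defaultGroup = primary_indexers
useACK = true

[tcpout:primary_indexers]
server = splunk_indexer1:9997
maxQueueSize = 7MB

With useACK enabled the forwarder holds events until the indexer confirms they were written, and the larger queue gives it more room to buffer while the indexer is unreachable.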

If you have multiple indexers you can configure Splunk’s auto load balancing. This rotates through the configured indexers on a time interval, skipping any indexers that stop responding.
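As a sketch, auto load balancing is enabled on the forwarder by listing more than one indexer in the target group in outputs.conf (the hostnames and 30-second interval here are placeholders):

[tcpout]
defaultGroup = lb_indexers

[tcpout:lb_indexers]
server = splunk_indexer1:9997, splunk_indexer2:9997
autoLB = true
autoLBFrequency = 30

The forwarder switches to a different indexer from the list every autoLBFrequency seconds, and only sends to indexers that are still responding.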

Q: Who should pick up the slack if an indexer goes down?

A: If you are using Splunk’s auto load balancing the remaining Indexers will pick up the slack.

Q: How do you synch the databases so that they all have the same data? Can you index the same data to more than one indexer?

A: Keep in mind Splunk isn't a standard relational database. There are a few answers to this problem, and yes, you can index the same raw data multiple times. To accomplish this you will use a combination of the following concepts: data cloning, load balancing, and data routing on your forwarders, all of which are configured in outputs.conf.

The problem with indexing the same data multiple times is storage and licensing cost (Splunk licensing is based on indexer throughput in MB or GB per day).

You can use data cloning to send copies of events to multiple receiving indexers by configuring outputs.conf on the forwarder. Keep in mind that cloned events will produce similar search results, but are NOT always exact copies.

You can also install multiple instances of Splunk on a single server if you have extra headroom on your servers. Using data distribution, you could have a forwarder send events to two physical servers containing two Splunk instances each: the first physical server would contain splunk_index1_primary and splunk_index2_secondary, and the second physical server would contain splunk_index1_secondary and splunk_index2_primary. On the forwarder you would configure two data cloning groups.

outputs.conf – data cloning with load balancing:


[tcpout]
defaultGroup = cloned_group1,cloned_group2

[tcpout:cloned_group1]
server = splunk_index1_primary:9997, splunk_index2_primary:9997

[tcpout:cloned_group2]
server = splunk_index1_secondary:9997, splunk_index2_secondary:9997

Additional reading:

What is the Fish bucket?

Install multiple Splunk instances on a single machine

Setup load balancing

Data cloning

Protect in-flight data

Data Routing and Filtering

Backups

Configuring outputs.conf

I hope this gets you started or at least helps.

ChrisG
Splunk Employee

Great info, and I do want to emphasize your comment that the new index replication feature of Splunk 5.0 makes this much easier!


chris
Motivator

Hi

emiller42's suggestion is an excellent starting point.
It depends on what DR requirements you have. Do you need to recover after a disaster, or do you have to be disaster tolerant? Do you need historical data at all times, or is it enough if you can keep alerting on the new data that is still being indexed? Or maybe you are OK with a little downtime once in a while. Forwarders will not lose any data if an indexer goes down; they will start sending data again when the indexer becomes available.

These are the first steps I'd take to enhance the resilience of a Splunk installation:

  • Mirror the disks Splunk is indexing to -> a (single) disk failure won't hurt anymore

  • If you have data that is sent to your indexers via syslog, or any data that is not handled by a forwarder (or by a solution which makes sure no data is lost in transit -> listening on UDP ports is bad), write that data to files and index the files. That way you can safely update your indexers

  • Set up more than one indexer and configure the forwarders to do auto load balancing between them (this is easy to set up). If one indexer goes down, only the historical data on that indexer will be unavailable; indexing will carry on, and alerting/searching on recent data still works
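As a sketch of the second point, assuming a syslog daemon such as rsyslog is already writing the feed to files under /var/log/remote (a placeholder path), the indexer's inputs.conf would monitor those files rather than listen on a UDP port directly:

[monitor:///var/log/remote]
sourcetype = syslog
disabled = false

Because the fishbucket tracks how far into each file Splunk has read, the indexer can be restarted or upgraded and will resume from where it left off, so nothing already written to the files is lost.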
