Deployment Architecture

Splunk high availability without data duplication

yspendiff1stop
Explorer

Hi,

We're putting in Splunk in the next few months as part of PCI compliance. I'm just getting the ball rolling and starting up my learning curve, so I'm pretty new to it all.

The first step, which I'm working on now, is to architect our Splunk deployment. Looking around, I'm somewhat baffled to find that there seems to be no way to use shared storage and HA between devices to fail over if a node goes down. What I'm envisioning is iSCSI disk mounts on 2 physical nodes (indexers): one node is active and the other is a standby, and if the active goes down the standby takes over. Is this possible with Splunk?

Assuming not, as I've not read anything about it (why not?! This is basic stuff), it seems like my only other options are an HA license to clone streams to 2 indexers, or to use cluster replication. Both of these options use literally twice as much storage space by my interpretation, which seems to fly right in the face of everything I've learned about de-dupe. I understand it's a performance boon as well, but even on our measly 5GB a day for our first year while we put it in, with a 1-year data retention requirement that's an extra 1.8TB just to have HA; if we moved to 10GB or 20GB a day it seems like such a waste of storage. I'm just trying to understand it here.
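
For context, my rough understanding of the cloned-streams option is that the forwarder is just configured with two target groups, so every event is sent to both indexers. A minimal outputs.conf sketch of that (host names and ports below are placeholders, not our actual config) would be something like:

    [tcpout]
    # listing two groups here clones every event to both
    defaultGroup = clone_group_a, clone_group_b

    [tcpout:clone_group_a]
    server = indexer-a.example.com:9997

    [tcpout:clone_group_b]
    server = indexer-b.example.com:9997

Which is exactly why it doubles the storage: both indexers end up holding a full copy of the data.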

I've thought about some other options and was wondering if anyone had tried these:

  1. Build a 2-host VMware cluster and put the indexer on that. If we want to add another indexer then we add another host to the cluster (so N+1, basically). That way the indexer has dedicated resources but is redundant against hardware failure.
  2. Use Heartbeat or some other open source HA software to monitor the process and fail it over. It just seems strange to use old-school open source tooling to provide HA for a product as developed as Splunk.
  3. Use our hardware load balancers (F5s) to essentially make 1 server active and only send traffic to the other if the first goes down. But what happens if I'm running 2 instances of Splunk pointing to the same indices without proper shared-storage clustering software, even if only one is reading/writing at a time? Would that cause issues?

Appreciate any help, thanks.

1 Solution

dwaddle
SplunkTrust

Shared storage clustering is a pretty well-understood concept in the enterprise and Splunk can be run perfectly well in a shared storage cluster scenario. But, you will have to assemble the pieces yourself and be prepared to diagnose/debug shared disk clustering related issues on your own. Splunk (the company) seemingly does not test Splunk (the product) running in this architecture, and the set of customers running in this architecture is likely a small pool.

One reason for this is that Splunk's reference architecture is based on 2U building-block commodity boxes with local storage. They suggest this model because of how well search performance scales horizontally with their map-reduce-style search algorithm. ( http://blogs.splunk.com/2009/10/27/add-a-server-or-two/ )

But nothing about Splunk says you cannot use shared storage. Be aware, though, that shared storage instead of local storage can turn into a performance bottleneck because of multiple indexers pounding on the same iSCSI/FC array. You will want to make sure that there is little, if any, contention for the shared spindles. On busy Splunk indexers (which this doesn't sound like) that can mean dedicating 8 (or more) array spindles to each indexer to ensure sustained throughput of 800 IOPS (preferably 1,000 IOPS or even more).
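
(As rough arithmetic: a 10k or 15k RPM spindle is good for somewhere on the order of 100-150 random IOPS, so 8 dedicated spindles is about where you reach that 800-1,000 IOPS range.)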

For customers who have followed the building-blocks with local storage approach, Splunk's index replication clustering gives them good enough availability of the data for search purposes without a whole lot of added cost.

Shared storage is usually orders of magnitude more expensive by the GB than local. By the time you pay for the shared array (with sufficient spindles to avoid disk contention), the interconnect (iSCSI is not as bad as FC), the HBAs, and the management and configuration overhead - you could have bought one, two, or three more indexers and had plenty of local storage to do clustering.

If your shared storage is an existing sunk cost, that will change the economic calculations some. If your company's IT group is dead set on buying enterprise shared storage for any and all applications and you have no choice but to comply - then the extra 1.8TB required to avoid a shared storage cluster is probably a substantial cost.

Splunk's software architecture tries to help you avoid those costs where they can be avoided. But if you are in a shop where you must use shared storage, and 1.8TB of shared storage is more expensive than a 2U machine with 12x 300GB drives in RAID 10, then a shared storage cluster makes sense. You'll just have to be prepared to roll your own. Of the roll-your-own options, your #1 and #2 make the most sense to me.
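
(For the arithmetic on that box: 12 x 300GB drives in RAID 10 is six mirrored pairs, or roughly 1.8TB usable - which is why it lines up with the 1.8TB figure above.)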

And, of course, a shared storage cluster could have a failure in the storage itself, knocking you out entirely. I once had an Oracle cluster using shared disk get wrecked because of corruption of the data in the RAID itself. A tornado came near the site and caused a power surge that affected both power feeds going into the disk array. It was a bad day.

yspendiff1stop
Explorer

I thought I'd come back here and mention that I've since built our Splunk cluster using Splunk clustering. What I'd missed in the clustering documentation is that you can use the replication and search factors to cut down the number of nodes that hold copies of the data, and that some of those copies are kept in a much more compressed form, on top of Splunk's already good compression.
Also, I was told by a consultant that if you set up active/passive, when you fail over to the new node it has to run a series of checks against your data before starting up, which slows down the failover greatly.
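
In case it helps anyone later, the heart of the clustering setup ends up in server.conf on the master and on each peer. This is just an illustrative sketch from memory - the hostnames, ports, factors, and key below are placeholders, not our production values:

    # On the cluster master
    [clustering]
    mode = master
    replication_factor = 2
    search_factor = 1

    # On each indexer (peer)
    [replication_port://9887]

    [clustering]
    mode = slave
    master_uri = https://splunk-master.example.com:8089
    pass4SymmKey = changeme

With a search factor lower than the replication factor, the extra copies are kept as raw data only rather than fully searchable buckets, which is where the space savings come from.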

dwaddle
SplunkTrust

Yeah, a shared storage cluster would be entirely active/passive. While both nodes would have access to the shared disk, only one could have it mounted at a time ... and only one could be running Splunk at once. Your cluster monitor software (cman + rgmanager on RHEL, for instance) would do the work of mounting the filesystems, floating an IP address, and stopping/starting Splunk on the 'active' node. But, as you suggested, it may be cleaner to run Splunk in a VM and let the hypervisor handle HA.
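
If it helps to picture it, here is roughly what those resources might look like with a Pacemaker-based stack (pcs) instead of cman/rgmanager. The IP, device path, and mount point are placeholders, it assumes you've run 'splunk enable boot-start' so an lsb:splunk init script exists, and I haven't tested this exact recipe with Splunk:

    # floating service IP that follows the active node
    pcs resource create splunk_vip ocf:heartbeat:IPaddr2 ip=192.0.2.50 cidr_netmask=24 --group splunk
    # the shared iSCSI/FC LUN mounted at SPLUNK_HOME on the active node only
    pcs resource create splunk_fs ocf:heartbeat:Filesystem device=/dev/mapper/splunk_lun directory=/opt/splunk fstype=ext4 --group splunk
    # the splunk init script installed by 'splunk enable boot-start'
    pcs resource create splunk_svc lsb:splunk --group splunk

Putting all three in one group makes the cluster mount the LUN, float the IP, and start Splunk together on whichever node is active, and move them as a unit on failover.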

yspendiff1stop
Explorer

Hmm, I must admit I'd not looked at the reference hardware, as I already know what we need to use: namely, spare blades from our last refresh and space on one of our SANs. The 1.8TB right now is not such a huge deal, but consider that part of our backup strategy is offsite replication (via the SANs), so that's another copy; then simply running 2 servers is another copy again; and then you up our license to 10GB/d, 20GB/d, etc. When you say Splunk can be run perfectly well in a shared storage environment, what are you referring to, may I ask? As in active/passive, or... can you elaborate please?

RicoSuave
Builder

BOOYAH!!!!
