Deployment Architecture

Splunk high availability without data duplication

yspendiff1stop
Explorer

Hi,

We're putting in Splunk in the next few months as part of PCI compliance. I'm just getting the ball rolling and starting up my learning curve, so I'm pretty new to it all.

The first step, which I'm working on now, is to architect our Splunk deployment. Looking around, I'm somewhat baffled to find that there seems to be no way to use shared storage and HA between devices to fail over if a node goes down. What I'm envisioning is iSCSI disk mounts on 2 physical nodes (indexers): one node is active and the other is a standby, and if the active goes down the standby takes over. Is this possible with Splunk?

Assuming not, as I've not read anything about it (why not?! This is basic stuff), it seems like my only other options are an HA license to clone streams to 2 indexers, or to use cluster replication. Both of these options use literally twice as much storage space by my interpretation, which seems to fly right in the face of everything I've learned about de-dupe. I understand it's a performance boon as well, but even on our measly 5GB a day for our first year while we put it in, with a 1-year data retention requirement that's an extra 1.8TB just to have HA; if we moved to 10GB or 20GB a day it seems like such a waste of storage. I'm just trying to understand it here.
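
For context, my rough understanding of the cloned-streams option is that the forwarder is just configured with two target groups, so every event is sent to both indexers. A minimal outputs.conf sketch of that (host names and ports below are placeholders, not our actual config) would be something like:

    [tcpout]
    # listing two groups here clones every event to both
    defaultGroup = clone_group_a, clone_group_b

    [tcpout:clone_group_a]
    server = indexer-a.example.com:9997

    [tcpout:clone_group_b]
    server = indexer-b.example.com:9997

Which is exactly why it doubles the storage: both indexers end up holding a full copy of the data.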

I've thought about some other options and was wondering if anyone had tried these:

  1. Build a 2-host VMware cluster and put the indexer on that. If we want to add another indexer then we add another host to the cluster (so N+1, basically). That way the indexer has dedicated resources but is redundant against hardware failure.
  2. Use Heartbeat or some other open source HA software to monitor the process and fail it over. It just seems strange to use old-school open source tooling to provide HA for a product as developed as Splunk.
  3. Use our hardware load balancers (F5s) to essentially make 1 server active and only send traffic to the other if the first goes down. But what happens if I'm running 2 instances of Splunk pointing to the same indices without proper shared-storage clustering software, even if only one is reading/writing at a time? Would that cause issues?

Appreciate any help, thanks.

1 Solution

dwaddle
SplunkTrust

Shared storage clustering is a pretty well-understood concept in the enterprise and Splunk can be run perfectly well in a shared storage cluster scenario. But, you will have to assemble the pieces yourself and be prepared to diagnose/debug shared disk clustering related issues on your own. Splunk (the company) seemingly does not test Splunk (the product) running in this architecture, and the set of customers running in this architecture is likely a small pool.

One reason for this is that Splunk's reference architecture is based on 2U building-block commodity boxes with local storage. They suggest this model because of how well search performance scales horizontally with their map-reduce-style search algorithm. ( http://blogs.splunk.com/2009/10/27/add-a-server-or-two/ )

But nothing about Splunk says you cannot use shared storage. Be aware, though, that shared storage instead of local storage can turn into a performance bottleneck because of multiple indexers pounding on the same iSCSI/FC array. You will want to make sure that there is little, if any, contention for the shared spindles. On busy Splunk indexers (which this doesn't sound like) that can mean dedicating 8 (or more) array spindles to each indexer to ensure sustained throughput of 800 IOPS (preferably 1,000 IOPS or even more).
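
(As rough arithmetic: a 10k or 15k RPM spindle is good for somewhere on the order of 100-150 random IOPS, so 8 dedicated spindles is about where you reach that 800-1,000 IOPS range.)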

For customers who have followed the building-blocks with local storage approach, Splunk's index replication clustering gives them good enough availability of the data for search purposes without a whole lot of added cost.

Shared storage is usually orders of magnitude more expensive by the GB than local. By the time you pay for the shared array (with sufficient spindles to avoid disk contention), the interconnect (iSCSI is not as bad as FC), the HBAs, and the management and configuration overhead - you could have bought one, two, or three more indexers and had plenty of local storage to do clustering.

If your shared storage is an existing sunk cost, that will change the economic calculations some. If your company's IT group is dead set on buying enterprise shared storage for any and all applications and you have no choice but to comply - then the extra 1.8TB required to avoid a shared storage cluster is probably a substantial cost.

Splunk's software architecture tries to help you avoid those costs where they can be avoided. But if you are in a shop where you must use shared storage, and 1.8TB of shared storage is more expensive than a 2U machine with 12x 300GB drives in RAID 10, then a shared storage cluster makes sense. You'll just have to be prepared to roll your own. Of the roll-your-own options, your #1 and #2 make the most sense to me.
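
(For the arithmetic on that box: 12 x 300GB drives in RAID 10 is six mirrored pairs, or roughly 1.8TB usable - which is why it lines up with the 1.8TB figure above.)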

And, of course, a shared storage cluster could have a failure in the storage itself, knocking you out entirely. I once had an Oracle cluster using shared disk get wrecked because of corruption of the data in the RAID itself. A tornado came near the site and caused a power surge that affected both power feeds going into the disk array. It was a bad day.

yspendiff1stop
Explorer

I thought I'd come back here and mention that I've since built our Splunk cluster using Splunk clustering. What I'd missed in the clustering documentation is that you can use the replication and search factors to cut down the number of nodes that hold copies of the data, and that some of those copies are kept in a much more compressed form, on top of Splunk's already good compression.
Also, I was told by a consultant that if you set up active/passive, when you fail over to the new node it has to run a series of checks against your data before starting up, which slows down the failover greatly.
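
In case it helps anyone later, the heart of the clustering setup ends up in server.conf on the master and on each peer. This is just an illustrative sketch from memory - the hostnames, ports, factors, and key below are placeholders, not our production values:

    # On the cluster master
    [clustering]
    mode = master
    replication_factor = 2
    search_factor = 1

    # On each indexer (peer)
    [replication_port://9887]

    [clustering]
    mode = slave
    master_uri = https://splunk-master.example.com:8089
    pass4SymmKey = changeme

With a search factor lower than the replication factor, the extra copies are kept as raw data only rather than fully searchable buckets, which is where the space savings come from.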

dwaddle
SplunkTrust

Yeah, a shared storage cluster would be entirely active/passive. While both nodes would have access to the shared disk, only one could have it mounted at a time ... and only one could be running Splunk at once. Your cluster monitor software (cman + rgmanager on RHEL, for instance) would do the work of mounting the filesystems, floating an IP address, and stopping/starting Splunk on the 'active' node. But, as you suggested, it may be cleaner to run Splunk in a VM and let the hypervisor handle HA.
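
If it helps to picture it, here is roughly what those resources might look like with a Pacemaker-based stack (pcs) instead of cman/rgmanager. The IP, device path, and mount point are placeholders, it assumes you've run 'splunk enable boot-start' so an lsb:splunk init script exists, and I haven't tested this exact recipe with Splunk:

    # floating service IP that follows the active node
    pcs resource create splunk_vip ocf:heartbeat:IPaddr2 ip=192.0.2.50 cidr_netmask=24 --group splunk
    # the shared iSCSI/FC LUN mounted at SPLUNK_HOME on the active node only
    pcs resource create splunk_fs ocf:heartbeat:Filesystem device=/dev/mapper/splunk_lun directory=/opt/splunk fstype=ext4 --group splunk
    # the splunk init script installed by 'splunk enable boot-start'
    pcs resource create splunk_svc lsb:splunk --group splunk

Putting all three in one group makes the cluster mount the LUN, float the IP, and start Splunk together on whichever node is active, and move them as a unit on failover.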

yspendiff1stop
Explorer

Hmm, I must admit I'd not looked at the reference hardware, as I already know what we need to use: namely, spare blades from our last refresh and space on one of our SANs. The 1.8TB right now is not such a huge deal, but consider that part of our backup strategy is offsite replication (via the SANs), so that's another copy; then simply running 2 servers is another copy again; and then you up our license to 10GB/d, 20GB/d, etc. When you say Splunk can be run perfectly well in a shared storage environment, what are you referring to, may I ask? As in active/passive, or... can you elaborate please?

RicoSuave
Builder

BOOYAH!!!!
