Hi,
We're putting in Splunk in the next few months as part of PCI compliance. I'm just getting the ball rolling and starting my learning cycle, so I'm pretty new to it all.
The first step, which I'm doing now, is to architect our Splunk deployment. Looking around, I'm somewhat baffled to find that there seems to be no way to use shared storage and HA between devices to fail over if a node goes down. What I'm envisioning is iSCSI disk mounts on 2 physical nodes (indexers): one node is active and the other is a standby. If the active goes down, the standby takes over. Is this possible with Splunk?
Assuming no, since I've not read anything about it (why not?! This is basic stuff), it seems like my only other options are an HA license to clone streams to 2 indexers, or to use cluster replication. Both of these options use literally twice as much storage space from my interpretation, which seems to fly right in the face of everything we've learned about de-dupe. I understand this is a performance boon as well, but even at our measly 5GB a day for our first year while we put it in, with a 1-year data retention requirement, that's an extra 1.8TB just to have HA. If we moved to 10 or 20GB/day, it seems like such a waste of storage. I'm just trying to understand it here.
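For what it's worth, the 1.8TB figure works out as straight arithmetic (a back-of-the-envelope sketch only; it ignores Splunk's index compression, which in practice shrinks the on-disk footprint considerably):

```python
# Back-of-the-envelope: extra storage consumed by keeping a full second
# copy of the retained data for HA. Ignores index compression/overhead.
def extra_ha_storage_tb(gb_per_day, retention_days=365, extra_copies=1):
    """Extra TB used by `extra_copies` full replicas of the retained data."""
    return gb_per_day * retention_days * extra_copies / 1024

print(round(extra_ha_storage_tb(5), 2))   # -> 1.78 (TB at 5 GB/day)
print(round(extra_ha_storage_tb(20), 2))  # -> 7.13 (TB at 20 GB/day)
```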
I've thought about some other options and was wondering if anyone had tried these:
Appreciate any help, thanks.
Shared storage clustering is a pretty well-understood concept in the enterprise and Splunk can be run perfectly well in a shared storage cluster scenario. But, you will have to assemble the pieces yourself and be prepared to diagnose/debug shared disk clustering related issues on your own. Splunk (the company) seemingly does not test Splunk (the product) running in this architecture, and the set of customers running in this architecture is likely a small pool.
One reason for this is that Splunk's reference architecture is based on 2U commodity building-block boxes with local storage. They suggest this model because search performance scales horizontally so well in their MapReduce-style search algorithm. ( http://blogs.splunk.com/2009/10/27/add-a-server-or-two/ )
But nothing about Splunk says you cannot use shared storage. Be aware, though, that shared storage instead of local storage can turn into a performance bottleneck when multiple indexers pound on the same iSCSI/FC array. You will want to make sure that there is little, if any, contention for the shared spindles. On busy Splunk indexers (which this doesn't sound like) that can mean dedicating 8 (or more) array spindles to each indexer to ensure sustained throughput of 800 IOPS. (Preferably 1000 IOPS or even more.)
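That "8 spindles per indexer" figure follows from a common rule of thumb of roughly 100 sustained IOPS per fast spindle (an assumption on my part, not a Splunk-published number), so a rough sizing sketch looks like:

```python
import math

# Rough spindle-count sizing for a shared array feeding several indexers.
# Assumption: ~100 sustained IOPS per 10k/15k RPM spindle (rule of thumb).
IOPS_PER_SPINDLE = 100

def spindles_needed(indexers, iops_per_indexer=800):
    """Dedicated spindles required so each indexer gets its IOPS budget."""
    return math.ceil(indexers * iops_per_indexer / IOPS_PER_SPINDLE)

print(spindles_needed(1))        # -> 8  (one indexer at 800 IOPS)
print(spindles_needed(2, 1000))  # -> 20 (two indexers at 1000 IOPS each)
```

The point is that the spindle count on the shared array has to grow with the number of indexers, or they contend with each other.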
For customers who have followed the building-blocks with local storage approach, Splunk's index replication clustering gives them good enough availability of the data for search purposes without a whole lot of added cost.
Shared storage is usually orders of magnitude more expensive per GB than local. By the time you pay for the shared array (with sufficient spindles to avoid disk contention), the interconnect (iSCSI is not as bad as FC), the HBAs, and the management and configuration overhead, you could have bought one, two, or three more indexers and had plenty of local storage to do clustering.
If your shared storage is an existing sunk cost, that will change the economic calculations some. If your company's IT group is dead set on buying enterprise shared storage for any and all applications and you have no choice but to comply, then the extra 1.8TB required to avoid a shared storage cluster is probably a substantial cost.
Splunk's software architecture tries to help you avoid those costs where they can be avoided. But if you are in a shop where you must use shared storage, and 1.8TB of shared storage is more expensive than a 2U machine with 12x 300GB drives in a RAID 10, then a shared storage cluster makes sense. But you'll have to be prepared to roll your own. Of the roll-your-own options, your #1 and #2 options make the most sense to me.
And, of course, a shared storage cluster could have a failure in the storage itself, knocking you out entirely. I once had an Oracle cluster using shared disk get wrecked because of corruption of the data in the RAID itself. A tornado came near the site and caused a power surge that affected both power feeds going into the disk array. It was a bad day.
I thought I'd come back here and mention that I've since built our Splunk cluster using Splunk's own index replication clustering. What I missed in the clustering documentation is that you can use a replication factor and search factor to cut down on the number of nodes that keep copies of the data, and that some of those copies are stored in a much more compact form, on top of Splunk's already good compression.
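For anyone following along, those knobs live in server.conf (a minimal sketch; the factor values and port are example choices, the master hostname is made up, and exact setting names can vary between Splunk versions, so check the docs for your release):

```ini
# On the cluster master (server.conf) - example values
[clustering]
mode = master
replication_factor = 2   ; copies of the raw data kept across peers
search_factor = 1        ; copies that also keep the larger search index files

# On each peer indexer (server.conf)
[replication_port://9887]

[clustering]
mode = slave
master_uri = https://cluster-master.example.com:8089
```

The gap between the replication factor and the search factor is where the savings come from: copies beyond the search factor hold only the compressed raw data, not the index files.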
Also, I was told by a consultant that if you set up active/passive, when you fail over to the new node it has to run a series of checks against your data before starting up, which slows down the failover greatly.
Yeah, a shared storage cluster would be entirely active/passive. While both nodes would have access to the shared disk, only one could have it mounted at a time, and only one could be running Splunk at once. Your cluster monitor software (cman + rgmanager on RHEL, for instance) would do the work of mounting the filesystems, floating an IP address, and stopping/starting Splunk on the 'active' node. But, as you suggested, it may be cleaner to run Splunk in a VM and let the hypervisor handle HA.
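In rough terms, the failover sequence the cluster manager performs amounts to something like the following (a hypothetical sketch only; the device path, service IP, and interface name are made-up examples, and in real life rgmanager drives this from its cluster configuration rather than a script like this):

```shell
#!/bin/sh
# Sketch of what the cluster manager does when this node takes over.
# DRYRUN=1 (the default here) only prints the commands instead of running them.
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Mount the shared iSCSI LUN on the node taking over
run mount /dev/mapper/splunk-lun /opt/splunk

# 2. Float the service IP address to this node
run ip addr add 192.0.2.10/24 dev eth0

# 3. Start Splunk on the now-active node
run /opt/splunk/bin/splunk start
```

The hard part isn't these three steps; it's fencing, so the old active node can never have the filesystem mounted at the same time.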
Hmm, I must admit I'd not looked at the reference hardware, as I already know what we need to use: namely, spare blades from our last refresh and space on one of our SANs. The 1.8TB now is not such a huge deal, but consider that part of our backup strategy is offsite replication (via the SANs), so there's another copy; then simply running 2 servers is another copy again; then you up our volume to 10GB/day, 20GB/day, etc. When you say Splunk can be run perfectly well in a shared storage environment, what are you referring to, may I ask - active/passive, or..? Can you elaborate, please?
BOOYAH!!!!