Since information on High Availability at the Search Head level is a little hard to come by, I wanted to summarize what I found in different threads here and discuss a few solutions I have come up with.
What is the Problem?
Search Head pooling requires shared storage via NFS or CIFS (aka Samba). Neither option is really highly available. According to http://splunk-base.splunk.com/answers/28245/what-happens-when-nas-used-for-search-head-pooling-goes-... it will stop working if the storage goes down, so it is not true HA.
What about host based mirroring?
That was also my first thought, but the file system needs to be writable by all members of the pool, which requires a cluster file system. No luck here, according to http://splunk-base.splunk.com/answers/57439/does-splunk-support-search-head-pooling-via-clustered-st...
This is where I'd like to get your input. First of all, in case of a storage failure, both of the options below would mean that all alerts are sent twice.
My next thought was manually syncing the directories via a cron job. Unison would probably be the best choice here, but it will be a pain to configure for more than two heads. In that case I'd probably have SH1 sync at 5, 15, 25, etc. minutes past the hour and SH2 at 0, 10, 20, etc. This means there will be a delay, and some alerts will be sent twice during regular operation too.
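The staggered schedule above could be sketched as crontab entries like the following. This is just a sketch; the pool path and the SSH-reachable host names are hypothetical:

```shell
# On SH1: sync at 5,15,25,... minutes past the hour
5-55/10 * * * * /usr/bin/unison -batch /opt/splunk/etc ssh://sh2//opt/splunk/etc

# On SH2: sync at 0,10,20,... minutes past the hour
0-50/10 * * * * /usr/bin/unison -batch /opt/splunk/etc ssh://sh1//opt/splunk/etc
```

Running `unison -batch` makes it resolve non-conflicting changes unattended; real conflicts would still need manual attention, which is part of why this gets painful with more than two heads.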
Last but not least, I have a more scalable, but also more experimental idea. What about mounting the NFS share over a local copy of the same directory? This of course still requires synchronization, but it's only NFS->local, so rsync should do. In Linux pseudo-shell, it would look like this:
# bind-mount the local copy into place first
mount -o bind /local/copy /mnt/shpooling
# then mount the NFS share on top of it
mount nfsserver:/mnt/shpooling /mnt/shpooling
If the storage goes down, the SH should be able to work on the local copy reasonably well.
Any other ideas or comments about existing ones? Already tried it? Did it work or not?
EDIT: What happens in case of a failover? Does Splunk reread the configuration if the share is not available for a few minutes?
I'll have to choose one of these options within the next few weeks, so I'll be able to try out whatever we think up soon 🙂
Hmmm... I'm not sure if I got your question right, but is there some problem with using e.g. a Linux cluster to provide an HA NFS service for your search heads? Just google "HA NFS" and you'll find many examples of how to configure a pair of Linux servers to provide a highly available NFS service.
So what is the requirement for failover time and what is your NFS client timeout now?
Long time no reply, I know. I was told the failover time needs to be lower than your NFS timeout. During a failover the file system will be unresponsive and your SH might hang.
Don't worry about failover. It's taken care of by the NFS client and server; Splunk doesn't know anything about it. It's just a filesystem like any other. In the worst case there might be a temporary hiccup, but once your NFS server is back in business things will continue as normal.
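How long the client blocks during such a hiccup is governed by the NFS client mount options. A hedged example (server name and path are placeholders):

```shell
# "hard" means the client retries forever: no I/O errors or data loss,
# but processes touching the mount block until the server returns.
# timeo is in tenths of a second, so timeo=600 waits 60s between
# retransmissions; retrans is the retry count per major timeout.
mount -t nfs -o hard,timeo=600,retrans=2 nfsserver:/mnt/shpooling /mnt/shpooling
```

With `hard` mounts a clean failover should look to Splunk like a slow filesystem rather than a failed one, which matches the "temporary hiccup" behavior described above.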
Thanks once more for the reply.
Do you have any experience with how Splunk handles a failover? It looks like the connection to the share would be interrupted for at least a few seconds (heartbeat timeout, starting the process, taking over the IP address, relearning routes in the worst case, ...).
You'll need only 2 hosts for HA NFS, and those can serve any number of search heads. So your number of hosts would double only if you have a minimal configuration of 2 search heads + 2 NFS servers (if that makes you feel any better). Running HA NFS (and a Linux cluster) on your search heads is not a good idea.
Thanks for the idea. I read https://help.ubuntu.com/community/HighlyAvailableNFS on this topic.
About the actual idea: this would mean roughly doubling the number of servers, unless you set it up on the Search Heads themselves, i.e. an NFS server on both nodes, sharing a DRBD/GFS-mirrored mount point.
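The DRBD half of such a pair could be sketched roughly like this. All host names, IPs, and devices here are hypothetical, and this is only the replication piece; you'd still need a cluster manager to fail over the NFS service and IP:

```
# /etc/drbd.d/shpool.res -- mirror the pool's backing block device
resource shpool {
    protocol C;                  # synchronous replication
    device    /dev/drbd0;
    disk      /dev/sdb1;         # hypothetical backing partition
    meta-disk internal;
    on nfs1 { address 10.0.0.1:7789; }
    on nfs2 { address 10.0.0.2:7789; }
}

# /etc/exports on whichever node is currently primary
/mnt/shpooling  10.0.0.0/24(rw,sync,no_subtree_check)
```

Protocol C writes are acknowledged only after both nodes have them, which is what you want here: after a failover the surviving node's copy of the pool is guaranteed current.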
I also added one more question to the post above.