Update: using a freshly installed Splunk instance made the difference, instead of setting it as a Master through the clustering GUI:
I guess I will have to create a fully automated solution using Linux's Heartbeat and a bunch of scripts, for example ;(
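If the Heartbeat route is taken, the glue would presumably be an LSB-style resource script that Heartbeat (in haresources mode) calls with start/stop/status on whichever node takes over. This is only a minimal sketch of that idea; the function name, the `/opt/splunk` default path, and the flags are assumptions, not something tested against a real Heartbeat cluster:

```shell
#!/bin/sh
# Sketch of an LSB-style resource script for Heartbeat (haresources mode).
# Heartbeat invokes it with start/stop/status on failover.
# SPLUNK_HOME is an assumed default path.
SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}

splunk_master_resource() {
    case "$1" in
        start)
            # The node taking over becomes the cluster master; the static
            # config (server.conf, master-apps) must already be in sync.
            "$SPLUNK_HOME/bin/splunk" start --accept-license --answer-yes
            ;;
        stop)
            "$SPLUNK_HOME/bin/splunk" stop
            ;;
        status)
            "$SPLUNK_HOME/bin/splunk" status
            ;;
        *)
            echo "Usage: $0 {start|stop|status}" >&2
            return 1
            ;;
    esac
}

# Only dispatch when an action was given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    splunk_master_resource "$1"
fi
```

Keeping the master's static configuration replicated between the two nodes (rsync, DRBD, shared storage) would still be a separate problem this script does not solve.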
I'm testing this master-standby setup in a lab and having problems with the indexers' startup and error messages (at the end).
The article I'm following is in the docs: http://docs.splunk.com/Documentation/Splunk/6.1.3/Indexer/Handlemasternodefailure
Basically, I set up a new server, changed it to the Master role, then stopped the process. I copied master-apps over with rsync, then tested again using tar on the old master and untarring on the new one... same errors.
For the record... with the Search Heads I only needed to point to the new URL through the Settings -> Clustering config.
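For clarity, the manual failover steps described above can be sketched as a script. This is a minimal sketch, assuming default paths; the host name, the DRY_RUN guard, and the `run` helper are illustrative additions, not from the original procedure:

```shell
#!/bin/sh
# Hypothetical sketch of the manual standby-master failover described
# above. Host name and paths are assumptions.
SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}
OLD_MASTER=${OLD_MASTER:-oldmaster.example.com}
DRY_RUN=${DRY_RUN:-1}   # set to 0 to actually execute the commands

run() {
    # Echo instead of executing while DRY_RUN=1, so the steps can be
    # reviewed safely before running them for real.
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Stop Splunk on the standby before copying state over.
run "$SPLUNK_HOME/bin/splunk" stop

# 2. Copy the master-apps bundle from the old master.
run rsync -a "$OLD_MASTER:$SPLUNK_HOME/etc/master-apps/" \
            "$SPLUNK_HOME/etc/master-apps/"

# 3. The [clustering] stanza and pass4SymmKey in server.conf must match
#    the old master as well -- copying master-apps alone is not enough.
run rsync -a "$OLD_MASTER:$SPLUNK_HOME/etc/system/local/server.conf" \
            "$SPLUNK_HOME/etc/system/local/server.conf"

# 4. Start Splunk on the standby; peers re-register once they are
#    pointed at the new master's URI.
run "$SPLUNK_HOME/bin/splunk" start
```

Whether this addresses the startup errors in the logs below is a separate question; the sketch only captures the copy-and-restart sequence.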
10-02-2014 15:24:26.434 -0300 INFO CMBundleMgr - Downloaded bundle path=/opt/splunk/var/run/splunk/cluster/remote-bundle/4c74b4f7cd208b4fd98ca7698c1a77db1412274266.bundle time_taken_ms=9.
10-02-2014 15:24:26.434 -0300 INFO CMBundleMgr - untarring bundle=/opt/splunk/var/run/splunk/cluster/remote-bundle/4c74b4f7cd208b4fd98ca7698c1a77db-1412274266.bundle
10-02-2014 15:24:26.441 -0300 INFO ClusterBundleValidator - Validating bundle path=/opt/splunk/var/run/splunk/cluster/remote-bundle/4c74b4f7cd208b4fd98ca7698c1a77db-1412274266/apps
10-02-2014 15:24:26.460 -0300 INFO CMBundleMgr - Removed the untarred bundle folder=/opt/splunk/var/run/splunk/cluster/remote-bundle/4c74b4f7cd208b4fd98ca7698c1a77db-1412274266
10-02-2014 15:24:26.460 -0300 INFO CMBundleMgr - Removed the bundle downloaded from master to '/opt/splunk/var/run/splunk/cluster/remote-bundle/4c74b4f7cd208b4fd98ca7698c1a77db-1412274266.bundle'
I think the answer here is that Splunk needs to enhance the HA features of the cluster master, have a look at my older question here;
There is some good detail in the answers there from other people.
We experience the same Cluster Master recovery time as Dolxor: about 3-4 hours for 8 indexers of 800 GB each. It takes about 30-45 minutes per cluster peer to finish its processing, and the Cluster Master processes each one sequentially.
The cluster master keeps only the metadata, so in order to bring up a secondary master we need some basic static files like server.conf. All other information, such as indexes, buckets etc., can be dynamically rebuilt by the secondary master. More on this is documented here
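To illustrate the "basic static files" point: the standby master essentially needs the same [clustering] stanza (and the same pass4SymmKey) in server.conf as the original master had. A hedged sketch; the serverName and the factor values below are placeholders, not taken from this thread:

```
# $SPLUNK_HOME/etc/system/local/server.conf on the standby master
[general]
serverName = cluster-master

[clustering]
mode = master
replication_factor = 3
search_factor = 2
pass4SymmKey = <same key as on the original master>
```

The peers and search head then only need the master_uri in their own [clustering] stanzas pointed at the standby's management port (e.g. https://new-master:8089), which matches the "just point to the new URL" observation in the question above.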
Dolxor, I'm surprised that it took 4-6 hours for the cluster master to catch up. If all the peers are available and you just restarted the cluster master, it should get back to a normal state in a few minutes. If you notice this again, please open a support case so we can get to the bottom of the issue.
What Splunk should allow is for us to use the same license and index the data to one local and one off-site Splunk cluster with the same hardware. If your local Cluster Master dies horribly, your search head could use the off-site Cluster Master and the off-site cluster to allow continued operations while you fix your local Cluster Master.
You can do that today; it is possible. But the cost is double, and for some of us cost is a reality we cannot ignore.
Clustering is always just one step in reducing your exposure to failures. In practice, a cluster will only protect you from technical hardware failures. It does nothing to prevent logical corruption or other logical issues within the application or its logic. If you would like full disaster tolerance (physical and logical) for your entire Splunk setup, good luck... I doubt it is possible at all. Maybe by combining various techniques with true non-stop hardware clusters (HP NonStop / Himalaya type equipment) it could be done, but the risk vs. benefit vs. cost calculation will be a hard sell within any organisation. Perhaps Splunk US has some reference architecture documentation about the HA solution that Apple is using.
If your (primary) Cluster Master dies horribly, would all this fix it? The new hot-standby (or whatever) Cluster Master then needs to take over the role. Last time this happened to us (the primary Cluster Master was fixed and booted up again), it took about 4-6 hours of nervous waiting for the cluster master to 'fix' things in the cluster before everything was neat and nice again.
A hot-standby cluster master would still need to do all this 'fixing' before you can use the cluster for searching (indexing was still OK during our downtime). So you do not fix the underlying problem, and the cluster master remains a very serious single point of failure in a Splunk cluster scenario.
If you are looking for a highly available solution without OS clustering, there is also the option of running a VM. If your virtualization platform is already highly available and you have the option of using it, why not host the cluster manager role there? Running a hot-standby VM is possible with VMware ESX, Hyper-V, XenServer, and I would assume most other solutions as well... And don't forget: Splunk failover functionality is impacted by the loss of the cluster manager, but no 'real' functionality is lost with a momentary loss of the CM. So a hot or warm standby CM would probably be enough to meet most HA demands.
The cluster manager only handles some basic management tasks, so there are no real heavy loads. Since combined setups are supported by Splunk, it would not be a problem to host only your Cluster Manager on the FO-Cluster and run all 'real' loads on a different OS. Since I am a former MS employee, Windows clustering is within my comfort zone, and thus I suggested it. In all fairness, though, if you don't have in-depth Windows knowledge and you are already used to working with some other OS, it would of course make more sense to just run a cluster solution based on your own preferred OS. I am assuming that the concept of a FO-Cluster in Windows has an equivalent solution (functionality-wise) in the OS you are intending to use.
Martijn (Omni-it is just my company name. )
I have a 10-year background with Windows clustering, and have mixed feelings.
Both about the performance we got from Windows as the base OS for Splunk, and about the complexity that clustering in Windows adds.
Second, we were hit by a software issue that created a race-like condition in Splunk.
I'm looking for good designs/ideas that might address this.
We would be equally stuck with OS clustering.
Thanks for all input so far.
You could set up a Windows failover cluster specifically for this role and define the Splunk process as a custom service. It will then fail over when your primary node or a dependency (network/disk) fails.
Anyone really running it on Windows boxes? Wow..
Well, thanks for pointing out that it is possible.
Does Windows clustering handle file locking, and would it handle a bug in the cluster manager?
Windows clustering has so many possible ways to fail, and adds way too much complexity.
I did consider something like a Beowulf cluster, but dropped it.
Is there no architectural way to design ourselves out of this?