Splunk Enterprise

Disaster Recovery for Three Search Head SHC

lenpistoria
Loves-to-Learn

We have a Splunk v9.1.1 cluster with a three-member search head cluster (SHC) running on EC2 instances in AWS. To implement disaster recovery (DR) for the SHC, I configured AWS Autoscaling to replace the search heads on failure. Unfortunately, Autoscaling does NOT reuse the failed instance's IP on the new instance, probably because it also has to support scaling up and down. So a replacement instance will always have a different IP than the failed instance it replaces.

Starting with a healthy cluster with an elected search head captain and RAFT running, I terminated one search head. During the minute or two that it took AWS Autoscaling to replace the search head instance, RAFT stopped and there was no captain. I was then unable to add a NEW third search head to the cluster.

OK, so then I created a similar scenario, but this time I had Autoscaling issue the commands to force one of the remaining two search heads to be an UN-ELECTED static captain. I confirmed this had worked: I had two search heads, one of them the captain. The Splunk documentation mentions using a static captain for DR. However, when I again tried to add the new instance as the third search head, I again received the error that RAFT was not running, there was no cluster, and therefore the member could not be added!
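For reference, the static captain step described above can be sketched as follows. This is a minimal sketch that only builds the command lines and does not run them. The `splunk edit shcluster-config -mode ... -captain_uri ... -election false` form is how I understand the documented static captain procedure, so check it against the docs for your version. The host names and the helper name `static_captain_command` are hypothetical.

```python
import shlex

def static_captain_command(captain_uri, is_captain):
    """Build the CLI command that switches one SHC member to static captain mode.

    Assumption: the flags follow the static captain procedure as I understand
    it from the Splunk docs; verify them for your version before use.
    """
    mode = "captain" if is_captain else "member"
    return [
        "splunk", "edit", "shcluster-config",
        "-mode", mode,
        "-captain_uri", captain_uri,
        "-election", "false",
    ]

# Hypothetical management URI of the surviving member chosen as static captain.
CAPTAIN_URI = "https://sh1.example.internal:8089"

# Run the first command on the chosen captain, the second on every other member.
on_captain = static_captain_command(CAPTAIN_URI, is_captain=True)
on_member = static_captain_command(CAPTAIN_URI, is_captain=False)

print(shlex.join(on_captain))
print(shlex.join(on_member))
```

The captain and the other members get the same `-captain_uri`; only `-mode` differs between them.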

So what is Splunk's recommendation for Disaster Recovery in this situation? I understand this is a chicken-and-egg scenario, but how are you expected to recover if you can't get a third search head in place in order TO recover? It seems counter-intuitive that Splunk would disallow adding a third search head, especially with the static search head captain in place.

There are some configurable timeout parameters in server.conf in the [shclustering] stanza. Would increasing any of these values keep the SHC in place long enough for Autoscaling to replace the third search head instance, so that it can then join the SHC? If so, which timeouts should I use, and which values would be appropriate so that they don't interfere with day-to-day usage?
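To make the question concrete, here is a minimal sketch of the kind of override I mean, rendered as a server.conf stanza. It assumes the search head clustering stanza is named `[shclustering]` and that `election_timeout_ms` and `heartbeat_timeout` are among its settings, which is my reading of the server.conf spec. The values shown are placeholders, not recommendations, and I don't know whether changing them helps here. The helper name `render_shclustering` is hypothetical.

```python
import configparser
import io

def render_shclustering(overrides):
    """Render a [shclustering] stanza containing the given settings as text."""
    parser = configparser.ConfigParser()
    parser["shclustering"] = {key: str(value) for key, value in overrides.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()

# Placeholder values only; the setting names are assumptions to verify
# against the server.conf spec for your Splunk version.
stanza = render_shclustering({
    "election_timeout_ms": 60000,  # milliseconds
    "heartbeat_timeout": 60,       # seconds
})
print(stanza)
```

Any such override would have to be identical on all members, which is one reason I'm wary of tuning these without guidance.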

I'm stuck on this and haven't been able to progress any further. Any and all help is greatly appreciated!
