Deployment Architecture

Splunk deployment in AWS with elastic search heads and indexers

Log_wrangler
Builder

I am planning to deploy Splunk (distributed search) in AWS.
Has anyone tested and/or verified whether it is possible to set up a deployment so that additional search heads spin up as more users need access to a search head? For example, I have a total of 10 users, and after 2 users log on to a search head, the next user logs on to a new search head instance elastically created in AWS. In this example there would be 2 users per SH, and if all 10 users were using Splunk there would be 5 search heads elastically deployed. As users logged off, search head instances would shut down until one search head remained.

Is this possible? If so, where might I find documentation on this?

Thank you

1 Solution

nickhills
Ultra Champion

There is some complexity in doing what you're proposing - and whilst I think it's possible, you can pretty much rule out it being supported.

One of the challenges is that a Search Head Cluster performs an election to choose a captain. The issue when scaling down is identifying which node is the captain before letting your ASG randomly choose a box to terminate and send the cluster into a flap/re-election. It is, however, possible to target a specific machine for retirement via the AWS API, but you are going to need some custom API tooling to choose the right node.
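To make the "don't terminate the captain" rule concrete, here's a minimal sketch in Python. It assumes you have already parsed the SHC members REST endpoint into plain dicts; the `label`, `is_captain`, and `active_searches` field names are placeholders for illustration, not the real response schema:

```python
def pick_retirement_candidate(members):
    """Given SHC members (already parsed from the cluster REST endpoint),
    return one that is safe to terminate, i.e. not the current captain.
    Field names here are illustrative placeholders."""
    candidates = [m for m in members if not m["is_captain"]]
    if not candidates:
        raise RuntimeError("no non-captain members available to retire")
    # Prefer the member with the fewest running searches to minimise impact.
    return min(candidates, key=lambda m: m.get("active_searches", 0))
```

The chosen node's instance ID would then be passed to the AWS API (e.g. a targeted `terminate-instance-in-auto-scaling-group` call) rather than letting the ASG pick at random.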

Secondly, because a user's interaction with a SHC is not stateless, when you retire a box any users on that system are going to suffer interrupted searches. To work around this, you would need to query the Splunk REST API and make sure the given box has no active users before targeting it for retirement. As you have probably guessed, this pretty much rules out using the built-in AWS ASG scale-down features - you could write all the scale-down logic yourself based off a CloudWatch metric, but it needs careful thought.
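A sketch of that "no active users" gate, again assuming you've already fetched per-member counts from the REST API (the `active_users`/`active_searches` keys are placeholder names, not real Splunk fields):

```python
def safe_to_retire(member):
    """A member is safe to retire only when no users are logged in and no
    searches are running, per counts fetched from the Splunk REST API."""
    return member["active_users"] == 0 and member["active_searches"] == 0

def retirement_targets(members):
    """Return the labels of all members currently safe to terminate."""
    return [m["label"] for m in members if safe_to_retire(m)]
```

Your scale-down script would only act when this list is non-empty; if it stays empty, the cluster simply doesn't shrink that cycle.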

Scaling up is probably only slightly easier - if you have a CloudWatch metric to base your scale-up event on, the ASG can help you out here. Except that to scale by the number of active Splunk users, you will probably want to create a custom CloudWatch metric which you populate from the results of the Splunk REST API with the number of users per SH - again non-trivial, but an alert with a scripted action to register the CW metric could probably help with this.
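The scale-up arithmetic itself is simple. Here's a sketch of mapping an active-user count onto a target search head count - the value you'd publish as your custom CloudWatch metric for the scaling policy to follow (the 2-users-per-SH ratio is just the example from the question, not a sizing recommendation):

```python
import math

def desired_search_heads(active_users, users_per_sh=2, minimum=1):
    """Map a count of active Splunk users onto a target number of search
    heads, never dropping below `minimum` so one SH always remains."""
    return max(minimum, math.ceil(active_users / users_per_sh))
```

The scripted alert action would compute this and push it via a `PutMetricData`-style call; the ASG's policy then tracks the published value.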

On scale-up, the new system is going to need a bootstrap process to register the new SH with the pool - a bash script could be used to add the new node (you will probably need a dynamic CNAME so that the SHC address is predictable). But beware: you will need to manage admin credentials here, and writing them into your script is not ideal.
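As a sketch of what that bootstrap does, the snippet below builds (rather than runs) the `splunk init shcluster-config` and `splunk add shcluster-member` CLI calls a new member issues to join. Hostnames, ports, and the cluster label are placeholders, and note that credentials are deliberately not inlined - fetch them at runtime (e.g. from a secrets store) instead of baking them into the script:

```python
def bootstrap_commands(new_mgmt_uri, existing_member_uri, shcluster_label,
                       replication_port=9200):
    """Build the two Splunk CLI calls a freshly launched SH runs to join
    an existing SHC.  Auth flags are intentionally omitted: credentials
    should be injected at runtime, never hardcoded in the script."""
    init_cmd = [
        "splunk", "init", "shcluster-config",
        "-mgmt_uri", new_mgmt_uri,
        "-replication_port", str(replication_port),
        "-shcluster_label", shcluster_label,
    ]
    join_cmd = [
        "splunk", "add", "shcluster-member",
        "-current_member_uri", existing_member_uri,
    ]
    return [init_cmd, join_cmd]
```

The second command is why the predictable cluster address matters: the new box needs a known URI for an existing member to join through.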

As I touched on, you will need to manage a predictable address for the cluster - Route 53 could help here, but again you will need an AWS API call to register (and purge) records from DNS.
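The Route 53 API takes a ChangeBatch document for both registration and purging. A sketch of building one for a cluster-member CNAME (zone name and target are made up for illustration); the same structure with `"DELETE"` removes the record on scale-down:

```python
def route53_change(action, record_name, target, ttl=60):
    """Build the ChangeBatch payload for a boto3-style
    change_resource_record_sets() call.  `action` is "UPSERT" to
    register a record or "DELETE" to purge it."""
    return {
        "Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }
```

A short TTL (60s here) keeps stale entries from lingering after a node is retired.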

Now, assuming you were going to use an ELB to front the SHC, you will need to be careful that the ELB does not evict nodes before they have fully started - they go through a few restarts during registration as they join the cluster, install apps, and register with the deployer. The last thing you want is the ELB dropping your new SH before it has started up properly.
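One way to avoid that is a generous health-check grace period plus a lenient probe. A sketch of the relevant knobs, expressed as the parameters you'd hand to the ASG and target-group APIs - the 15-minute grace and the login-page probe path are assumptions to tune for your environment, not recommendations:

```python
def health_check_settings(grace_seconds=900):
    """Health-check settings that give a new SH time to restart a few
    times while joining the cluster before the ELB may fail it out.
    All values are illustrative starting points."""
    return {
        # Auto Scaling group: trust the ELB's view of health, but ignore
        # it entirely for the first `grace_seconds` after launch.
        "asg": {
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": grace_seconds,
        },
        # Target group: probe Splunk Web and require several consecutive
        # successes before marking the node healthy.
        "target_group": {
            "HealthCheckPath": "/en-US/account/login",
            "HealthyThresholdCount": 3,
            "UnhealthyThresholdCount": 5,
            "HealthCheckIntervalSeconds": 30,
        },
    }
```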

It could be a fun project, but I suspect the complexity involved could quickly surpass the cost benefits of only running one node per 2 users.

With all of the above said, a single SH can support way more than 2 users (if sized properly), so without doubt your best bet will be to size your SHC based on your anticipated number of users and adjust this as time goes by (manually).

If my comment helps, please give it a thumbs up!


Log_wrangler
Builder

Wow, thank you for the very clear and extensive explanation! As you mention, the costs vs. benefits for this type of scenario may not be feasible. I was hoping this had already been developed by AWS and Splunk. My example of 2 users per search head was just to illustrate the concept of scaling SHs up and down for availability purposes. For now I am going to take your advice and manually adjust the number of search heads.

If you would not mind giving some more advice...
Is it possible to set up a SH in AWS that will expand compute power as needed? For example, if I have one SH just for alerts and it needs more CPU power as the number of alerts grows...

Thank you


nickhills
Ultra Champion

This is possible - but not without a bit of disruption.

Let's assume you have a separate orchestration server from which to run some scripts (or you can probably do it with a Lambda function for bonus points).

You could watch a CPU metric from somewhere - SNMP, CloudWatch, the Splunk REST API, or even just a Splunk search monitoring the server load on your alert SH.
However you decide you need more resource, you then need to issue a few AWS API commands:
1.) Shut down your alert SH (this is the disruption) - ideally do this by hardcoding its instance ID to avoid mistakes
2.) Resize the instance to a larger instance type
3.) Start the instance.

Aside from a quick reboot, since you don't expect there to be users on this box, disruption would be limited. Obviously you can do the inverse if the box has been quiet for a few hours and resize it back down.
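The three steps above map directly onto EC2 API calls. Here's a sketch written against a boto3-style EC2 client, so the ordering can be exercised with a stub; the instance ID and type a caller passes in are placeholders:

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop -> resize -> start, in that order, for one well-known box.
    `ec2` is a boto3 EC2 client (or any stub exposing the same methods);
    `instance_id` should be hardcoded by the caller, per the advice
    above, to avoid resizing the wrong machine."""
    ec2.stop_instances(InstanceIds=[instance_id])
    # The instance type can only be changed while the instance is stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```

Scaling back down is the same call with a smaller type.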

Be careful if you plan to reserve the instance (which can give in excess of a 60% saving): if your RI is for an m4.2xlarge and you bump up to an m4.4xlarge, your RI is (maybe) unused.

If my comment helps, please give it a thumbs up!

Log_wrangler
Builder

Thank you for the insight.
