Deployment Architecture

Splunk deployment in AWS with elastic search heads and indexers

Log_wrangler
Builder

I am planning to deploy Splunk (distributed search) in AWS.
Has anyone tested and/or verified whether it is possible to set up a deployment so that additional search heads spin up as more users need access to a search head? For example, I have a total of 10 users, and after 2 users log on to a search head, the next user logs on to a new search head instance elastically created in AWS. In this example there would be 2 users per SH, and if all 10 users were using Splunk there would be 5 search heads elastically deployed. As users logged off, search head instances would shut down until one search head remained.

Is this possible? If so, where might I find documentation on this?

Thank you

1 Solution

nickhills
Ultra Champion

There is some complexity in doing what you're proposing - and whilst I think it's possible, you can pretty much rule out it being supported.

One of the challenges is that a Search Head Cluster performs an election to choose a captain. The issue when scaling down is identifying which node is the captain before letting your ASG randomly choose a box to terminate and send the cluster into a flap/re-election. It is, however, possible to target a specific machine for retirement via the AWS API, but you are going to need some custom API tooling to choose the right node.
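To make the "don't terminate the captain" rule concrete, here's a minimal sketch in Python. It assumes you have already parsed the SHC members REST endpoint into plain dicts; the `label`, `is_captain`, and `active_searches` field names are placeholders for illustration, not the real response schema:

```python
def pick_retirement_candidate(members):
    """Given SHC members (already parsed from the cluster REST endpoint),
    return one that is safe to terminate, i.e. not the current captain.
    Field names here are illustrative placeholders."""
    candidates = [m for m in members if not m["is_captain"]]
    if not candidates:
        raise RuntimeError("no non-captain members available to retire")
    # Prefer the member with the fewest running searches to minimise impact.
    return min(candidates, key=lambda m: m.get("active_searches", 0))
```

The chosen node's instance ID would then be passed to the AWS API (e.g. a targeted `terminate-instance-in-auto-scaling-group` call) rather than letting the ASG pick at random.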

Secondly, because a user's interaction with a SHC is not stateless, when you retire a box any users on that system are going to suffer interrupted searches. To work around this, you would need to query the Splunk REST API and make sure the given box has no active users before targeting it for retirement. As you have probably guessed, this pretty much rules out using the built-in AWS ASG scale-down features - you could write all the scale-down logic yourself based off a CloudWatch metric, but it needs careful thought.
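A sketch of that "no active users" gate, again assuming you've already fetched per-member counts from the REST API (the `active_users`/`active_searches` keys are placeholder names, not real Splunk fields):

```python
def safe_to_retire(member):
    """A member is safe to retire only when no users are logged in and no
    searches are running, per counts fetched from the Splunk REST API."""
    return member["active_users"] == 0 and member["active_searches"] == 0

def retirement_targets(members):
    """Return the labels of all members currently safe to terminate."""
    return [m["label"] for m in members if safe_to_retire(m)]
```

Your scale-down script would only act when this list is non-empty; if it stays empty, the cluster simply doesn't shrink that cycle.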

Scaling up is probably only slightly easier - if you have a CloudWatch metric to base your scale-up event on, the ASG can help you out here. Except that to scale by the number of active Splunk users, you will probably want to create a custom CloudWatch metric which you populate from the results of the Splunk REST API with the number of users per SH - again non-trivial, but an alert with a scripted action to register the CW metric could probably help with this.
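The scale-up arithmetic itself is simple. Here's a sketch of mapping an active-user count onto a target search head count - the value you'd publish as your custom CloudWatch metric for the scaling policy to follow (the 2-users-per-SH ratio is just the example from the question, not a sizing recommendation):

```python
import math

def desired_search_heads(active_users, users_per_sh=2, minimum=1):
    """Map a count of active Splunk users onto a target number of search
    heads, never dropping below `minimum` so one SH always remains."""
    return max(minimum, math.ceil(active_users / users_per_sh))
```

The scripted alert action would compute this and push it via a `PutMetricData`-style call; the ASG's policy then tracks the published value.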

On scale-up, the new system is going to need a bootstrap process to register the new SH with the pool - a bash script could be used to add the new node (you will probably need a dynamic CNAME so that the SHC address is predictable). But beware: you will need to manage admin credentials here, and writing them into your script is not ideal.
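As a sketch of what that bootstrap does, the snippet below builds (rather than runs) the `splunk init shcluster-config` and `splunk add shcluster-member` CLI calls a new member issues to join. Hostnames, ports, and the cluster label are placeholders, and note that credentials are deliberately not inlined - fetch them at runtime (e.g. from a secrets store) instead of baking them into the script:

```python
def bootstrap_commands(new_mgmt_uri, existing_member_uri, shcluster_label,
                       replication_port=9200):
    """Build the two Splunk CLI calls a freshly launched SH runs to join
    an existing SHC.  Auth flags are intentionally omitted: credentials
    should be injected at runtime, never hardcoded in the script."""
    init_cmd = [
        "splunk", "init", "shcluster-config",
        "-mgmt_uri", new_mgmt_uri,
        "-replication_port", str(replication_port),
        "-shcluster_label", shcluster_label,
    ]
    join_cmd = [
        "splunk", "add", "shcluster-member",
        "-current_member_uri", existing_member_uri,
    ]
    return [init_cmd, join_cmd]
```

The second command is why the predictable cluster address matters: the new box needs a known URI for an existing member to join through.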

As I touched on, you will need to manage a predictable address for the cluster - Route 53 could help here, but again you will need an AWS API call to register (and purge) records from DNS.
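The Route 53 API takes a ChangeBatch document for both registration and purging. A sketch of building one for a cluster-member CNAME (zone name and target are made up for illustration); the same structure with `"DELETE"` removes the record on scale-down:

```python
def route53_change(action, record_name, target, ttl=60):
    """Build the ChangeBatch payload for a boto3-style
    change_resource_record_sets() call.  `action` is "UPSERT" to
    register a record or "DELETE" to purge it."""
    return {
        "Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }
```

A short TTL (60s here) keeps stale entries from lingering after a node is retired.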

Now, assuming you were going to use an ELB to front the SHC, you will need to be careful that the ELB does not evict nodes before they have fully started - they go through a few restarts during registration as they join the cluster, install apps, and register with the deployer. The last thing you want is the ELB dropping your new SH before it has started up properly.
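One way to avoid that is a generous health-check grace period plus a lenient probe. A sketch of the relevant knobs, expressed as the parameters you'd hand to the ASG and target-group APIs - the 15-minute grace and the login-page probe path are assumptions to tune for your environment, not recommendations:

```python
def health_check_settings(grace_seconds=900):
    """Health-check settings that give a new SH time to restart a few
    times while joining the cluster before the ELB may fail it out.
    All values are illustrative starting points."""
    return {
        # Auto Scaling group: trust the ELB's view of health, but ignore
        # it entirely for the first `grace_seconds` after launch.
        "asg": {
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": grace_seconds,
        },
        # Target group: probe Splunk Web and require several consecutive
        # successes before marking the node healthy.
        "target_group": {
            "HealthCheckPath": "/en-US/account/login",
            "HealthyThresholdCount": 3,
            "UnhealthyThresholdCount": 5,
            "HealthCheckIntervalSeconds": 30,
        },
    }
```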

It could be a fun project, but I suspect the complexity involved could quickly surpass the cost benefits of only running one node per 2 users.

With all of the above said, a single SH can support way more than 2 users (if sized properly), so without doubt your best bet will be to size your SHC based on your anticipated number of users and adjust this as time goes by (manually).

If my comment helps, please give it a thumbs up!


Log_wrangler
Builder

Wow, thank you for the very clear and extensive explanation! As you mention, the costs vs. benefits for this type of scenario may not be feasible. I was hoping this had already been developed by AWS and Splunk. My example of 2 users per search head was just to illustrate the concept of scaling SHs up and down for availability purposes. For now I am going to take your advice and manually adjust the number of search heads.

If you would not mind giving some more advice...
Is it possible to set up a SH in AWS that will expand compute power as needed? For example, if I have one SH just for alerts and it needs more CPU power as the number of alerts grows...

Thank you


nickhills
Ultra Champion

This is possible - but not without a bit of disruption.

Let's assume you have a separate orchestration server from which to run some scripts (or you can probably do it with a Lambda function for bonus points).

You could watch a CPU metric from somewhere - SNMP, CloudWatch, the Splunk REST API, or even just a Splunk search monitoring the server load on your alert SH.
However you decide you need more resource, you then need to issue a few AWS API commands:
1.) Shut down your alert SH (this is the disruption) - ideally do this by hardcoding its instance ID to avoid mistakes
2.) Resize the instance to a larger instance type
3.) Start the instance.

Aside from a quick reboot, since you don't expect there to be users on this box, disruption would be limited. Obviously you can do the inverse if the box has been quiet for a few hours and resize it back down.
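The three steps above map directly onto EC2 API calls. Here's a sketch written against a boto3-style EC2 client, so the ordering can be exercised with a stub; the instance ID and type a caller passes in are placeholders:

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop -> resize -> start, in that order, for one well-known box.
    `ec2` is a boto3 EC2 client (or any stub exposing the same methods);
    `instance_id` should be hardcoded by the caller, per the advice
    above, to avoid resizing the wrong machine."""
    ec2.stop_instances(InstanceIds=[instance_id])
    # The instance type can only be changed while the instance is stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```

Scaling back down is the same call with a smaller type.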

Be careful if you plan to reserve the instance (which can give in excess of a 60% saving): if your RI is for an m4.2xlarge and you bump up to an m4.4xlarge, your RI is (maybe) unused.

If my comment helps, please give it a thumbs up!

Log_wrangler
Builder

Thank you for the insight.
