Solved: Re: What is the best practice for bringing down a ...

myandow · ‎07-08-2016

I'm trying to find some guidance on the best way to bring down a member of a search head cluster without impacting any of the searches that are currently running on it. What I have found in my limited testing is that following this documentation: http://docs.splunk.com/Documentation/Splunk/6.2.2/DistSearch/Removeaclustermember will cause the searches running on the search head to be orphaned and they are no longer viewable in the job manager on the remaining active cluster members.

Is there a best practice when preparing for a scheduled maintenance to bring down a member of a search head cluster so that there will be no impact to the system as a whole?

sloshburch · ‎10-21-2016

Alternate solution based on the discussion we had (finally documenting it):

When doing maintenance, configure the load balancer to stop sending users to the search head that is about to be worked on. That way no new ad hoc searches will be started on that member. After about ten minutes (the default TTL of ad hoc search artifacts), go ahead and do the maintenance on that host. Then re-enable the load balancer to send traffic there again. Rinse and repeat with your next SHC member.

The SHC itself will ensure scheduled searches complete successfully, so you don't have to worry about losing a scheduled search. The load balancer will ensure no ad hoc searches land on that host. The SHC will coordinate re-election of the captain as needed.
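
For anyone who wants to script this, here is a minimal sketch of the drain, wait, maintain, restore loop described above. The `drain`, `maintain`, and `restore` callables are hypothetical placeholders for whatever your load balancer and maintenance tooling provide; nothing here calls a real Splunk or load balancer API. The 600-second wait is an assumption taken from the "about ten minutes" figure.

```python
import time
from typing import Callable, Iterable

# Assumption: "about ten minutes", as described in the post above.
AD_HOC_TTL_SECONDS = 600


def rolling_maintenance(
    members: Iterable[str],
    drain: Callable[[str], None],      # hypothetical: take member out of the LB pool
    maintain: Callable[[str], None],   # hypothetical: the actual maintenance work
    restore: Callable[[str], None],    # hypothetical: put member back in the LB pool
    wait_seconds: float = AD_HOC_TTL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Work through the cluster one member at a time."""
    for member in members:
        drain(member)        # no new ad hoc searches land on this member
        sleep(wait_seconds)  # give existing ad hoc artifacts time to expire
        maintain(member)     # an exception here stops the loop before restore
        restore(member)      # send user traffic to this member again
```

If `maintain` raises, the loop stops and the member stays out of the pool, which is probably what you want while you investigate.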

myandow · ‎10-24-2016

I just wanted to add that, to avoid impacting any long-running ad hoc searches, you should first check the search activity in the DMC for the search head in question and verify that there are no active ad hoc searches. Once that has been verified, wait 10 minutes before starting the maintenance.
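
If you would rather automate that check than eyeball the DMC, something along these lines might work as a starting point. It is only a sketch: it assumes you have already fetched the job list for that search head (for example from the search jobs REST endpoint) and normalized each job into a dict with boolean `isDone` and `isSavedSearch` keys. Those key names and their types are assumptions here, so verify them against what your Splunk version actually returns.

```python
from typing import Iterable, List, Mapping


def active_ad_hoc_jobs(jobs: Iterable[Mapping[str, object]]) -> List[Mapping[str, object]]:
    """Return jobs that are still running and were not started by a saved search.

    Assumes each job dict carries boolean 'isDone' and 'isSavedSearch' keys.
    """
    return [
        job for job in jobs
        if not job.get("isDone") and not job.get("isSavedSearch")
    ]


def safe_to_start_waiting(jobs: Iterable[Mapping[str, object]]) -> bool:
    """True when no ad hoc search is still running on the member."""
    return not active_ad_hoc_jobs(jobs)
```

Only once `safe_to_start_waiting` returns true would you start the ten-minute wait.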

sloshburch · ‎07-12-2016

It sounds like no one knows, and as such, there might not be a good way to do it. I have heard that this is such a valid question that it's recommended you generate a P4 ticket for such a feature to be added as a Core Improvement Request.

myandow · ‎07-12-2016

As a follow up question, will a rolling restart of the cluster kill off running searches as well?

sloshburch · ‎07-13-2016

It's not documented so let's assume yes (I haven't tested it so I'm not 100% confident).

This is a good opportunity to post at the bottom of the corresponding documentation page and ask for this behavior to be documented.

As such, I've posted your question and a link to this page on the respective page: http://docs.splunk.com/Documentation/Splunk/6.4.2/DistSearch/RestartSHC

The docs team is amazing and I have full confidence they'll be able to help out.

sloshburch · ‎07-22-2016

After hunting down this info even more, some additional details were highlighted:

"for scheduled searches, the captain will try to reschedule if it never gets a SUCCESS from the member. Also, if a node is removed it stops heartbeating to the captain and once it reaches heartbeat_timeout (server.conf) captain removes all the searches delegated to that peer as failed searches leading to retry of search delegation"

So that means if there's a rolling restart and a scheduled search gets killed, the captain will see that it got killed and kick it off again. If it's an ad hoc search, the user would be directed to another SHC member, where they would have to run it again.

Also, in terms of ad hoc searches: "It runs on the member on which the user executed it, so if that member goes down, the search is lost. Ad hoc search artifacts only stick around for ten minutes (by default), so, by nature, they are highly ephemeral anyway"
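
To make the difference concrete, here is a toy model of what happens to each kind of search when a member stops. It is not Splunk code and it does not reflect the real captain implementation. The `Search` class and the outcome strings are made up for illustration, and it ignores timing (such as `heartbeat_timeout`) and the case where the stopped member is the captain.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Search:
    sid: str         # hypothetical search identifier
    member: str      # SHC member the search is running on
    scheduled: bool  # True for scheduled searches, False for ad hoc


def outcome_when_member_stops(searches: Iterable[Search], stopped_member: str) -> Dict[str, str]:
    """Classify each search according to the behavior quoted above."""
    outcomes = {}
    for s in searches:
        if s.member != stopped_member:
            outcomes[s.sid] = "unaffected"
        elif s.scheduled:
            # Captain never gets a SUCCESS, so it retries the delegation.
            outcomes[s.sid] = "re-delegated by captain"
        else:
            # Ad hoc searches live only on the member where they were run.
            outcomes[s.sid] = "lost; user must rerun"
    return outcomes
```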

Let us know if there are any other questions!

sloshburch · ‎07-25-2016

It might be worth highlighting that I think the ad hoc behavior is the same whether we're talking about a standalone SH, SHP, or SHC. If Splunk stops, it will stop the search, and users would not be notified that the search was terminated.

myandow · ‎07-22-2016

We have a lot of users that kick off long running searches and then background them. Are those still considered ad-hoc searches? It is not ideal if a user does this, and then never gets a notification that the search completed or failed to complete, due to a scheduled maintenance.

Steve_G_ · ‎07-22-2016

If the user kicks it off, then, yes, it's an ad hoc search.
