I'm looking for guidance on the best way to bring down a member of a search head cluster without impacting the searches currently running on it. In my limited testing, following this documentation: http://docs.splunk.com/Documentation/Splunk/6.2.2/DistSearch/Removeaclustermember causes the searches running on that search head to be orphaned, and they are no longer viewable in the job manager on the remaining active cluster members.
Is there a best practice when preparing for a scheduled maintenance to bring down a member of a search head cluster so that there will be no impact to the system as a whole?
Alternate solution based on the discussion we had (finally documenting it):
When doing maintenance, configure the load balancer to stop sending users to the search head that is about to be worked on. That way no new ad hoc searches will be started there. After about ten minutes (the default TTL of search artifacts), go ahead and do the maintenance on that host. Then re-enable the load balancer to send traffic there again. Rinse and repeat with your next SHC member.
The SHC itself will ensure scheduled searches complete successfully - so you don't have to worry about losing a scheduled search. The load balancer will ensure no new ad hoc searches land on that host. The SHC will coordinate re-election of the captain as needed.
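The steps above could be scripted roughly like this. It's only a sketch: `drain`, `maintain` and `restore` are hypothetical callables you would have to write for your own load balancer and maintenance tooling (nothing here is a Splunk or load balancer API), and the 600 second wait is just the "about ten minutes" mentioned above.

```python
import time

# Roughly ten minutes, matching the wait suggested above (an assumption;
# adjust it if your search artifact TTL is different).
DRAIN_WAIT_SECONDS = 600


def rolling_maintenance(members, drain, maintain, restore,
                        wait_seconds=DRAIN_WAIT_SECONDS, sleep=time.sleep):
    """Work through SHC members one at a time.

    drain(member)    -- hypothetical: tell the load balancer to stop sending users
    maintain(member) -- hypothetical: whatever maintenance you need to do
    restore(member)  -- hypothetical: put the member back behind the load balancer
    """
    completed = []
    for member in members:
        drain(member)          # no new ad hoc searches should land here
        sleep(wait_seconds)    # give existing ad hoc searches time to finish
        maintain(member)       # if this raises, the member stays drained
        restore(member)        # send user traffic back to this member
        completed.append(member)
    return completed
```

Only one member is out of rotation at any time, so the rest of the cluster keeps serving users. If `maintain` fails, the loop stops and that member is left drained on purpose, so you can look at it before traffic returns.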
I just wanted to add that, to avoid impacting any long-running ad hoc searches, you should first check the search activity in the DMC (Distributed Management Console) for the search head in question and verify there are no active ad hoc searches. Once that has been verified, wait 10 minutes before starting the maintenance.
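If you'd rather script that check than eyeball it in the DMC, something like the following could work on a list of job records you have already pulled from the search head. This is only a sketch under assumptions: the field names `isDone` and `isSavedSearch` are modeled on what I'd expect job records from the REST API to look like, so verify them against your version before relying on it. The function names are made up for this example.

```python
def active_adhoc_jobs(jobs):
    """Return the jobs that are still running and were not started by the scheduler.

    `jobs` is a list of dicts, one per search job. The keys `isDone` and
    `isSavedSearch` are assumed field names; check them on your deployment.
    """
    return [
        job for job in jobs
        if not job.get("isDone", False)             # still running
        and not job.get("isSavedSearch", False)     # kicked off by a user
    ]


def safe_to_start_wait(jobs):
    """True when no active ad hoc searches remain on the search head."""
    return len(active_adhoc_jobs(jobs)) == 0
```

A job with a missing field is treated as a running ad hoc search, so the check errs on the side of waiting.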
It sounds like no one knows, and as such, there might not be a good way to do it. I have heard that this is such a valid question that it's recommended you generate a P4 ticket for such a feature to be added as a Core Improvement Request.
As a follow up question, will a rolling restart of the cluster kill off running searches as well?
It's not documented so let's assume yes (I haven't tested it so I'm not 100% confident).
This is a good opportunity to post at the bottom of the corresponding documentation page and ask for this behavior to be spelled out.
As such, I've posted your question and a link to this page on the respective page: http://docs.splunk.com/Documentation/Splunk/6.4.2/DistSearch/RestartSHC
The docs team is amazing and I have full confidence they'll be able to help out.
After hunting down this info even more, some additional details were highlighted:
Let us know if there are any other questions!
It might be worth highlighting that I think the ad hoc behavior is the same whether we're talking about a standalone SH, SHP or SHC. If Splunk stops, the search stops with it, and users are not notified that the search was terminated.
We have a lot of users who kick off long-running searches and then background them. Are those still considered ad hoc searches? It is not ideal if a user does this and then never gets a notification that the search completed, or failed to complete, because of a scheduled maintenance.
If the user kicks it off, then, yes, it's an ad hoc search.