Hi All-
We have a problem where our SHC captain seems to stop responding. Looking at netstat and the splunkd logs, we see a bunch of CLOSE_WAIT connections that persist indefinitely in netstat, and the logs show repeated errors such as "HttpListener ..... max thread limit for REST HTTP server is 5333."
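For context, this is roughly how I'm checking it each time it happens (assuming the default management port 8089 and a standard $SPLUNK_HOME layout; adjust the port/paths if yours differ):

# Count stuck CLOSE_WAIT connections on the management port
netstat -an | grep 8089 | grep CLOSE_WAIT | wc -l

# Look for the REST thread-limit errors in splunkd.log
grep "max thread limit for REST HTTP server" $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -20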
So the captain fails to respond to requests, and then the cluster stops working altogether. I would have thought the other members would automatically elect a new captain (we have 5 hosts in total in the SHC). To remedy this, I end up having to completely reboot the captain. That brings things back, but obviously we want this setup to be more resilient.
I am hesitant to raise the thread limits in server.conf because it seems like the thread count will just keep growing. Ideally, when the failure occurs on the captain, the captaincy would simply transfer to another host.
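As a stopgap, instead of rebooting, I've been looking at forcing a captaincy transfer from the CLI. If I'm reading the docs right, something like this should move the captaincy to another member (the URI and credentials below are just placeholders for our environment), though I haven't confirmed whether it still works once the captain is already in this hung state:

# Check which member is currently captain and the overall SHC state
splunk show shcluster-status -auth admin:changeme

# Transfer captaincy to the member at the given management URI
splunk transfer shcluster-captain -mgmt_uri https://other-member.example.com:8089 -auth admin:changeme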
Does anyone have any troubleshooting suggestions or remedies?
Thanks!
Did you ever find a solution to this problem? We are having the exact same issue and haven't been able to figure it out. We are currently on version 9.0 (though we had the issue well before upgrading to 9). We are connected to about 6 or 7 different index clusters in remote locations, and the network reliability for one of them isn't great. We have (for the most part) correlated the captain disconnecting with that one flaky index cluster going down.
This morning I added the following to server.conf on all of my search heads:
[httpServer]
maxThreads = -1
I will see if this helps, and if it does, I'll let you know. Could you provide any more details about your setup? I'm interested to see if it's similar to ours. Looking forward to hearing from you.
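If you want to confirm the setting actually took effect on each member (it only applies after a splunkd restart, as far as I know), btool should show it, assuming a default install path:

$SPLUNK_HOME/bin/splunk btool server list httpServer --debug | grep maxThreads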