We have a problem where our SHC captain seems to stop responding. Looking at netstat and the splunkd logs, I see a bunch of connections stuck in CLOSE_WAIT that never go away, along with errors such as "HttpListener ..... max thread limit for REST HTTP server is 5333."
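For anyone who wants to reproduce what I'm seeing, this is roughly how I'm counting the stuck sockets on the captain (a quick sketch assuming Linux with iproute2's `ss`; plain netstat shows the same thing):

```shell
# Count sockets stuck in CLOSE_WAIT (the remote side closed, but the
# local process never called close(), so each one pins a descriptor).
ss -tan | awk '$1 == "CLOSE-WAIT"' | wc -l

# Break the stuck connections down by peer address to see which hosts
# they came from (strips the trailing :port from column 5).
ss -tan | awk '$1 == "CLOSE-WAIT" {sub(/:[0-9]+$/, "", $5); print $5}' \
  | sort | uniq -c | sort -rn
```

If that first number only ever grows between checks, it's a leak rather than a transient burst.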
So the captain fails to respond to requests, and then the cluster stops working altogether. I would have thought the other members would automatically elect a new captain (we have 5 hosts total in the SHC). To remedy it, I end up having to completely reboot the captain. That brings things back, but obviously we want this setup to be more resilient.
I am hesitant to raise the thread limits in server.conf because it seems like the connection count will just keep growing until we hit whatever new ceiling we set. Ideally, the failure would be detected on the captain and captaincy would transfer to another member automatically.
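For reference, the limit I'm hesitant to raise lives under `[httpServer]` in server.conf; roughly something like this (value purely illustrative, and I have not applied it, since it seems to only postpone the failure):

```ini
# $SPLUNK_HOME/etc/system/local/server.conf -- illustrative only
[httpServer]
# Cap on REST HTTP worker threads. Leaving it at the default lets
# splunkd auto-size the pool, which I assume is where the 5333
# figure in the log message comes from.
maxThreads = 8000
```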
Does anyone have any troubleshooting suggestions or remedies?