I recently joined a project at a site with an Enterprise Security (ES) search head and a few ad-hoc search heads. Apparently, a new issue started a few days before I got here: the ES search head (the others are fine) becomes unresponsive in the web UI until it is restarted. The web daemon and splunkd are still running, and alerts and searches keep running as well.
splunkd is flooded with "ERROR HttpClientRequest - HTTP client error=Read Timeout while connecting to server" messages and little else. I'm trying to help them find the cause, but I've never seen anything like this before.
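One way to check whether the timeouts line up with the moments the web UI hangs is to count them per minute in splunkd.log (by default under $SPLUNK_HOME/var/log/splunk/). The sketch below is only an illustration: it assumes the usual splunkd.log line layout ("MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL Component ... - message"), and the function name timeouts_per_minute is hypothetical, not part of any Splunk tooling.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Assumed splunkd.log layout: "MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL Component [thread] - message".
# Only the timestamp (to the minute), level, and component are captured.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}):\d{2}\.\d{3} [+-]\d{4}\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+)"
)
NEEDLE = "Read Timeout while connecting to server"


def timeouts_per_minute(lines):
    """Count ERROR HttpClientRequest read-timeout lines, bucketed by minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        m = LINE_RE.match(line)
        if not m or m["level"] != "ERROR" or m["component"] != "HttpClientRequest":
            continue
        minute = datetime.strptime(m["ts"], "%m-%d-%Y %H:%M")
        counts[minute] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_timeouts.py $SPLUNK_HOME/var/log/splunk/splunkd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for minute, n in sorted(timeouts_per_minute(f).items()):
            print(f"{minute:%Y-%m-%d %H:%M}  {n}")
```

A sudden jump in the per-minute count shortly before each hang would suggest the timeouts are a symptom of whatever is starving the search head, rather than background noise.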
How does system resource usage look on your search heads? Is there a large number of concurrent searches? You may want to check the "Search Activity: Instance" dashboard in the Monitoring Console to see whether any running searches are using a large amount of memory.