I have a cluster which sometimes reports one of the indexers as off-line (unable to distribute search to... bla bla bla). Usually when I connect to such an indexer it is under heavy load, so I just assumed that, for some reason I haven't had time to investigate so far, jobs had piled up on this indexer and the problem would simply go away, which it usually did.
But today one indexer was still being reported as offline in the monitoring console some two hours later, so I started to take notice.
It turns out that it had run out of available threads for processing requests, since...
# ss -ptn | grep CLOSE-WAIT | wc -l
7056
That's not a normal state for a server. All other indexers had a nice round zero of CLOSE-WAIT connections.
These were all incoming connections to port 8089, and none of them were from forwarders.
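A breakdown by peer address makes it easier to see where such connections come from. The following is only a sketch: the function name close_wait_by_peer is made up for illustration, and it assumes the usual `ss -tn` column layout (state, Recv-Q, Send-Q, local address:port, peer address:port).

```shell
# Count CLOSE-WAIT sockets on one local port, grouped by peer address.
# Reads `ss -tn` output on stdin. The function name is hypothetical and the
# default port 8089 is taken from the situation described above.
close_wait_by_peer() {
    port="${1:-8089}"
    awk -v port="$port" '
        $1 == "CLOSE-WAIT" {
            # $4 is local address:port, $5 is peer address:port
            n = split($4, local_parts, ":")
            if (local_parts[n] != port) next   # keep only the requested local port
            peer = $5
            sub(/:[0-9]+$/, "", peer)          # drop the peer port, keep the address
            count[peer]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -rn                               # busiest peers first
}

# Usage on the affected indexer:
#   ss -tn | close_wait_by_peer 8089
```

Each output line is a count followed by a peer address, so a single host holding thousands of half-closed connections stands out immediately.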
And now I'm perplexed, since CLOSE-WAIT is usually a sign of an application error (the peer has already closed its side, but the local process has not yet closed the socket). If it were simply TIME-WAIT, I'd say those were just some lost FIN/ACK packets and the situation would return to normal after the proper timeout. But CLOSE-WAIT?
The patient is 8.1.4 on SLES 12SP3 (kernel 4.4.180-94.100-default)