We have a Splunk deployment where we expose a splunk app through REST. We connect to the app and run approximately 50 searches every 5 minutes. However, after a while we observe that the searches stop running. Upon debugging we found the following:
In the webservice.log file, we got the following error:
2016-11-06 06:28:32,809 ERROR [581ecd72c57f75d01abbd0] init:479 - Socket error communicating with splunkd (error=('ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login
2016-11-06 06:29:02,853 ERROR [581ecd90d17f75d014d690] _init_:479 - Socket error communicating with splunkd (error=('ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login
In the python.log file, we got:
Splunkd daemon is not responding: ("Error connecting to https://localhost:8089/services/auth/login: ('_ssl.c:587: The handshake operation timed out',)",)
In splunkd.log file, we got the following error:
11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Exception while processing request from 18.104.22.168 for /en-US/custom/XXX/YYY/URLparameters: Connection closed by peer 11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Handler for /en-US/custom/XXX/YYY/URLparameters sent a 0 byte response after earlier claiming a Content-Length of 26!
We ran the command
netstat -neap | grep 8089 and got the following output:
tcp 0 0 0.0.0.0:8089 0.0.0.0:* LISTEN 0 136045514 tcp 0 0 127.0.0.1:8089 127.0.0.1:52914 SYN_RECV 0 0 tcp 0 0 127.0.0.1:8089 127.0.0.1:52908 SYN_RECV 0 0 tcp 1 0 127.0.0.1:8089 127.0.0.1:45149 CLOSE_WAIT 0 139900826 14993/splunkd tcp 1 0 127.0.0.1:8089 127.0.0.1:45984 CLOSE_WAIT 0 140090123 14993/splunkd tcp 1 0 127.0.0.1:8089 127.0.0.1:42823 CLOSE_WAIT 0 140067803 14993/splunkd tcp 1 0 127.0.0.1:8089 127.0.0.1:33117 CLOSE_WAIT 0 140194381 14993/splunkd tcp 1 0 127.0.0.1:8089 127.0.0.1:48657 CLOSE_WAIT 0 139922911 14993/splunkd tcp 1 0 127.0.0.1:8089 127.0.0.1:60258 CLOSE_WAIT 0 140186060 14993/splunkd
The ulimit value for the system is set as 4096 and we found event in splunkd.log saying:
11-14-2016 04:13:25.605 +0000 INFO loader - Limiting REST HTTP server to 1365 sockets 11-14-2016 04:13:25.605 +0000 INFO loader - Limiting REST HTTP server to 1365 threads
How do we fix this issue? When does the CLOSE_WAIT status appear for any connection? Any help is greatly appreciated.
CLOSE_WAIT is a TCP state, you can find an explanation online of what exactly that means.
I would increase your ulimit for the number of file descriptors.
Splunk ulimit troubleshooting documentation says 8192 minimum, I run much, much larger numbers on the indexers.
4192 is not enough for Splunk...
This issue could still occur after we increase the ulimit when the number of file descriptors reaches 33% of set ulimit value.
Update: All these errors occur after splunk daemon stops responding to requests. Upon further debugging, we found that when splunk daemon stops, the
rest.simpleRequest method which is called for authentication times out without closing the connection which results in many threads in CLOSE_WAIT state. We really want to know why exactly splunk daemon stops responding even when the
splunk status command shows that splunkd is running.
The issue would be due to Global Timeout parameters set in some python libraries.
Its better to set timeout explicitly.