Getting Data In

Splunk daemon stops responding when many connections are made

keerthana_k
Communicator

Hi,

We have a Splunk deployment in which we expose a Splunk app through REST. We connect to the app and run approximately 50 searches every 5 minutes. However, after a while the searches stop running. While debugging we found the following:

In the webservice.log file, we got the following error:

2016-11-06 06:28:32,809 ERROR [581ecd72c57f75d01abbd0] __init__:479 - Socket error communicating with splunkd (error=('_ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login
2016-11-06 06:29:02,853 ERROR [581ecd90d17f75d014d690] __init__:479 - Socket error communicating with splunkd (error=('_ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login

In the python.log file, we got:

Splunkd daemon is not responding: ("Error connecting to https://localhost:8089/services/auth/login: ('_ssl.c:587: The handshake operation timed out',)",)

In the splunkd.log file, we got the following error:

11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Exception while processing request from 1.1.1.1 for /en-US/custom/XXX/YYY/URLparameters: Connection closed by peer
11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Handler for  /en-US/custom/XXX/YYY/URLparameters sent a 0 byte response after earlier claiming a Content-Length of 26!

We ran the command netstat -neap | grep 8089 and got the following output:

tcp        0      0 0.0.0.0:8089                0.0.0.0:*                   LISTEN      0          136045514
tcp        0      0 127.0.0.1:8089              127.0.0.1:52914             SYN_RECV    0          0
tcp        0      0 127.0.0.1:8089              127.0.0.1:52908             SYN_RECV    0          0
tcp        1      0 127.0.0.1:8089              127.0.0.1:45149             CLOSE_WAIT  0          139900826     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:45984             CLOSE_WAIT  0          140090123     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:42823             CLOSE_WAIT  0          140067803     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:33117             CLOSE_WAIT  0          140194381     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:48657             CLOSE_WAIT  0          139922911     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:60258             CLOSE_WAIT  0          140186060     14993/splunkd 

The ulimit value for the system is set to 4096, and we found events in splunkd.log saying:

11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 sockets
11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 threads 

How do we fix this issue? When does a connection enter the CLOSE_WAIT state? Any help is greatly appreciated.

Thanks,
Keerthana

1 Solution

strive
Influencer

The issue could be due to global timeout parameters set in some Python libraries.

It's better to set the timeout explicitly.

Refer: https://answers.splunk.com/answers/477671/how-to-resolve-lots-of-splunk-internal-connections.html
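The suggestion can be sketched in plain Python. splunk.rest.simpleRequest is only available inside a Splunk install, so the sketch below uses only the standard library; the 30-second value and the connection target are illustrative assumptions, not values from this thread.

```python
# Sketch: set an explicit timeout instead of relying on a global/default one.
import socket
import http.client

# Option 1: a process-wide default applied to every new socket, including
# sockets opened by libraries that never pass a timeout themselves.
socket.setdefaulttimeout(30.0)  # seconds; value is an assumption

# Option 2 (usually preferable): a per-connection timeout, so one slow
# endpoint cannot leave a request hanging indefinitely.
conn = http.client.HTTPSConnection("localhost", 8089, timeout=30)
```

Either way, a request that hits the timeout raises an exception the caller can catch, instead of blocking forever on an unresponsive splunkd.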




keerthana_k
Communicator

Update: all these errors occur after the Splunk daemon stops responding to requests. On further debugging, we found that when splunkd stops responding, the rest.simpleRequest call made for authentication times out without closing the connection, which leaves many connections in the CLOSE_WAIT state. We would really like to know why splunkd stops responding even though the splunk status command shows that splunkd is running.


gjanders
SplunkTrust

CLOSE_WAIT is a TCP state: the remote peer has closed its end of the connection, but the local application has not yet called close() on its socket, so the file descriptor stays allocated.
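To make the state concrete, here is a small self-contained sketch (plain Python sockets, nothing Splunk-specific) of the situation the netstat output in the question shows: the peer closes, the local side reads EOF, and until it calls close() its socket sits in CLOSE_WAIT.

```python
# Minimal illustration of how a socket ends up in CLOSE_WAIT: the peer
# closes its end, but the local application never closes its own socket.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # ephemeral port
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

client.close()                     # peer closes -> FIN arrives at `conn`
data = conn.recv(1024)             # EOF: recv() returns b''

# At this point `conn` is in CLOSE_WAIT (visible with netstat/ss) and will
# stay there until the application calls conn.close() or the process exits.
conn.close()
server.close()
```

This matches the update above: if a request times out and the calling code never closes the socket, each failed attempt leaks one descriptor in CLOSE_WAIT.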

I would increase your ulimit for the number of file descriptors.

The Splunk ulimit troubleshooting documentation says 8192 is the minimum; I run much, much larger numbers on the indexers.
4096 is not enough for Splunk...
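For reference, a quick way to see the file-descriptor limit a process actually inherits (a sketch; the 'splunk' user name and the 65536 value in the comments are assumptions about a typical Linux setup, not from this thread):

```python
# Check the file-descriptor (nofile) limit for the current process; splunkd's
# "Limiting REST HTTP server" messages are derived from the same rlimit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft =", soft, "hard =", hard)

# Raising it persistently is an OS-level change, not a Python one, e.g.:
#   /etc/security/limits.conf:   splunk  -  nofile  65536
# or, on systemd hosts, LimitNOFILE=65536 in the splunkd unit file,
# followed by a daemon-reload and a splunkd restart.
```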

Alerts for Splunk Admins https://splunkbase.splunk.com/app/3796/
Version Control for Splunk https://splunkbase.splunk.com/app/4355/

keerthana_k
Communicator

This issue could still occur even after we increase the ulimit, once the number of open file descriptors reaches 33% of the configured ulimit value.
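The 33% figure seems consistent with the loader messages in the question: splunkd appears to cap its REST HTTP server at one third of the ulimit, so raising the ulimit raises the cap proportionally (a quick check; the one-third relationship is inferred from the logs above, not from documentation):

```python
# The REST HTTP server cap of 1365 sockets reported in splunkd.log is one
# third of the 4096 ulimit (integer division).
ulimit_nofile = 4096
rest_cap = ulimit_nofile // 3
print(rest_cap)            # matches the "Limiting REST HTTP server" log line

# With the documented 8192 minimum, the cap roughly doubles:
rest_cap_8192 = 8192 // 3
```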
