Getting Data In

Splunk daemon stops responding when many connections are made

Communicator

Hi,

We have a Splunk deployment in which we expose a Splunk app through REST. We connect to the app and run approximately 50 searches every 5 minutes. However, after a while we observe that the searches stop running. Upon debugging, we found the following:

In the webservice.log file, we got the following error:

2016-11-06 06:28:32,809 ERROR [581ecd72c57f75d01abbd0] __init__:479 - Socket error communicating with splunkd (error=('_ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login
2016-11-06 06:29:02,853 ERROR [581ecd90d17f75d014d690] __init__:479 - Socket error communicating with splunkd (error=('_ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login

In the python.log file, we got:

Splunkd daemon is not responding: ("Error connecting to https://localhost:8089/services/auth/login: ('_ssl.c:587: The handshake operation timed out',)",)

In the splunkd.log file, we got the following errors:

11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Exception while processing request from 1.1.1.1 for /en-US/custom/XXX/YYY/URLparameters: Connection closed by peer
11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Handler for  /en-US/custom/XXX/YYY/URLparameters sent a 0 byte response after earlier claiming a Content-Length of 26!

We ran the command netstat -neap | grep 8089 and got the following output:

tcp        0      0 0.0.0.0:8089                0.0.0.0:*                   LISTEN      0          136045514
tcp        0      0 127.0.0.1:8089              127.0.0.1:52914             SYN_RECV    0          0
tcp        0      0 127.0.0.1:8089              127.0.0.1:52908             SYN_RECV    0          0
tcp        1      0 127.0.0.1:8089              127.0.0.1:45149             CLOSE_WAIT  0          139900826     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:45984             CLOSE_WAIT  0          140090123     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:42823             CLOSE_WAIT  0          140067803     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:33117             CLOSE_WAIT  0          140194381     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:48657             CLOSE_WAIT  0          139922911     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:60258             CLOSE_WAIT  0          140186060     14993/splunkd 
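
For reference, a quick way to summarize the TCP states on the management port while debugging is a short script like the following (a minimal sketch using only the Python standard library; the port number and the plain "netstat -nt" invocation are assumptions based on the output above):

# Summarize TCP connection states on the splunkd management port (8089)
# to spot a build-up of CLOSE_WAIT sockets like the one shown above.
import subprocess
from collections import Counter

output = subprocess.check_output(['netstat', '-nt']).decode()
states = Counter(
    fields[5]
    for fields in (line.split() for line in output.splitlines())
    if len(fields) > 5 and ':8089' in fields[3]
)
print(states)  # e.g. Counter({'CLOSE_WAIT': 6, 'SYN_RECV': 2})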

The ulimit value for the system is set to 4096, and we found events in splunkd.log saying:

11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 sockets
11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 threads 

How do we fix this issue? Under what conditions does a connection end up in the CLOSE_WAIT state? Any help is greatly appreciated.

Thanks,
Keerthana


Re: Splunk daemon stops responding when many connections are made

SplunkTrust

CLOSE_WAIT is a TCP state: it means the remote side has closed the connection, but the local process (splunkd here) has not yet closed its end of the socket. You can find a more detailed explanation online.

I would increase your ulimit for the number of file descriptors.

Splunk's ulimit troubleshooting documentation says 8192 is the minimum; I run much, much larger numbers on the indexers.
4096 is not enough for Splunk...
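
If you want to double-check what limit the splunkd process actually inherits, here is a minimal sketch (assuming you run it as the user, and from the environment, that starts Splunk); note that the 1365 in your loader messages is almost exactly one third of a 4096 soft limit:

# Print the file descriptor limits the current process inherits.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft nofile limit: %d (roughly %d usable as REST sockets)" % (soft, soft // 3))
print("hard nofile limit: %d" % hard)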


Re: Splunk daemon stops responding when many connections are made

Communicator

This issue could still occur after we increase the ulimit, because it shows up once the number of open file descriptors reaches about 33% of the configured ulimit (splunkd limits its REST HTTP server to roughly a third of that value, as the loader messages above show).


Re: Splunk daemon stops responding when many connections are made

Communicator

Update: all of these errors occur after the Splunk daemon stops responding to requests. Upon further debugging, we found that when splunkd stops responding, the rest.simpleRequest call used for authentication times out without closing the connection, which leaves many connections in the CLOSE_WAIT state. We really want to know why splunkd stops responding in the first place, even though the splunk status command shows that splunkd is running.


Re: Splunk daemon stops responding when many connections are made

Influencer

The issue is most likely due to global timeout parameters set in some Python libraries.

It's better to set the timeout explicitly.

Refer: https://answers.splunk.com/answers/477671/how-to-resolve-lots-of-splunk-internal-connections.html
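
As a rough illustration (not the exact code from the linked answer), setting a process-wide default socket timeout in the app's Python layer makes a hung call to splunkd fail fast so the application can clean up, rather than leaving connections half-open; the 30-second value below is only an assumption to tune for your searches:

# Minimal sketch, assuming the CLOSE_WAIT sockets come from Python-side calls
# (for example the rest.simpleRequest auth call) that block when splunkd stops answering.
import socket

# Any socket operation that would otherwise hang now raises socket.timeout
# after 30 seconds, so the calling thread is released and the app can clean up.
socket.setdefaulttimeout(30)

With a default timeout in place, the handler can catch socket.timeout and fail the request cleanly instead of tying up a thread indefinitely.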
