Solved: Splunk daemon stops responding when many connectio...

keerthana_k · ‎11-13-2016

Hi,

We have a Splunk deployment where we expose a splunk app through REST. We connect to the app and run approximately 50 searches every 5 minutes. However, after a while we observe that the searches stop running. Upon debugging we found the following:

In the webservice.log file, we got the following error:

2016-11-06 06:28:32,809 ERROR [581ecd72c57f75d01abbd0] init:479 - Socket error communicating with splunkd (error=('ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login
2016-11-06 06:29:02,853 ERROR [581ecd90d17f75d014d690] __init_:479 - Socket error communicating with splunkd (error=('_ssl.c:587: The handshake operation timed out',)), path = https://localhost:8089/services/auth/login

In the python.log file, we got:

Splunkd daemon is not responding: ("Error connecting to https://localhost:8089/services/auth/login: ('_ssl.c:587: The handshake operation timed out',)",)

In splunkd.log file, we got the following error:

11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Exception while processing request from 1.1.1.1 for /en-US/custom/XXX/YYY/URLparameters: Connection closed by peer
11-13-2016 23:18:58.662 +0000 ERROR HttpListener - Handler for  /en-US/custom/XXX/YYY/URLparameters sent a 0 byte response after earlier claiming a Content-Length of 26!

We ran the command netstat -neap | grep 8089 and got the following output:

tcp        0      0 0.0.0.0:8089                0.0.0.0:*                   LISTEN      0          136045514
tcp        0      0 127.0.0.1:8089              127.0.0.1:52914             SYN_RECV    0          0
tcp        0      0 127.0.0.1:8089              127.0.0.1:52908             SYN_RECV    0          0
tcp        1      0 127.0.0.1:8089              127.0.0.1:45149             CLOSE_WAIT  0          139900826     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:45984             CLOSE_WAIT  0          140090123     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:42823             CLOSE_WAIT  0          140067803     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:33117             CLOSE_WAIT  0          140194381     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:48657             CLOSE_WAIT  0          139922911     14993/splunkd 
tcp        1      0 127.0.0.1:8089              127.0.0.1:60258             CLOSE_WAIT  0          140186060     14993/splunkd

The ulimit value for the system is set as 4096 and we found event in splunkd.log saying:

11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 sockets
11-14-2016 04:13:25.605 +0000 INFO  loader - Limiting REST HTTP server to 1365 threads

How do we fix this issue? When does the CLOSE_WAIT status appear for any connection? Any help is greatly appreciated.

Thanks,
Keerthana

strive · ‎01-06-2017

The issue would be due to Global Timeout parameters set in some python libraries.

Its better to set timeout explicitly.

Refer: https://answers.splunk.com/answers/477671/how-to-resolve-lots-of-splunk-internal-connections.html

View solution in original post

strive · ‎01-06-2017

The issue would be due to Global Timeout parameters set in some python libraries.

Its better to set timeout explicitly.

Refer: https://answers.splunk.com/answers/477671/how-to-resolve-lots-of-splunk-internal-connections.html

keerthana_k · ‎11-14-2016

Update: All these errors occur after splunk daemon stops responding to requests. Upon further debugging, we found that when splunk daemon stops, the rest.simpleRequest method which is called for authentication times out without closing the connection which results in many threads in CLOSE_WAIT state. We really want to know why exactly splunk daemon stops responding even when the splunk status command shows that splunkd is running.

gjanders · ‎11-14-2016

CLOSE_WAIT is a TCP state, you can find an explanation online of what exactly that means.

I would increase your ulimit for the number of file descriptors.

Splunk ulimit troubleshooting documentation says 8192 minimum, I run much, much larger numbers on the indexers.
4192 is not enough for Splunk...

-
Alerts for Splunk Admins, Version Control for Splunk, Decrypt2 VersionControl For SplunkCloud

keerthana_k · ‎11-14-2016

This issue could still occur after we increase the ulimit when the number of file descriptors reaches 33% of set ulimit value.

Splunk daemon stops responding when many connections are made

Automatic Discovery Part 1: What is Automatic Discovery in Splunk Observability Cloud ...

Real-Time Fraud Detection: How Splunk Dashboards Protect Financial Institutions

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...

Are you a member of the Splunk Community?

Splunk daemon stops responding when many connections are made

Automatic Discovery Part 1: What is Automatic Discovery in Splunk Observability Cloud ...

Real-Time Fraud Detection: How Splunk Dashboards Protect Financial Institutions

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...