Splunk Search

Stability issues with db connect 2 'The read operation timed out'

ManfredGrill
Explorer

Hi,
I'm running Splunk 6.3.1 and DB Connect 2.0.6. Splunk was updated 2 days ago. The problem already showed up with earlier versions of Splunk Enterprise.

I'm monitoring some tables in an Oracle DB, remotely from the Splunk server.
The connection works fine for some days and all data is captured as expected.
Then at some point DB Connect can no longer connect to the splunkd service. This starts multiple hours (20-50 hours) after I restart the Splunk service.

from dbx.log
11/27/2015 10:04:21 [ERROR] [init.py] Socket error communicating with splunkd (error=('The read operation timed out',)), path = https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db
11/27/2015 10:04:23 [INFO] [mi_base.py] Caught exception Splunkd daemon is not responding: (u"Error connecting to https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db: ('The read operation timed out',)",) in modular input mi_input://OBJECTS_VIEW. Retrying 1 of 6.

I've enabled debugging for DB Connect, but cannot find any hint about this problem. The Splunk logs have no further info either. Splunk itself seems to work fine. Once I restart the Splunk service, DB Connect works again for some time.
Most data is captured via tail. The amount of data isn't that big.
Once this problem starts I cannot manage DB Connect via the Splunk web interface. After restarting the Splunk service, the data inputs of DB Connect are all disabled.
The Windows task list shows multiple Python interpreters running when this happens.

How can I dig deeper?
Any suggestions are welcome.
Thanks in advance


ManfredGrill
Explorer

The system has been running stable since 23/02/2016 (knock on wood).

These are the steps I've taken since my last update to this thread:

  • Monitored the TCP connections for some weeks: the Splunk server had around 80-100 open connections, the Oracle server 310-330
  • Splunk Update to 6.3.3
  • DB Connect 2 Update to 2.1.3
  • d:\splunk\etc\system\default\web.conf: changed splunkdConnectionTimeout from 30s to 60s

I can't say for sure what fixed the issue - either one of the updates or the splunkdConnectionTimeout setting.
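For reference, that setting lives in the [settings] stanza of web.conf. A minimal sketch of what the change looks like (Splunk's docs generally recommend putting overrides in etc\system\local\web.conf rather than editing the default file):

    [settings]
    # Seconds that splunkweb waits on REST calls to splunkd before timing out (default is 30)
    splunkdConnectionTimeout = 60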

Let us know of your findings.


jkat54
SplunkTrust

Let's mark this as the answer then if you don't mind. If you think my comments helped, feel free to vote on them as well.


ManfredGrill
Explorer

I had already changed splunkdConnectionTimeout in web.conf from 30s to 60s. This did not really help.
Sometimes the system runs fine for 2 weeks.
Yesterday the problem showed up again. Unfortunately I missed checking whether the servers were running out of ports. I will keep an eye on those values from now on.
Actions I took since the problem started:
- updated DB Connect 2 to version 2.1.1
- updated Splunk to version 6.3.2
- verified that ojdbc.jar matches the Oracle version in use


chrisboy68
Contributor

I have run into that too. I had to put in an alert just to tell me DB Connect 2 is having issues. Sometimes it happens twice a day, other times I can be up for weeks.

Chris


tjohnson341
Explorer

We have been seeing this issue for several months now. Socket errors are now also showing up on non-DBConnect REST calls. We have tried all of the suggested solutions with no improvement yet. Any update on your system's status?


vistate
Explorer

Hi jkat54 - when I run netstat -an while my server is still running - no error yet - I see I am at around port 62731. Can I assume it will begin to screw up if I hit port 65536?


jkat54
SplunkTrust

You can get into what is known as port exhaustion, which is where you just run out of TCP/IP ports. There are only 65535, that's (2^16)-1 ... 16 bit....

The reason I mention port exhaustion is that it takes a random amount of time before the problem happens, and only after your environment has been up and running for a while. When you run out of ports you'll see all sorts of connection errors. Sometimes it happens during high traffic and then resolves itself as connections subside; sometimes the connections never close, and after 65k have been made, exhaustion occurs (faulty code usually causes this condition). I should mention the netstat should run on the DB server too, not just Splunk. It could be either server that isn't closing connections in time / properly, or needs to keep them open for other tasks you've got set up, etc.

There are other netstat commands that will show the actual count... or maybe like a netstat -an | wc -l, etc. Check out this article and others around the web for diagnosing port exhaustion.

http://blogs.technet.com/b/askds/archive/2008/10/29/port-exhaustion-and-you-or-why-the-netstat-tool-...

It very well could be your issue. If so, we can visit the code and see if any condition exists where it doesn't close a connection properly.
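If you want an actual count rather than eyeballing the netstat output, a small helper along these lines (a rough sketch, unrelated to DB Connect itself; it just parses netstat -an, and needs Python 3.7+) will tally connection states so you can watch TIME_WAIT or ESTABLISHED counts creep toward the ~65k limit:

    import subprocess
    from collections import Counter

    def connection_state_counts():
        # 'netstat -an' works on both Windows and Linux; the last column of a TCP line is its state
        output = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
        states = Counter()
        for line in output.splitlines():
            parts = line.split()
            # keep TCP/TCP6 rows only; skip headers, UDP and unix sockets
            if parts and parts[0].upper().startswith("TCP"):
                states[parts[-1]] += 1
        return states

    if __name__ == "__main__":
        for state, count in connection_state_counts().most_common():
            print(state, count)

Run it periodically (e.g. from Task Scheduler or cron) and log the totals; a steadily growing TIME_WAIT or ESTABLISHED count on either the Splunk or the DB server points at connections not being closed.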


jkat54
SplunkTrust

Curious what netstat -an shows on the server. Have you exhausted all 2^16 (65536) ports with TIME_WAITs, etc.?

Also, I second that increasing timeout in web.conf may help.


vistate
Explorer

Exact same issue - I use DB inputs for Sybase on 6 tables at 300-second intervals - runs great for days then just dies.

I have tried increasing the timeouts in web.conf - this helped a little bit.

I am running this on my dev box - Windows 7, Splunk x64 6.3.2.
