Splunk Search

Stability issues with db connect 2 'The read operation timed out'

ManfredGrill
Explorer

Hi,
I'm running Splunk 6.3.1 with DB Connect 2.0.6. Splunk was updated two days ago, but this problem already showed up with earlier versions of Splunk Enterprise.

I'm monitoring some tables in an Oracle database, remotely from the Splunk server.
The connection works fine for some days and all data is captured as expected.
Then, at some point 20-50 hours after a restart of the Splunk service, DB Connect can no longer connect to splunkd.

From dbx.log:
11/27/2015 10:04:21 [ERROR] [init.py] Socket error communicating with splunkd (error=('The read operation timed out',)), path = https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db
11/27/2015 10:04:23 [INFO] [mi_base.py] Caught exception Splunkd daemon is not responding: (u"Error connecting to https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db: ('The read operation timed out',)",) in modular input mi_input://OBJECTS_VIEW. Retrying 1 of 6.

I've enabled debugging for DB Connect but cannot find any hint about this problem, and the Splunk log has no further info either. Splunk itself seems to work fine. Once I restart the Splunk service, DB Connect works again for some time.
Most data is captured via tail inputs; the amount of data isn't that big.
Once the problem starts I can no longer manage DB Connect via the Splunk web interface, and after restarting the Splunk service, all of DB Connect's data inputs are disabled.
The Windows task list shows multiple open Python interpreters when this happens.

How can I dig deeper?
Any suggestions are welcome.
Thanks in advance
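One way to dig deeper is to separate "splunkd stopped answering its management port" from "DB Connect broke". The sketch below is a hypothetical Python 3 watchdog (not part of DB Connect) that probes the splunkd management port with a short timeout; logging its output over time shows exactly when the hang starts, independently of the modular inputs:

```python
import socket
import ssl
import urllib.error
import urllib.request

def splunkd_responds(url="https://127.0.0.1:8089/services/server/info",
                     timeout=5.0):
    """Return True if splunkd answers on the management port within `timeout`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # splunkd ships a self-signed certificate
    try:
        urllib.request.urlopen(url, timeout=timeout, context=ctx)
        return True
    except urllib.error.HTTPError:
        # 401 Unauthorized still proves splunkd is answering the socket
        return True
    except (urllib.error.URLError, socket.timeout, OSError):
        return False

if __name__ == "__main__":
    print("splunkd responding:", splunkd_responds(timeout=5.0))
```

Scheduled every minute (Task Scheduler on Windows), the timestamps of the first False tell you how long after a restart the problem actually begins.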


ManfredGrill
Explorer

The system has been running stably since 23/02/2016 (knock on wood).

These are the steps I've taken since my last update to this thread:

  • Monitored the TCP connections for some weeks: the Splunk server had around 80-100 open, the Oracle server 310-330
  • Splunk Update to 6.3.3
  • DB Connect 2 Update to 2.1.3
  • Changed splunkdConnectionTimeout in d:\splunk\etc\system\default\web.conf from 30s to 60s

I can't say for sure what fixed the issue; it was either one of the updates or the splunkdConnectionTimeout change.
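For reference, the timeout override above looks like this in web.conf. Note that Splunk's convention is to put overrides under etc/system/local rather than etc/system/default, since local/ survives upgrades; the 60-second value simply mirrors what was tried here, with no guarantee it is the actual fix:

```ini
# $SPLUNK_HOME/etc/system/local/web.conf
# (local/ overrides survive upgrades; edits under default/ do not)
[settings]
splunkdConnectionTimeout = 60
```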

Let us know of your findings.


jkat54
SplunkTrust

Let's mark this as the answer then if you don't mind. If you think my comments helped, feel free to vote on them as well.


ManfredGrill
Explorer

I had already changed splunkdConnectionTimeout in web.conf from 30s to 60s; this did not really help.
Sometimes the system runs fine for two weeks, but yesterday the problem showed up again. Unfortunately I missed checking whether the servers were running out of ports; I will keep an eye on those values from now on.
Actions I took since the problem started:
- updated DB Connect 2 to version 2.1.1
- updated Splunk to version 6.3.2
- verified that ojdbc.jar matches the Oracle version in use


chrisboy68
Contributor

I have run into that too. I had to set up an alert just to tell me DB Connect 2 is having issues. Sometimes it happens twice a day; other times I can be up for weeks.
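An alert like that can be defined as a scheduled search over Splunk's internal index. The fragment below is a hypothetical example; the source path, schedule, and email address are assumptions, and the exact source/sourcetype of the DB Connect logs may differ in your environment:

```ini
# $SPLUNK_HOME/etc/apps/search/local/savedsearches.conf
# Hypothetical alert: fire when DB Connect logs the timeout error
[DB Connect socket timeout alert]
search = index=_internal source=*dbx.log* "The read operation timed out"
dispatch.earliest_time = -15m
dispatch.latest_time = now
cron_schedule = */15 * * * *
enableSched = 1
alert_type = number of events
alert_comparator = greater than
alert_threshold = 0
action.email = 1
action.email.to = you@example.com
```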

Chris


tjohnson341
Explorer

We have been seeing this issue for several months now. Socket errors are now also showing up on non-DBConnect REST calls. We have tried all of the suggested solutions with no improvement yet. Any update on your system's status?


vistate
Explorer

Hi jkat54 - when I run netstat -an while my server is still running (no error yet), I see I am at around port 62731. Can I assume it will begin to screw up once I hit port 65535?


jkat54
SplunkTrust

You can get into what is known as port exhaustion, where you simply run out of TCP/IP ports. There are only 65535 of them: (2^16)-1, since the port field is 16 bits.

The reason I mention port exhaustion is that the problem takes a random amount of time to appear, and only after your environment has been up and running for a while. When you run out of ports you'll see all sorts of connection errors. Sometimes it happens during high traffic and then resolves itself as connections subside; other times the connections never close, and after ~65k have been made, exhaustion occurs (faulty code usually causes this condition). I should mention that netstat should be run on the DB server too, not just Splunk. Either server could be failing to close connections in time or properly, or may need to keep them open for other tasks you've set up.

There are other netstat commands that will show the actual count, or something like netstat -an | wc -l. Check out this article and others around the web for diagnosing port exhaustion.

http://blogs.technet.com/b/askds/archive/2008/10/29/port-exhaustion-and-you-or-why-the-netstat-tool-...

It very well could be your issue. If so, we can look at the code and see if any condition exists where it doesn't close a connection properly.
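To make the netstat diagnosis above repeatable, here is a small Python 3 sketch (a hypothetical helper, assuming `netstat -an`-style output on Windows or Linux) that counts connections per TCP state, so you can watch TIME_WAIT and ESTABLISHED counts grow over time:

```python
import re
from collections import Counter

def connection_states(netstat_output):
    """Count TCP connections per state in `netstat -an`-style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # TCP lines (Windows "TCP", Linux "tcp"/"tcp6") end in a state name
        if parts and parts[0].upper().startswith("TCP"):
            state = parts[-1]
            if re.fullmatch(r"[A-Z_0-9]+", state):
                states[state] += 1
    return states

sample = """\
  TCP    10.0.0.5:54021    10.0.0.9:1521    ESTABLISHED
  TCP    10.0.0.5:54022    10.0.0.9:1521    TIME_WAIT
  UDP    0.0.0.0:500       *:*
"""
print(connection_states(sample))
```

Feeding it the real output of `netstat -an` (e.g. via `subprocess.run(["netstat", "-an"], ...)`) on both the Splunk and the Oracle server every few minutes would show whether either side is creeping toward exhaustion.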


jkat54
SplunkTrust

Curious what netstat -an shows on the server. Have you exhausted all 2^16 - 1 (65535) ports with TIME_WAITs, etc.?

Also, I second that increasing the timeout in web.conf may help.


vistate
Explorer

Exact same issue - I use DB inputs for Sybase on 6 tables at 300-second intervals - it runs great for days, then just dies.

I have tried increasing the timeouts in web.conf - this helped a little bit.

I am running this on my dev box - Windows 7, Splunk x64 6.3.2
