I'm running Splunk 6.3.1 with DB Connect 2.0.6. Splunk was updated two days ago; this problem already showed up with earlier versions of Splunk Enterprise.
I'm monitoring some tables in an Oracle database, remotely from the Splunk server.
The connection works fine for some days. All data is captured as expected.
At some point DB Connect can no longer connect to the splunkd service. This starts multiple hours (20-50 hours) after I restart the Splunk service.
11/27/2015 10:04:21 [ERROR] [init.py] Socket error communicating with splunkd (error=('The read operation timed out',)), path = https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db
11/27/2015 10:04:23 [INFO] [mi_base.py] Caught exception Splunkd daemon is not responding: (u"Error connecting to https://127.0.0.1:8089/servicesNS/admin/splunk_app_db_connect/db_connect/connections/PS_diva_db: ('The read operation timed out',)",) in modular input mi_input://OBJECTS_VIEW. Retrying 1 of 6.
I've enabled debugging for DB Connect, but cannot find any hint about this problem. The Splunk log has no further info either. Splunk itself seems to work fine. Once I restart the Splunk service, DB Connect works again for some time.
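One quick check when DB Connect starts timing out is whether splunkd's management port still answers at all. A minimal sketch (the credentials are placeholders; the endpoint is splunkd's standard server-info REST call):

```shell
# Hit splunkd's REST endpoint directly. If this also hangs or times out,
# the problem is splunkd itself rather than DB Connect.
# Credentials below are placeholders.
curl -sk -u admin:changeme --max-time 30 \
  https://127.0.0.1:8089/services/server/info \
  || echo "splunkd not responding"
```

If this returns XML promptly while DB Connect still times out, the bottleneck is more likely in the app's own connection handling.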
Most data is captured via tail. The amount of data isn't that big.
Once this problem starts I can no longer manage DB Connect via the Splunk web interface. After restarting the Splunk service, the DB Connect data inputs are all disabled.
The Windows task list shows multiple Python interpreters running when this happens.
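Counting those leftover interpreters over time would show whether processes are leaking between restarts. A minimal sketch (on the Windows box described here you would use tasklist, shown in the comment; the process name is an assumption):

```shell
# On Windows: tasklist /FI "IMAGENAME eq python.exe" | find /C "python.exe"
# On a *nix Splunk host, the equivalent count would be:
ps -e | grep -c '[p]ython'
```

A count that only ever grows until the next Splunk restart would point at inputs spawning interpreters that never exit.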
How can I dig deeper?
Any suggestions are welcome.
Thanks in advance
The system has been running stable since 23/02/2016. (Knock on wood.)
These are the steps I've taken since my last update to this thread.
I can't say for sure what fixed the issue: either one of the updates or the splunkdConnectionTimeout setting.
Let us know of your findings.
I had already changed the splunkdConnectionTimeout value in web.conf from 30 s to 60 s. This did not really help.
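For reference, that setting lives in the [settings] stanza of web.conf (under $SPLUNK_HOME/etc/system/local/); a minimal sketch of the change described above:

```ini
[settings]
# Default is 30 seconds; raised to 60 to give splunkd more time
# to answer DB Connect's REST calls before the client gives up.
splunkdConnectionTimeout = 60
```

Splunk Web needs a restart for the change to take effect.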
Sometimes the system runs fine for 2 weeks.
Yesterday the problem showed up again. Unfortunately I missed the hint about the servers running out of ports; I will keep an eye on those values from now on.
Actions I took since the problem started:
- updated DB Connect 2 to version 2.1.1
- updated Splunk to version 6.3.2
- verified that ojdbc.jar matches the Oracle version in use
We have been seeing this issue for several months now. Socket errors are now also showing up on non-DBConnect REST calls. We have tried all of the suggested solutions with no improvement yet. Any update on your system's status?
You can run into what is known as port exhaustion, where you simply run out of TCP/IP ports. There are only 65535 of them, i.e. 2^16 - 1 (16-bit port numbers).
The reason I mention port exhaustion is that the problem takes a random amount of time to appear, and only after your environment has been up and running for a while. When you run out of ports you'll see all sorts of connection errors. Sometimes it happens during high traffic and then resolves itself as connections subside; sometimes connections never close, and after ~65k have been opened, exhaustion occurs (faulty code usually causes this condition). I should mention that netstat should be run on the DB server too, not just the Splunk server. Either server could be failing to close connections in time or properly, or may need to keep them open for other tasks you've got set up.
There are other netstat commands that will show the actual count, for example:
netstat -an | wc -l
Check out this article and others around the web for diagnosing port exhaustion.
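As a concrete sketch of the counting approach above (assuming a *nix host; on Windows, netstat -an piped through find /c would give a similar count):

```shell
# Total sockets netstat reports, across all states:
netstat -an | wc -l

# Rough per-state breakdown (the last field of each TCP line is its state);
# a large, growing TIME_WAIT or ESTABLISHED count between Splunk restarts
# is the classic port-exhaustion signature:
netstat -an | awk '{print $NF}' | sort | uniq -c | sort -rn
```

Running this periodically on both the Splunk server and the DB server and logging the numbers makes it easy to correlate a climbing count with the time DB Connect starts failing.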
It very well could be your issue. If so, we can look at the code and see whether any condition exists where it doesn't close a connection properly.
Exact same issue: I use DB inputs for Sybase on 6 tables at 300-second intervals. It runs great for days, then just dies.
I have tried increasing the timeouts in web.conf; this helped a little.
I am running this on my dev box: Windows 7, Splunk x64 6.3.2.