I'm getting read timeout errors for NetApp data collection using the Splunk App for NetApp Data ONTAP, when collecting from filers over a WAN connection. In the "Connection Problems in the Past Hour" report, I get messages like this:
2016-09-22 07:36:41,912 ERROR [ta_ontap_collection_worker://zeta:4535] [QuotaHandler] Problem collecting ontap:quota data from server=somefiler.somesite.mentorg.com : ('The read operation timed out',)
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/handlers.py", line 129, in run
    results = qa.run()
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/OntapInventory.py", line 106, in run
    return self.run7Mode()
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/OntapInventoryQuota.py", line 20, in run7Mode
    self.aggregate_results(data, 'quota-report-iter-start', secondLevelDictArray)
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/OntapInventory.py", line 74, in aggregate_results
    for x in gen:
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/OntapClient.py", line 521, in query7ModeGen
    response = naElementToDict(self.queryApi(api, OntapClient.projectResponse))
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/OntapClient.py", line 418, in queryApi
    response = self.connection.invoke_elem(api)
  File "/opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap/NetApp/NaServer.py", line 483, in invoke_elem
    xml_response = response.read()
  File "/opt/splunk/lib/python2.7/httplib.py", line 593, in read
    s = self.fp.read()
  File "/opt/splunk/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/opt/splunk/lib/python2.7/ssl.py", line 734, in recv
    return self.read(buflen)
  File "/opt/splunk/lib/python2.7/ssl.py", line 621, in read
    v = self._sslobj.read(len or 1024)
SSLError: ('The read operation timed out',)
The problem happens with filers at multiple sites, but only occasionally; most of the time the data is collected as expected. WAN saturation does not appear to be an issue.
The lowest-level configurable timeout I could find that (apparently) applies to the SSL connection is in Splunk_TA_ontap/bin/ta_ontap/NetApp/OntapClient.py, which has a CONNECTION_TIMEOUT=30 setting. I have changed that to 60 and will see if it alleviates the issue.
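To illustrate the kind of timeout involved, here is a minimal sketch of how a Python socket read timeout behaves. It assumes CONNECTION_TIMEOUT ends up as an ordinary socket timeout, which I have not confirmed in the TA code. The read_with_timeout helper and the local socket pair standing in for the DCN-to-filer connection are made up for the example; nothing here comes from Splunk_TA_ontap.

```python
import socket


def read_with_timeout(sock, timeout_s, bufsize=1024):
    """Return received bytes, or None if nothing arrives within timeout_s.

    Hypothetical helper for illustration only, not part of the TA.
    """
    # The timeout applies to each blocking call (here, one recv),
    # not to the transfer as a whole.
    sock.settimeout(timeout_s)
    try:
        return sock.recv(bufsize)
    except socket.timeout:
        return None


# Local stand-in for the DCN-to-filer connection; no network is needed.
dcn_side, filer_side = socket.socketpair()

# The peer stays silent, so the read gives up once the timeout expires.
silent = read_with_timeout(dcn_side, 0.2)

# The peer answers, so the same read returns without waiting out the timeout.
filer_side.sendall(b"<results/>")
answered = read_with_timeout(dcn_side, 0.2)

dcn_side.close()
filer_side.close()
```

If the setting does work this way, raising it from 30 to 60 only extends how long a single read waits for the filer to send more data; it would not help if the filer takes longer than that to produce the next chunk of a large quota report.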
We have filers at 30+ sites, but the DCNs are all in our primary datacenter. Would it be better to have local DCNs and not collect data over the WAN? If I create a bunch of site-local DCNs, will the scheduler automatically know which DCN is "closest" to a filer (or can that be manually configured)?
Thanks!
Lee