All Apps and Add-ons

Splunk app for NetApp, hydra_gateway: Broken pipe

bochmann
Path Finder

I've been trying to whip up a quick proof-of-concept installation of the NetApp app on our existing Splunk Enterprise instance (running on Debian/wheezy)... Unfortunately, the data collector doesn't actually seem to want to connect to our cDOT systems - I do see connections from the search head as I add ONTAP collection targets in the app settings, though - presumably checking credentials. (I have the Splunk App for NetApp Data ONTAP 2.0.2 on Splunk 6.2.1)

I set up an additional Splunk heavy forwarder on a CentOS box for the data collector. The scheduler still runs on our Debian search head.

Right now, on the CentOS data collector, there is a Socket error visible in splunkd.log:

12-20-2014 14:46:21.238 +0100 INFO  ExecProcessor - New scheduled exec process: python /opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap_collection_worker.py
12-20-2014 14:46:21.238 +0100 INFO  ExecProcessor -     interval: run once
12-20-2014 14:46:21.239 +0100 INFO  ExecProcessor - New scheduled exec process: /opt/splunk/bin/splunkd instrument-resource-usage
12-20-2014 14:46:21.239 +0100 INFO  ExecProcessor -     interval: 0 ms
12-20-2014 14:46:21.925 +0100 WARN  HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/hydra_gateway: Broken pipe

I can't find out where that originates from, though. At the same time, I see the following messages in hydra_scheduler_ta_ontap_collection_scheduler_nidhogg.log on the scheduler:

2014-12-20 14:46:21,286 INFO [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] New meta data is distributed: Owner: admin, Namespace: Splunk_TA_ontap, Name: metadata, Id: /servicesNS/nobody/Splunk_TA_ontap/configs/conf-hydra_metadata/metadata.
2014-12-20 14:46:21,286 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking the status of all nodes
2014-12-20 14:46:21,293 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking health of node=https://172.16.123.12:8089
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] no heads regrown after they cried for help on node=https://172.16.123.12:8089
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] Updated status of active nodes
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] Checked status of dead nodes
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking the status of all nodes
2014-12-20 14:47:16,119 ERROR [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] node=https://172.16.123.12:8089 is likely dead, could not get info on current job count, msg : <urlopen error Tunnel connection failed: 502 cannotconnect>
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 933, in getActiveJobInfo
    job_info = self.gateway_adapter.get_job_info()
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_common.py", line 199, in get_job_info
    resp = self.opener.open(req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error Tunnel connection failed: 502 cannotconnect>
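For what it's worth, "Tunnel connection failed: 502" is what urllib2 reports when it tries to reach a host through an HTTP proxy (via CONNECT) and the proxy refuses. urllib2 silently inherits proxy settings from the process environment, so a proxy exported for splunkd also applies to child Python scripts like the hydra workers. A minimal sketch of that behavior (shown with Python 3's urllib.request, which behaves the same way here; the proxy URL is made up):

```python
import os
import urllib.request

# urllib silently picks up proxy settings from the environment, so any
# https_proxy exported for the splunkd process is also inherited by child
# Python scripts such as the hydra scheduler/worker.
os.environ["https_proxy"] = "http://proxy.example.com:3128"  # hypothetical proxy

proxies = urllib.request.getproxies()
# Requests to https://127.0.0.1:8089 would now be tunneled through this
# proxy via HTTP CONNECT - and fail if the proxy can't reach loopback.
print(proxies.get("https"))
```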

Any hints where I can look to further debug this error?

Possibly unrelated, I have noticed that there are still two ta_ontap_collection_scheduler.py processes lingering around on the scheduler even after I stop Splunk. Does the scheduler maybe have problems on a Debian host, too (the install docs just note that the data collector has to run on RHEL or CentOS)?

1 Solution

bochmann
Path Finder

Okay, I think I've found the primary problem in my scheduler configuration:

In splunk-launch.conf, I had settings for http_proxy and https_proxy to make Splunk connect to apps.splunk.com through our proxy server.

This seems to have confused some of the NetApp app scripts on the scheduler - I have removed the proxy configuration, and now my data collector picks up perf data from our filer.
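Removing the proxy settings entirely worked here. If the proxy is needed for app downloads, another option (an assumption on my part, not something the app documents) would be for loopback REST calls to bypass environment proxies, which urllib supports via an explicit empty ProxyHandler:

```python
import urllib.request

# An opener built with an empty ProxyHandler ignores http(s)_proxy from
# the environment entirely, so loopback REST calls would go direct.
no_proxy_opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({})  # empty mapping: never use a proxy
)
# no_proxy_opener.open("https://127.0.0.1:8089/...")  # would connect directly
```

Setting no_proxy=127.0.0.1,localhost in the same environment as the proxy variables should have a similar effect, since urllib honors no_proxy when deciding whether to tunnel.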


bochmann
Path Finder

Throwing strace at the "HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/hydra_gateway: Broken pipe" message on the data collector seems to suggest this happens in a splunkd thread where it tries to write data to an incoming connection (fd 4 is splunkd listening on port 8089):

13:10:31.083367 accept(4, {sa_family=AF_INET, sin_port=htons(48132), sin_addr=inet_addr("127.0.0.1")}, [16]) = 75
13:10:31.083504 setsockopt(75, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[..]
13:10:31.092568 read(75, "7\10M\n\23\233\204\333I\tfx~qp*\251"..., 352) = 352
[..]
13:10:37.913122 write(75, "\27\3\3\3\20\232\200v\363ob:I\24^\230HV"..., 789) = -1 EPIPE (Broken pipe)
13:10:37.913348 --- SIGPIPE (Broken pipe) @ 0 (0) ---
13:10:37.913605 write(3, "12-22-2014 13:10:37.913 +0100 WARN  HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/"..., 155) = 155
13:10:37.913785 epoll_ctl(44, EPOLL_CTL_DEL, 75, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRBAND|EPOLLERR|EPOLLHUP|0xb1fd800, {u32=32761, u64=801602388203962361}}) = 0
13:10:37.913910 close(75) = 0
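The EPIPE pattern in the strace is the generic "peer closed before we wrote the reply" failure. It can be reproduced with a plain socket pair (purely illustrative, not the hydra code):

```python
import socket

# Reproduce the EPIPE seen in the strace: the peer (here `client`)
# goes away before the server writes its reply.
server, client = socket.socketpair()
client.close()                       # client dies, like the SIGTERM'd worker
try:
    for _ in range(5):               # a write to a closed peer soon fails
        server.sendall(b"reply data" * 100)
except BrokenPipeError:
    print("EPIPE: Broken pipe")      # what splunkd's HttpListener logged
finally:
    server.close()
```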

[edit]
From strings exchanged early in the connection I assume the other endpoint is one of the ta_ontap_collection_worker.py processes. But that one gets terminated a few seconds before receiving this answer (which explains the broken pipe above):

13:10:31.092227 write(4, "\27\3\3\1`7\10M\n\23\233\204\333I\tfx~qp*\251"..., 357) = 357
13:10:31.092437 poll([{fd=4, events=POLLIN}], 1, 30000) = ? ERESTART_RESTARTBLOCK (To be restarted)
13:10:32.260084 --- SIGTERM (Terminated) @ 0 (0) ---

Guess that's as far as I get right now, Christmas is coming up fast 🙂


Masa
Splunk Employee

At one point this app was supported only on CentOS and RedHat due to a problem with the Hydra feature or the Splunk core platform. The current documentation no longer seems to mention that restriction.

Here is an old workaround that worked in the past. I'm not sure whether it applies in your case, but it is still worth trying.

To work around this issue, disable "dash" as the default shell on the system:

sudo dpkg-reconfigure dash

You'll get a prompt asking:

"Use dash as the default system shell?"

Answer no. You may need to restart Hydra or Splunk itself for the issue to be fully resolved.

For additional information on "dash", you can visit the Ubuntu wiki here:

https://wiki.ubuntu.com/DashAsBinSh
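To check whether dash is currently the default and to verify the change afterwards, something like this should work (paths are the usual Debian/Ubuntu defaults):

```shell
# Check what /bin/sh currently points to (dash on stock Debian/Ubuntu):
readlink -f /bin/sh          # e.g. /usr/bin/dash

# Switch the system default back to bash (answer "No" at the prompt):
sudo dpkg-reconfigure dash

# Verify the symlink now resolves to bash:
readlink -f /bin/sh
```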

Masa
Splunk Employee

Also, please make sure port 8080 is open at the DCN


bochmann
Path Finder

Thanks for your answer and the dash hint. Current documentation for the NetApp app still says: "To build a data collection node: Install a CentOS or RedHat Enterprise Linux version that is supported by Splunk version 5.0.4." (http://docs.splunk.com/Documentation/NetApp/2.0.2/DeployNetapp/InstalltheSplunkAppforNetAppDataONTAP...)

I see nothing listening on port 8080 on the DCN. Which process is supposed to use that?


Masa
Splunk Employee

My bad. It was supposed to be 8008 🙂
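A quick way to confirm whether anything is answering on that port on the DCN is a plain TCP connect check, e.g. in Python (the port number follows the correction above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On the DCN, the hydra gateway should answer locally:
print(port_open("127.0.0.1", 8008))
```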
