All Apps and Add-ons

Splunk app for NetApp, hydra_gateway: Broken pipe

bochmann
Path Finder

I've been trying to whip up a quick proof-of-concept installation of the NetApp app on our existing Splunk Enterprise instance (running on Debian/wheezy)... Unfortunately, the data collector doesn't actually seem to want to connect to our cDOT systems - I do see connections from the search head as I add ONTAP collection targets in the app settings, though - presumably checking credentials. (I have the Splunk App for NetApp Data ONTAP 2.0.2 on Splunk 6.2.1)

I set up an additional Splunk heavy forwarder on a CentOS box for the data collector. The scheduler still runs on our Debian search head.

Right now, on the CentOS data collector, there is a Socket error visible in splunkd.log:

12-20-2014 14:46:21.238 +0100 INFO  ExecProcessor - New scheduled exec process: python /opt/splunk/etc/apps/Splunk_TA_ontap/bin/ta_ontap_collection_worker.py
12-20-2014 14:46:21.238 +0100 INFO  ExecProcessor -     interval: run once
12-20-2014 14:46:21.239 +0100 INFO  ExecProcessor - New scheduled exec process: /opt/splunk/bin/splunkd instrument-resource-usage
12-20-2014 14:46:21.239 +0100 INFO  ExecProcessor -     interval: 0 ms
12-20-2014 14:46:21.925 +0100 WARN  HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/hydra_gateway: Broken pipe

I can't find out where that originates from, though. At the same time, I see the following messages in hydra_scheduler_ta_ontap_collection_scheduler_nidhogg.log on the scheduler:

2014-12-20 14:46:21,286 INFO [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] New meta data is distributed: Owner: admin, Namespace: Splunk_TA_ontap, Name: metadata, Id: /servicesNS/nobody/Splunk_TA_ontap/configs/conf-hydra_metadata/metadata.
2014-12-20 14:46:21,286 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking the status of all nodes
2014-12-20 14:46:21,293 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking health of node=https://172.16.123.12:8089
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] no heads regrown after they cried for help on node=https://172.16.123.12:8089
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] Updated status of active nodes
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] Checked status of dead nodes
2014-12-20 14:46:21,331 DEBUG [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNodeManifest] checking the status of all nodes
2014-12-20 14:47:16,119 ERROR [ta_ontap_collection_scheduler://nidhogg] [HydraWorkerNode] node=https://172.16.123.12:8089 is likely dead, could not get info on current job count, msg : <urlopen error Tunnel connection failed: 502 cannotconnect>
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 933, in getActiveJobInfo
    job_info = self.gateway_adapter.get_job_info()
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_common.py", line 199, in get_job_info
    resp = self.opener.open(req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/opt/splunk/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error Tunnel connection failed: 502 cannotconnect>
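For what it's worth, "Tunnel connection failed: 502" is what urllib2 reports when it tries to reach a host through an HTTP proxy (via CONNECT) and the proxy refuses. urllib2 silently inherits proxy settings from the process environment, so a proxy exported for splunkd also applies to child Python scripts like the hydra workers. A minimal sketch of that behavior (shown with Python 3's urllib.request, which behaves the same way here; the proxy URL is made up):

```python
import os
import urllib.request

# urllib silently picks up proxy settings from the environment, so any
# https_proxy exported for the splunkd process is also inherited by child
# Python scripts such as the hydra scheduler/worker.
os.environ["https_proxy"] = "http://proxy.example.com:3128"  # hypothetical proxy

proxies = urllib.request.getproxies()
# Requests to https://127.0.0.1:8089 would now be tunneled through this
# proxy via HTTP CONNECT - and fail if the proxy can't reach loopback.
print(proxies.get("https"))
```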

Any hints where I can look to further debug this error?

Possibly unrelated, I have noticed that there are still two ta_ontap_collection_scheduler.py processes lingering around on the scheduler even after I stop Splunk. Does the scheduler maybe have problems on a Debian host, too (the install docs just note that the data collector has to run on RHEL or CentOS)?

1 Solution

bochmann
Path Finder

Okay, I think I've found the primary problem in my scheduler configuration:

In splunk-launch.conf, I had settings for http_proxy and https_proxy to make Splunk connect to apps.splunk.com through our proxy server.

This seems to have confused some of the NetApp app scripts on the scheduler - I have removed the proxy configuration, and now my data collector picks up perf data from our filer.
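Removing the proxy settings entirely worked here. If the proxy is needed for app downloads, another option (an assumption on my part, not something the app documents) would be for loopback REST calls to bypass environment proxies, which urllib supports via an explicit empty ProxyHandler:

```python
import urllib.request

# An opener built with an empty ProxyHandler ignores http(s)_proxy from
# the environment entirely, so loopback REST calls would go direct.
no_proxy_opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({})  # empty mapping: never use a proxy
)
# no_proxy_opener.open("https://127.0.0.1:8089/...")  # would connect directly
```

Setting no_proxy=127.0.0.1,localhost in the same environment as the proxy variables should have a similar effect, since urllib honors no_proxy when deciding whether to tunnel.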


bochmann
Path Finder

Throwing strace at the "HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/hydra_gateway: Broken pipe" message on the data collector seems to suggest this happens in a splunkd thread where it tries to write data to an incoming connection (fd 4 is splunkd listening on port 8089):

13:10:31.083367 accept(4, {sa_family=AF_INET, sin_port=htons(48132), sin_addr=inet_addr("127.0.0.1")}, [16]) = 75
13:10:31.083504 setsockopt(75, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[..]
13:10:31.092568 read(75, "7\10M\n\23\233\204\333I\tfx~qp*\251"..., 352) = 352
[..]
13:10:37.913122 write(75, "\27\3\3\3\20\232\200v\363ob:I\24^\230HV"..., 789) = -1 EPIPE (Broken pipe)
13:10:37.913348 --- SIGPIPE (Broken pipe) @ 0 (0) ---
13:10:37.913605 write(3, "12-22-2014 13:10:37.913 +0100 WARN  HttpListener - Socket error from 127.0.0.1 while accessing /services/hydra/hydra_gatekeeper/"..., 155) = 155
13:10:37.913785 epoll_ctl(44, EPOLL_CTL_DEL, 75, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRBAND|EPOLLERR|EPOLLHUP|0xb1fd800, {u32=32761, u64=801602388203962361}}) = 0
13:10:37.913910 close(75) = 0
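The EPIPE pattern in the strace is the generic "peer closed before we wrote the reply" failure. It can be reproduced with a plain socket pair (purely illustrative, not the hydra code):

```python
import socket

# Reproduce the EPIPE seen in the strace: the peer (here `client`)
# goes away before the server writes its reply.
server, client = socket.socketpair()
client.close()                       # client dies, like the SIGTERM'd worker
try:
    for _ in range(5):               # a write to a closed peer soon fails
        server.sendall(b"reply data" * 100)
except BrokenPipeError:
    print("EPIPE: Broken pipe")      # what splunkd's HttpListener logged
finally:
    server.close()
```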

[edit]
From strings exchanged early in the connection I assume the other endpoint is one of the ta_ontap_collection_worker.py processes. But that one gets terminated a few seconds before receiving this answer (which explains the broken pipe above):

13:10:31.092227 write(4, "\27\3\3\1`7\10M\n\23\233\204\333I\tfx~qp*\251"..., 357) = 357
13:10:31.092437 poll([{fd=4, events=POLLIN}], 1, 30000) = ? ERESTART_RESTARTBLOCK (To be restarted)
13:10:32.260084 --- SIGTERM (Terminated) @ 0 (0) ---

Guess that's as far as I get right now, Christmas is coming up fast 🙂


Masa
Splunk Employee

At one point this app was supported only on CentOS and RedHat due to a problem with the Hydra feature or the Splunk core platform. The current documentation no longer seems to mention that restriction.

Here is an old workaround that worked in the past. I'm not sure whether it applies in your case, but it is still worth trying.

To work around this issue, disable "dash" as the default shell on the system:

sudo dpkg-reconfigure dash

You'll get a prompt asking:

"Use dash as the default system shell?"

Answer no. You may need to restart Hydra or Splunk itself for the issue to be fully resolved.

For additional information on "dash", you can visit the Ubuntu wiki here:

https://wiki.ubuntu.com/DashAsBinSh
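To check whether dash is currently the default and to verify the change afterwards, something like this should work (paths are the usual Debian/Ubuntu defaults):

```shell
# Check what /bin/sh currently points to (dash on stock Debian/Ubuntu):
readlink -f /bin/sh          # e.g. /usr/bin/dash

# Switch the system default back to bash (answer "No" at the prompt):
sudo dpkg-reconfigure dash

# Verify the symlink now resolves to bash:
readlink -f /bin/sh
```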

Masa
Splunk Employee

Also, please make sure port 8080 is open at the DCN


bochmann
Path Finder

Thanks for your answer and the dash hint. Current documentation for the NetApp app still says: "To build a data collection node: Install a CentOS or RedHat Enterprise Linux version that is supported by Splunk version 5.0.4." (http://docs.splunk.com/Documentation/NetApp/2.0.2/DeployNetapp/InstalltheSplunkAppforNetAppDataONTAP...)

I see nothing listening on port 8080 on the DCN. Which process is supposed to use that?


Masa
Splunk Employee

My bad. It was supposed to be 8008 🙂
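A quick way to confirm whether anything is answering on that port on the DCN is a plain TCP connect check, e.g. in Python (the port number follows the correction above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On the DCN, the hydra gateway should answer locally:
print(port_open("127.0.0.1", 8008))
```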
