This is my first time trying to set up Hadoop Connect, so I may be making some rookie mistakes, but I've hit two different issues that I can't seem to get around while configuring a new HDFS cluster.
The first issue looks to be some kind of classpath issue while running the hadoop command:
Could not find or load main class org.apache.hadoop.fs.FsShell.
That class is provided by one of the jars installed by Cloudera alongside the CLI, and the command works when run in a terminal, so it looks like a classpath issue. None of the Python code seems to set the classpath differently on purpose, but I'm not that familiar with Python, so there may be some minutiae I'm missing.
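For what it's worth, here is a minimal sketch of the kind of thing I suspect: a subprocess only sees the environment dict it is handed, so a classpath that works in a login shell can be missing when the hadoop CLI is launched from Python. (HADOOP_CLASSPATH is Hadoop's standard variable for extra jars; the parcel path below is an assumption based on the usual CDH layout, and printenv stands in for the hadoop CLI so the sketch is runnable.)

```python
import os
import subprocess

# A child process only inherits the environment it is explicitly given.
# If Splunk's python launches `hadoop` with a trimmed-down env, the
# classpath that works interactively may be absent, which would explain
# "Could not find or load main class org.apache.hadoop.fs.FsShell".
env = dict(os.environ)
# HADOOP_CLASSPATH is Hadoop's hook for extra jars; this parcel path is
# an assumed CDH layout, not something confirmed on my node.
env.setdefault("HADOOP_CLASSPATH", "/opt/cloudera/parcels/CDH/jars/*")

# `printenv` stands in for the hadoop CLI here so the sketch is runnable.
out = subprocess.run(["printenv", "HADOOP_CLASSPATH"],
                     env=env, capture_output=True, text=True)
print(out.stdout.strip())
```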
Unfortunately, I can no longer reproduce that failure, because the second issue now prevents Hadoop Connect from getting that far in the process...
The second issue is that Hadoop Connect can't seem to find the hadoop executable:
Unable to connect to Hadoop cluster 'hdfs://metroid/' with principal 'None': Invalid HADOOP_HOME. Cannot find Hadoop command under bin directory HADOOP_HOME=' /opt/cloudera/parcels/CDH'.
I've configured HADOOP_HOME on the configuration screen to be /opt/cloudera/parcels/CDH. On the same node, these work:
14:17:35 $ ls -l /opt/cloudera/parcels/CDH/bin/hadoop
-rwxr-xr-x 1 root root 621 Aug 30 16:02 /opt/cloudera/parcels/CDH/bin/hadoop
14:17:42 $ /opt/cloudera/parcels/CDH/bin/hadoop
Usage: hadoop [--config confdir] COMMAND
So the executable is there with appropriate permissions, and it works. Just in case the log message was misleading, I also looked in hadooputils.py; on line 35 it pieces the path together as follows:
hadoop_cli = os.path.join(env["HADOOP_HOME"], "bin", "hadoop")
That looks correct as well, so I'm not sure what's going on. The CDH folder is actually a symlink, so just in case Python was getting confused by that, I tried the direct path and got the same failure.
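One detail I noticed while staring at this (an observation, not a confirmed fix): the error quotes HADOOP_HOME=' /opt/cloudera/parcels/CDH' with a space before /opt. If that space is really in the stored setting and not just log formatting, the join on line 35 would produce a path that fails any existence check:

```python
import os

# HADOOP_HOME exactly as quoted in the error message -- note the leading
# space. Whether the space is really stored or just log formatting is an
# assumption worth checking on the configuration screen.
env = {"HADOOP_HOME": " /opt/cloudera/parcels/CDH"}

hadoop_cli = os.path.join(env["HADOOP_HOME"], "bin", "hadoop")
print(repr(hadoop_cli))            # ' /opt/cloudera/parcels/CDH/bin/hadoop'
print(os.path.exists(hadoop_cli))  # False: no path on disk starts with a space

# Stripping the value yields the path that actually works in the shell.
fixed = os.path.join(env["HADOOP_HOME"].strip(), "bin", "hadoop")
print(repr(fixed))                 # '/opt/cloudera/parcels/CDH/bin/hadoop'
```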
Does anyone have a suggestion for how to solve either (or preferably both) of these?
In core-site.xml, what is the value of fs.defaultFS? Normally we see something like hdfs://<ip>:8020.
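If it helps, here is one quick way to pull that value out. The snippet builds a sample core-site.xml so it can run anywhere; on a CDH node you would point CORE_SITE at the real client config, usually /etc/hadoop/conf/core-site.xml (the hdfs://metroid:8020 value is just an illustration based on your cluster name).

```shell
# Sample file so the extraction is demonstrable; on a real node, set
# CORE_SITE=/etc/hadoop/conf/core-site.xml instead.
CORE_SITE=$(mktemp)
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://metroid:8020</value>
  </property>
</configuration>
EOF

# Grab the <value> element that follows the fs.defaultFS <name> element.
grep -A1 '<name>fs.defaultFS</name>' "$CORE_SITE" |
  sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```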
Are you able to access HDFS from the command line? For example, are you able to run the command hadoop fs -ls hdfs://<ip>:8020/users ?