Deployment Architecture

Hadoop Client Node Configuration

soujanyabargavi
New Member

Assume there is a Hadoop cluster with 20 machines. Of those 20 machines, 18 are slave nodes, machine 19 runs the NameNode, and machine 20 runs the JobTracker.

Now I know that the Hadoop software has to be installed on all 20 of those machines.

But my question is: which machine is used to load a file xyz.txt into the Hadoop cluster? Is that client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?


Shankar2677
Loves-to-Learn Lots

I am new to Hadoop, so here is what I understood:

If your data upload is not an actual service of the cluster (which would have to run on an edge node of the cluster), then you can configure your own computer to act as an edge node.

An edge node does not need to be known by the cluster (except for security purposes), since it neither stores data nor runs compute tasks. That is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.

In case it helps someone, here is what I did to connect to a cluster that I don't administer:

  • get an account on the cluster, say myaccount
  • create an account on your computer with the same name: myaccount
  • configure your computer to access the cluster machines (SSH without a passphrase, registered IP, ...)
  • get the Hadoop configuration files from an edge node of the cluster
  • get a Hadoop distribution (e.g., from the Apache Hadoop website)
  • uncompress it where you want, say /home/myaccount/hadoop-x.x
  • set the following environment variables: JAVA_HOME and HADOOP_HOME (/home/myaccount/hadoop-x.x)
  • (if you like) add the Hadoop bin directory to your PATH: export PATH=$HADOOP_HOME/bin:$PATH
  • replace your Hadoop configuration files with those you got from the edge node. With Hadoop 2.5.2, they live in $HADOOP_HOME/etc/hadoop
  • I also had to change a couple of JAVA_HOME values defined in the conf files. To find them, use: grep -r "export.*JAVA_HOME"

Then run hadoop fs -ls /, which should list the root directory of the cluster's HDFS.
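
To make the steps above concrete, here is a minimal shell sketch of the same setup. The paths and the JDK location are example values, not from the original post; adjust them to your own machine and Hadoop version.

    # Edge-node client setup sketch -- example paths, adjust to your layout.
    # Assumes a Hadoop tarball unpacked to /home/myaccount/hadoop-x.x and a copy
    # of the cluster's config files fetched from an existing edge node.
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # wherever your JDK lives
    export HADOOP_HOME=/home/myaccount/hadoop-x.x
    export PATH=$HADOOP_HOME/bin:$PATH

    # Overwrite the default configuration with the cluster's (Hadoop 2.x layout)
    cp /path/to/configs-from-edge-node/* $HADOOP_HOME/etc/hadoop/

    # Find any hard-coded JAVA_HOME exports that need to match this machine
    grep -r "export.*JAVA_HOME" $HADOOP_HOME/etc/hadoop/

    # Sanity check: should list the root directory of the cluster's HDFS
    hadoop fs -ls /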


rdagan_splunk
Splunk Employee

You are correct: a client machine is needed to load the file, and the Hadoop libraries must be installed on that client node.
The client node identifies the Hadoop cluster using the NameNode IP and port. These days the JobTracker/TaskTracker is no longer used, so you will also need the YARN ResourceManager IP and port.
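
As an illustration of what "identifies the cluster" means in practice: the client reads those addresses from the configuration files it was given, chiefly fs.defaultFS in core-site.xml (NameNode) and yarn.resourcemanager.address in yarn-site.xml (ResourceManager). The host names and ports below are placeholders, not values from this thread.

    # Show which NameNode the client-side configuration points at
    hdfs getconf -confKey fs.defaultFS
    # typical output: hdfs://namenode-host:8020   (host and port are examples)

    # Show where the YARN ResourceManager is configured
    grep -A1 "yarn.resourcemanager.address" $HADOOP_HOME/etc/hadoop/yarn-site.xml
    # typical value: resourcemanager-host:8032    (8032 is the default RPC port)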
