
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)

As per the Hadoop official site, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

1. PREREQUISITES

1.1. Install Java on master and slaves

Since Apache Hadoop is a Java framework, Java must be installed on each machine in order to run it. Hadoop supports all Java versions greater than 5 (i.e. Java 1.5). It should be installed on the master and all slaves. Before installing, always check /usr/lib/jvm and run java -version to verify whether an older installation exists.

$ java -version

To install Java, use the following commands (you can also use Java 7, OpenJDK, etc.):

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
$ sudo update-java-alternatives -s java-8-oracle

N.B.: If you have JDK 1.6 (corresponding to Java 6) or a newer version installed, you should have a program named jrunscript in your PATH. You can use it to find the corresponding JAVA_HOME. Example:

$ jrunscript -e
'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
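
If jrunscript is not available, another common way to locate the installation (a sketch, assuming java is on the PATH and installed under /usr/lib/jvm) is to resolve the java binary; JAVA_HOME is then the directory above bin/ (or jre/bin/):

# Print the real path of the java binary
$ readlink -f $(which java)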

1.2. Modify hostname (master and slaves)

To modify the hostname on the master and slaves, edit /etc/hostname; a reboot is needed for the change to take effect.
Example:

$ sudo nano /etc/hostname
master
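
After the reboot, you can confirm that the new name took effect:

$ hostname
master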

1.3. Modify hosts (master and slaves)

This is a simple text file that associates IP addresses with hostnames, one line per IP address. Modify it by adding the IPs of all the machines in your cluster.

$ sudo nano /etc/hosts
192.168.31.142 master localhost
192.168.31.143 slave-1
192.168.31.144 slave-2
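
An optional quick check that name resolution works from each node is to ping every hostname you added:

# Each hostname should resolve to the IP configured above
$ ping -c 1 slave-1
$ ping -c 1 slave-2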

1.4. Create a dedicated Hadoop user (master and slaves)

Create a dedicated Hadoop user for accessing HDFS and MapReduce on the master and slaves.

# Create hadoopgroup
$ sudo addgroup hadoopgroup
# Create hadoopuser user
$ sudo adduser hadoopuser
# Add hadoopuser to hadoopgroup
$ sudo usermod -g hadoopgroup hadoopuser
# Delete the default group created by adduser
$ sudo groupdel hadoopuser
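
To verify the result (optional), check the user's group membership:

# hadoopuser should now belong to hadoopgroup
$ id hadoopuser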

1.5. Disable IPv6 (master and slaves)

Since Hadoop does not work over IPv6, it should be disabled. Another reason is that Hadoop has been developed and tested on IPv4 stacks, so the cluster nodes will communicate over IPv4. (Once you have disabled IPv6 on your machine, you need to reboot it for the change to take effect. If you don't know how to reboot from the command line, use sudo reboot.)
To disable IPv6 on your Linux machine, update /etc/sysctl.conf by adding the following lines at the end of the file:

$ sudo gedit /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
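
To apply the settings without waiting for a reboot and confirm that IPv6 is off (an optional check), reload sysctl and read the kernel flag:

$ sudo sysctl -p
# Should print 1 once IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6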

1.6. Install SSH (master and slaves)

#install openssh-server
$ sudo apt install openssh-server
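
You can verify that the SSH daemon is running before moving on (optional):

# The ssh service should be reported as running/active
$ sudo service ssh status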

1.7. Configure SSH (master only)

To manage (start/stop) all nodes of the master-slave architecture, hadoopuser (the Hadoop user of the master node) needs to be able to log in to all slave nodes as well as the master node itself, which is made possible by setting up passwordless SSH login.
(If you do not set this up, you will need to provide a password whenever daemons are started or stopped on slave nodes from the master node.)
On master only:

# Login as hadoopuser
$ su - hadoopuser
# Set up SSH keys:
# Generate an SSH key for the user
$ ssh-keygen -t rsa -P ""
# Authorize the key to enable passwordless ssh
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 /home/hadoopuser/.ssh/authorized_keys
$ ssh-copy-id -i ~/.ssh/id_rsa.pub master
# Copy this key to slave-1 to enable passwordless ssh
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
# Make sure you can do a passwordless ssh using the following command
$ ssh slave-1
# End the current ssh connection
$ exit
# Repeat for every slave. Example:
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-2
# Always verify that you can do a passwordless ssh using the following command
$ ssh slave-2

1.8. Install Hadoop Files (master and slaves)

Download Hadoop from the official site: http://hadoop.apache.org/. Try a stable Hadoop release to get the latest features as well as recent bug fixes to the Hadoop source.
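
For example, a release archive can be fetched directly from the Apache archive (the URL below assumes version 2.7.3 and this particular mirror; adjust both as needed):

$ cd /home/hadoopuser
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz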
NB: Be careful; if you copy the archive from another user, never forget to change its owner.

$ sudo chown hadoopuser:hadoopgroup /home/hadoopuser/hadoop-2.7.3.tar.gz

Choose the location where you want to place your Hadoop installation; here I use hadoop-2.7.3.tar.gz and /home/hadoopuser:

$ cd /home/hadoopuser
$ tar xvf hadoop-2.7.3.tar.gz
$ mv hadoop-2.7.3 hadoop

1.9. Configure .bashrc file (master and slaves)

#Update your .bashrc by adding the following at the end:
$ nano ~/.bashrc
export HADOOP_HOME=/home/hadoopuser/hadoop
# Adjust to match your installed JDK (e.g. /usr/lib/jvm/java-8-oracle if you installed Oracle Java 8 in step 1.1)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
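
After saving the file, you can reload the environment and check that the hadoop command is found (optional; assumes the paths exported above):

$ source ~/.bashrc
$ hadoop version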

2. CONFIGURE HADOOP FILES

Before we start getting into configuration details, let's discuss some of the basic terminology used in Hadoop.
– Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. If you compare HDFS to traditional storage structures (e.g. FAT, NTFS), then the NameNode is analogous to a directory node structure, and the DataNode is analogous to the actual file storage blocks.
– Hadoop YARN: A framework for job scheduling and cluster resource management.
– Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Go to etc/hadoop in your Hadoop home folder:

$ cd /home/hadoopuser/hadoop/etc/hadoop/

2.1. Configure hadoop-env.sh (master and slaves)

#Update hadoop-env.sh on Master and Slave Nodes
$ nano /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh
# Adjust to match your installed JDK (e.g. /usr/lib/jvm/java-8-oracle)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

2.2. Configure core-site.xml (master and slaves)

#Add/update core-site.xml on Master and Slave nodes with the following options.
$ nano core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoopuser/tmp</value>
    <description>Temporary Directory.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:54310</value>
    <description>Use HDFS as file storage engine</description>
  </property>
</configuration>
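
Since hadoop.tmp.dir points to /home/hadoopuser/tmp, it can help to create that directory up front with the right owner (an optional precaution; the path matches the value above):

$ mkdir -p /home/hadoopuser/tmp
$ sudo chown hadoopuser:hadoopgroup /home/hadoopuser/tmp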

2.3. Configure mapred-site.xml (master only)

#Add/update mapred-site.xml on Master node only with the following options.
$ cp mapred-site.xml.template mapred-site.xml
$ nano mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The framework for running mapreduce jobs</description>
  </property>
</configuration>

2.4. Configure hdfs-site.xml (Master and slaves)

dfs.replication – Here I am using a replication factor of 2. That means for every file stored in HDFS, there will be one redundant replica of that file on some other node in the cluster.
dfs.namenode.name.dir – This directory is used by the NameNode to store its metadata files.
dfs.datanode.data.dir – This directory is used by the DataNode to store HDFS data blocks.

#Add/update hdfs-site.xml on Master and Slave Nodes. We will be adding the following three entries to the file.
$ nano hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified at create time.
    </description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoopuser/hadoop/hadoop_data/hdfs/namenode</value>
    <description>Determines where on the local filesystem the DFS name node
    should store the name table (fsimage). If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy.
    </description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoopuser/hadoop/hadoop_data/hdfs/datanode</value>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks. If this is a comma-delimited list of
    directories, then data will be stored in all named directories,
    typically on different devices. Directories that do not exist are
    ignored.
    </description>
  </property>
</configuration>
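
The NameNode and DataNode directories referenced above are not part of the Hadoop archive, so it can help to create them ahead of time with the right owner (an optional precaution; the paths match the values above):

$ mkdir -p /home/hadoopuser/hadoop/hadoop_data/hdfs/namenode
$ mkdir -p /home/hadoopuser/hadoop/hadoop_data/hdfs/datanode
$ sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/hadoop/hadoop_data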

2.5. Configure yarn-site.xml (master and slaves)

#Add/update yarn-site.xml on Master and Slave Nodes. This file is required for a Node to work as a Yarn Node.
$ nano yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
</configuration>

2.6. Configure slaves (master only)

#Add/update the slaves file on Master node only. Add just the names, or IP addresses, of the master and all slave nodes, one per line.
#If the file has an entry for localhost, you can remove it.
$ nano slaves
master
slave-1
slave-2

2.7. Format Namenode (master only)

#Format the Namenode
#Before starting the cluster, we need to format the Namenode. Use the
#following command only on the master node:
$ hdfs namenode -format

2.8. Start all Hadoop daemons (master only)

#Run the following command to start the DFS.
$ start-dfs.sh
#The output of this command should list NameNode, SecondaryNameNode and
#DataNode on the master node, and DataNode on all slave nodes.
#If you don't see the expected output, review the log files listed in the
#Troubleshooting section.
#Run the following command to start the Yarn mapreduce framework.
$ start-yarn.sh
#In case of an error in hdfs namenode -format: after verifying all the
#configuration steps, never forget to delete hadoop_data before formatting again:
#$ rm -rf /home/hadoopuser/hadoop/hadoop_data
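
To check which daemons are actually running on each node (optional), you can use the jps tool that ships with the JDK; the lists below are what you would typically expect with the configuration above, since the master is also listed in the slaves file:

$ jps
# On the master: NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager (and Jps)
# On a slave: DataNode, NodeManager (and Jps)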

If you wish to track Hadoop MapReduce as well as HDFS, you can also explore the Hadoop web views of the ResourceManager and the NameNode, which are commonly used by Hadoop administrators. Open your default browser and visit the following links from any of the nodes.
For the ResourceManager – http://master:8088
For the NameNode – http://master:50070
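
From the command line, a quick way to confirm that all DataNodes have joined the cluster (optional) is the HDFS admin report:

# Lists capacity and the live DataNodes as seen by the NameNode
$ hdfs dfsadmin -report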

