If you wish to deploy a Hadoop single-node setup, please follow the blog here.

Pre-requisites

Before starting with the Hadoop cluster setup, make sure that every node meets the following pre-requisites:

a)  Any Linux operating system
b)  Sun Java 1.6 or above should already be installed, and the version should be the same across all the nodes. To install Java, you can refer to the installation steps mentioned in the blog.
c)  Secure Shell (ssh) must already be installed and its service (sshd) should be running on all the nodes. Password-less SSH should be configured from the master node to all the slave nodes.
d)  Make sure that all the nodes in the cluster are able to identify one another through hostnames, i.e. every node should be able to resolve the hostnames of all the other nodes.
e)  All nodes in the cluster should have a dedicated Hadoop user / group.

To prepare a node for Hadoop, you can follow the blog here; a short command sketch covering steps (c), (d) and (e) above is given below.
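For reference, the node preparation typically boils down to commands like the following. This is only a minimal sketch; the user name hduser, the group hadoop, and the hostnames / IP addresses are placeholders that you should adapt to your own cluster.

# On every node: create a dedicated Hadoop group and user (step e)
sudo groupadd hadoop
sudo useradd -m -g hadoop hduser
sudo passwd hduser

# On every node: make all hostnames resolvable, e.g. via /etc/hosts (step d)
echo "192.168.1.10   masternode" | sudo tee -a /etc/hosts
echo "192.168.1.11   slavenode1" | sudo tee -a /etc/hosts
echo "192.168.1.12   slavenode2" | sudo tee -a /etc/hosts

# On the master node only: set up password-less SSH to every slave (step c)
su - hduser
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hduser@slavenode1
ssh-copy-id hduser@slavenode2
ssh hduser@slavenode1 hostname    # should log in without asking for a password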

Hadoop Cluster Setup

The Hadoop cluster setup can be summarized as a simple five-step process.

a) Download and extract the Hadoop tarball bundle from the Hadoop repository on the master node.
b) Prepare the Hadoop nodes for cluster setup (password-less SSH from the master to all the slave machines and hostname resolution).
c) Configure the Hadoop environment variables and configuration files.
d) Copy the Hadoop directory to all the slave machines.
e) Format the NameNode and start the DFS & MapReduce services using the Hadoop scripts.

Follow the steps below for the Hadoop cluster setup on Linux machines:

1. Log in to your system and download the Hadoop 1.x bundle (tar.gz) file from the Apache archive (Link for Hadoop 1.2.1).

wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz

 

2. Move the tar file to the home directory of hduser.

mv hadoop-1.2.1.tar.gz /home/hduser

 

3. Extract the contents of the tar file.

tar -xzvf hadoop-1.2.1.tar.gz
cd /home/hduser/hadoop-1.2.1

 

4. Configure the Hadoop environment variables in ~/.bashrc (for Ubuntu) or ~/.bash_profile (for CentOS) using any text editor.

nano ~/.bashrc

Append the following variables in this file:

export HADOOP_HOME="/home/hduser/hadoop-1.2.1/"
export PATH=$PATH:$HADOOP_HOME/bin

After saving the file, run the source command to refresh the values of the environment variables.

source ~/.bashrc
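To verify that the variables have been picked up, you can echo HADOOP_HOME and run the hadoop command that is now on the PATH. The hadoop version command should report 1.2.1 (it needs JAVA_HOME to be visible, either in your shell environment or via step 8 below).

echo $HADOOP_HOME
hadoop version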

 

5. Edit the /home/hduser/hadoop-1.2.1/conf/core-site.xml file and specify the Hadoop HDFS URI (NameNode hostname and its port) in it:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<HostName_NameNode>:9000</value>
        <final>true</final>
    </property>
</configuration>
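Replace <HostName_NameNode> with the hostname of the master node as it is resolved by all nodes (step d of the pre-requisites). For example, if the master node's hostname is masternode (a placeholder name used here for illustration), the value would read:

<value>hdfs://masternode:9000</value>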

 

6. Edit the /home/hduser/hadoop-1.2.1/conf/hdfs-site.xml file and add the Hadoop HDFS properties:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

    <property>
        <name>dfs.name.dir</name>
        <value>/home/hduser/hadoop-1.2.1/hadoop_data/dfs/name</value>
    </property>
  
    <property>
        <name>dfs.data.dir</name>
        <value>/home/hduser/hadoop-1.2.1/hadoop_data/dfs/data</value>
    </property>
</configuration>

 

7. Edit the /home/hduser/hadoop-1.2.1/conf/mapred-site.xml file and specify the host and port for the JobTracker daemon. You can use the master node for both the NameNode and JobTracker services.

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value><HostName_JobTracker>:9001</value>
    </property>
</configuration>

 

8. Hadoop requires the JAVA_HOME environment variable. You can check the value of JAVA_HOME on your system using the following command:

echo $JAVA_HOME

Edit the /home/hduser/hadoop-1.2.1/conf/hadoop-env.sh file and specify the JAVA_HOME for Hadoop.

export JAVA_HOME=<value for JAVA_HOME variable>
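If echo $JAVA_HOME prints nothing, one way to locate the Java installation directory is to resolve the java binary on the PATH and strip the trailing /bin/java. The directory in the export line below is only an example; use whatever path the command reports on your system.

readlink -f $(which java) | sed 's:/bin/java::'
export JAVA_HOME=/usr/lib/jvm/java-6-sun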

 

9. Configure the information for the slave nodes in the /home/hduser/hadoop-1.2.1/conf/slaves file. The file should contain the hostnames or IP addresses of the slave nodes, one per line. That is, each line contains an entry for one slave node.

slavenode1
slavenode2
slavenode3
....
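Before copying the Hadoop directory in the next step, it is a good idea to confirm that password-less SSH works for every host listed in the slaves file. A quick check (using the conf/slaves path from above):

for node in `cat /home/hduser/hadoop-1.2.1/conf/slaves`; do ssh $node hostname; done

Each slave should print its hostname without prompting for a password.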

 

10. Copy the Hadoop directory to all the slave nodes.

for node in `cat <path_for_slaves_file in hadoop_conf_directory>`; do scp -r <hadoop_dir> $node:<parent_directory of hadoop_dir>; done

Replace the placeholders in the above command with valid paths, so that the command looks like:
for node in `cat /home/hduser/hadoop-1.2.1/conf/slaves`; do scp -r /home/hduser/hadoop-1.2.1 $node:/home/hduser; done

 

11. The last step before starting the Hadoop services is to format the NameNode on the master machine. Before executing the format command, make sure that the dfs.name.dir directory on the NameNode and the dfs.data.dir directories on all the slaves do not exist.
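If these directories are left over from an earlier attempt, one way to clear them is shown below. The paths are the dfs.name.dir and dfs.data.dir values configured in hdfs-site.xml above; skip this if the directories were never created.

rm -rf /home/hduser/hadoop-1.2.1/hadoop_data/dfs/name
for node in `cat /home/hduser/hadoop-1.2.1/conf/slaves`; do ssh $node rm -rf /home/hduser/hadoop-1.2.1/hadoop_data/dfs/data; done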

hadoop namenode -format

 

12. Start the Hadoop services using the Hadoop scripts in the /home/hduser/hadoop-1.2.1/bin/ directory.

Service       Command
DFS           start-dfs.sh
MapReduce     start-mapred.sh

Note: If you are using a system directory as the Hadoop directory, then you need to log in to each node and start / stop the services individually using the hadoop-daemon.sh script.

hadoop-daemon.sh <action> <service_name>

The value of <action> can be either start or stop. The <service_name> can be namenode, datanode, jobtracker or tasktracker.
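For example, to start the HDFS and MapReduce daemons manually on a slave node:

hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker

On the master node, the corresponding commands are hadoop-daemon.sh start namenode and hadoop-daemon.sh start jobtracker.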

The output of the jps command (which lists the Java processes running on a system) should include the started services. If a service is missing, you can refer to its logs under the /home/hduser/hadoop-1.2.1/logs directory.
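To check, run jps on the master and on each slave node:

jps

On the master it should typically list NameNode, SecondaryNameNode and JobTracker; on each slave, DataNode and TaskTracker (the exact set depends on which daemons you run on which machines).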

 

13. Browse to the Hadoop HDFS and JobTracker web UIs.

Service        Link
Hadoop HDFS    http://<NameNode machine IP>:50070/
JobTracker     http://<JobTracker machine IP>:50030/