How to set up a basic Hadoop cluster

This article is a guide to setting up a Hadoop cluster. The cluster runs on local CentOS virtual machines using VirtualBox. I use this setup as a local environment for development and testing.

I followed many of the steps Austin Ouyang laid out in his blog post. Next, I hope to document moving these virtual machines to a cloud provider.

Prerequisites

This guide assumes you are using the following software versions.

  • OS X 10.11.3
  • Vagrant 1.8.5
  • Java 1.8.0 (the JRE is fine; use the JDK to run the MapReduce examples later)
  • Hadoop 2.7.3
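Before creating any VMs, it can help to confirm the host-side tools are on your PATH. This is only a sketch that reports what it finds; it assumes the standard command names (VBoxManage is VirtualBox's CLI, and java here is the host JVM, not the one installed inside the VMs later).

```shell
# Report whether each host-side tool this guide relies on is installed.
for tool in vagrant VBoxManage java; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```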

Here are the steps I used:

  1. First, create a workspace.

    mkdir -p ~/vagrant_boxes/hadoop

    cd ~/vagrant_boxes/hadoop

  2. Next, add a base Vagrant box. I'm using a minimal CentOS box.

    vagrant box add "CentOS 6.5 x86_64" https://github.com/2creatives/vagrant-centos/releases/download/v6.5.3/centos65-x86_64-20140116.box

  3. We are going to create a Vagrant box with the packages we need, so first we initialize it.

    vagrant init -m "CentOS 6.5 x86_64" hadoop_base

  4. Next, change the Vagrantfile to the following:

    Vagrant.configure(2) do |config|
    	config.vm.box = "CentOS 6.5 x86_64"
    	config.vm.box_url = "hadoop_base"
    	config.ssh.insert_key = false
    end
    

  5. Now, install Hadoop and its dependencies.

    vagrant up

    vagrant ssh

    sudo yum install java-1.8.0-openjdk-devel

    sudo yum install wget

    wget -P ~ http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

    gunzip -c *gz | tar xvf -

  6. Still inside the VM, open ~/.bash_profile and append the following lines.

      export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
      export PATH=$PATH:$JAVA_HOME/bin
      export HADOOP_HOME=~/hadoop-2.7.3
      export PATH=$PATH:$HADOOP_HOME/bin
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    

  7. Source the profile.

    source ~/.bash_profile

  8. In /etc/hosts (as root, e.g. with sudo), add the following lines:

      192.168.50.21 namenode.example.com
      192.168.50.22 datanode1.example.com
      192.168.50.23 datanode2.example.com
      192.168.50.24 datanode3.example.com
      192.168.50.25 datanode4.example.com
    

  9. In $HADOOP_CONF_DIR/hadoop-env.sh, replace the default export JAVA_HOME=${JAVA_HOME} line with an explicit path.

      # The java implementation to use.
      export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
    

  10. Edit the $HADOOP_CONF_DIR/core-site.xml file to have the following XML:

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://namenode.example.com:9000</value>
        </property>
      </configuration>
    

  11. Edit the $HADOOP_CONF_DIR/yarn-site.xml file to have the following XML:

      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>namenode.example.com</value>
        </property>
      </configuration>
    

  12. Now, copy the mapred-site.xml file from a template.

    cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml

  13. Edit the $HADOOP_CONF_DIR/mapred-site.xml to have the following XML:

      <configuration>
        <property>
          <name>mapreduce.jobtracker.address</name>
          <value>namenode.example.com:54311</value>
        </property>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
    

  14. Edit the $HADOOP_CONF_DIR/hdfs-site.xml file to have the following XML:

      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/data/hadoop/hdfs/namenode</value>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>/data/hadoop/hdfs/datanode</value>
        </property>
      </configuration>
    

  15. Make the data directories.

    sudo mkdir -p /data/hadoop/hdfs/namenode

    sudo mkdir -p /data/hadoop/hdfs/datanode

    sudo chown -R vagrant:vagrant /data/hadoop

  16. In $HADOOP_CONF_DIR/masters, add the following line:

      namenode.example.com
    

  17. In $HADOOP_CONF_DIR/slaves, add the following lines:

      datanode1.example.com
      datanode2.example.com
      datanode3.example.com
      datanode4.example.com
    

  18. Create a ~/.ssh/config file to disable host key checking for SSH. Since these are dev servers, this is OK. The option under Host * just needs to be indented; spaces or a tab both work.

      Host *
            StrictHostKeyChecking no
    

  19. Now run these commands to finish the password-less authentication.

    chmod 600 ~/.ssh/config

    sudo hostname namenode.example.com

    ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

  20. Exit the SSH session and package the VM so it can be reused for the other Hadoop nodes.

    exit

    vagrant halt

    vagrant package

    vagrant box add hadoop ~/vagrant_boxes/hadoop/package.box

  21. Edit the Vagrantfile to look like the following. This creates five Hadoop nodes from the new box.

      Vagrant.configure("2") do |config|
        config.vm.define "hadoop-namenode" do |node|
          node.vm.box = "hadoop"
          node.vm.box_url = "namenode.example.com"
          node.vm.hostname = "namenode.example.com"
          node.vm.network :private_network, ip: "192.168.50.21"
          node.ssh.insert_key = false
    
          # Start Hadoop
          node.vm.provision "shell", inline: "hdfs namenode -format -force", privileged: false
          node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/start-dfs.sh", privileged: false
          node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/start-yarn.sh", privileged: false
          node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/mr-jobhistory-daemon.sh start historyserver", privileged: false
        end
    
        (1..4).each do |i|
          config.vm.define "hadoop-datanode#{i}" do |node|
            node.vm.box = "hadoop"
            node.vm.box_url = "datanode#{i}.example.com"
            node.vm.hostname = "datanode#{i}.example.com"
            node.vm.network :private_network, ip: "192.168.50.2#{i+1}"
            node.ssh.insert_key = false
          end
        end
      end
    

  22. Bring the new Vagrant VMs up.

    vagrant up --no-provision

  23. Provision the VMs; this runs the provisioners that start Hadoop on the namenode.

    vagrant provision
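The host-naming scheme used in steps 8, 16, 17 and 21 is uniform (the namenode at 192.168.50.21, datanodeN at .2(N+1)), so the /etc/hosts entries and the masters/slaves files can be generated rather than typed by hand. A sketch, writing into a temp directory so it runs outside the VMs; inside a VM you would target /etc/hosts and $HADOOP_CONF_DIR instead:

```shell
# Generate the cluster host entries plus the masters/slaves files
# from the naming scheme used throughout this guide.
conf=$(mktemp -d)

echo "namenode.example.com" > "$conf/masters"

echo "192.168.50.21 namenode.example.com" > "$conf/hosts"
for i in 1 2 3 4; do
  echo "192.168.50.2$((i+1)) datanode${i}.example.com" >> "$conf/hosts"
  echo "datanode${i}.example.com" >> "$conf/slaves"
done

cat "$conf/hosts"
```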

To check whether Hadoop is working, you can do the following.

First, from your local machine, you should be able to access the web UI (http://192.168.50.21:50070/). You should see 4 live nodes running. Follow the MapReduce Tutorial to test your cluster further.
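As an aside, the XML edits in steps 10 through 14 can also be scripted instead of made in an editor. A sketch for core-site.xml using a heredoc; a temp directory stands in for $HADOOP_CONF_DIR here so the snippet runs anywhere, and the other config files can be generated the same way:

```shell
# Write core-site.xml non-interactively, with the same content as step 10.
cdir=$(mktemp -d)
cat > "$cdir/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
EOF
grep "<value>" "$cdir/core-site.xml"
```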

The main overview screen of the Hadoop admin console.