Apache Hadoop is an implementation of MapReduce, and Apache YARN is the job scheduler for Hadoop. YARN takes care of running your Hadoop jobs in parallel on all of your worker nodes.
You can run the Quobyte services on the same machines as the Hadoop workers (hyperconverged) or on dedicated machines. Both scenarios have advantages; see our blog post "Dedicated storage nodes or not?"
Step 1: Prerequisites
We assume that you have a cluster with three or more nodes:
- Each node has an IP address; if you work in the cloud, use the private IP.
- You can resolve the hostname of each node, e.g. via DNS or by creating a hosts file on each node (DNS is preferred in a distributed system, since it avoids creating and managing hosts files on every node).
- You can SSH into each node and run sudo (or log in as root) to perform the actual installation.
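These prerequisites are easy to script. Here is a minimal sketch that checks hostname resolution for every node; the node name is a placeholder (localhost) that you would replace with your real node names, and you can extend the loop with an SSH probe for the login requirement:

```shell
# Sanity check: verify that every cluster node's hostname resolves.
# "localhost" is a placeholder -- list your real node names here.
nodes="localhost"
unresolved=""
for node in $nodes; do
  if getent hosts "$node" > /dev/null; then
    echo "resolves: $node"
  else
    unresolved="$unresolved $node"
  fi
done
[ -z "$unresolved" ] || echo "UNRESOLVED:$unresolved"
```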
There will be two roles in the cluster:
- Master node: runs the YARN ResourceManager. Since we replaced HDFS with Quobyte, there is no need for a NameNode here. The master node can also be a worker node.
- Worker nodes: run the NodeManager to execute the jobs. Again, no DataNode is required since we use Quobyte instead of HDFS.
Please remember that YARN doesn't come with any security enabled by default. You should make sure that your cluster is not accessible from the outside world.
Step 2: Install and Configure Quobyte
Quobyte cluster installation
If you don't have a Quobyte cluster installed, see our installation quick start here and the recommended hardware here. You can use the installer for both deployment modes: dedicated storage servers or hyperconverged.
Next, make sure to install the Quobyte client on all your Hadoop nodes. If you are installing a new Quobyte cluster, you can provide a list of client nodes to the Quobyte installer: create a file called clients.txt with one IP/hostname per line, and specify this file when the installer asks for the list of clients.
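For example (the worker names below are placeholders for your own nodes):

```shell
# Create clients.txt with one IP/hostname per line.
# worker-1..worker-3 are placeholder names -- substitute your nodes.
cat > clients.txt <<'EOF'
worker-1
worker-2
worker-3
EOF
wc -l clients.txt
```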
Use the Quobyte installer
If you already have a running Quobyte cluster you can use the Quobyte installer, too. Run
./install_quobyte add-client --qns-id qns-id user@hostname
for each Hadoop worker node. The installer tells you the exact command to add clients, including your qns-id.
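If you have many workers, you can drive the installer from a small loop. This is only a sketch: the hostnames, the user, and the qns-id are placeholders, and the loop writes the commands into a helper script so you can review them before running anything:

```shell
# Generate one add-client command per host; review, then run the script.
# Hostnames, "user", and "qns-id" below are placeholders.
printf 'worker-1\nworker-2\nworker-3\n' > clients.txt
: > add-clients.sh
while read -r host; do
  echo "./install_quobyte add-client --qns-id qns-id user@${host}" >> add-clients.sh
done < clients.txt
cat add-clients.sh
```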
Step 3: Quobyte Policies for Hadoop
Policy setup
Once your Quobyte cluster is up and running, the devices have been formatted, and the clients are installed, you can set the policy for your Hadoop volumes. The Hadoop policy is designed for any big data application that benefits from data locality, including Apache Spark and HBase. The choice of policy depends on whether you run hyperconverged (worker nodes are also Quobyte storage servers) or with dedicated storage servers.
Policy for Hyperconverged Setups
- Go to the Quobyte Webconsole and select the "Policy Rules" tab. Select "Create rule with predefined policy..." from the dropdown.
- Give your policy a name and add a restriction in the Policy Scope section to limit this policy to volumes that have a hadoop label.
- Then pick the "Hadoop/Mapreduce on HDDs" policy from the policy definition dropdown. This policy uses erasure coding to create redundancy for the files and optimizes placement for Hadoop: stripes are placed locally on the machine that creates the data, and all data is stored on HDDs only. Alternatively, you can use the "Hadoop/Mapreduce on SSDs" policy if you have flash.
- Finally, click Create.
Policy for Dedicated Storage Servers
With dedicated storage servers we don't have the option of storing data locally, as Hadoop normally does, so we simply use a policy that stores data with erasure coding on HDDs. If you want to use SSDs instead, pick the "Automatic EC Redundancy for Files on SSDs" policy.
Create a Quobyte Volume
Next, we have to create a volume to store the data to process and the results. Go to the Volumes tab and select "Create Volume…" from the Volumes dropdown:
Choose a name for your new volume and don't forget to add a hadoop label (an empty value is OK) so the Hadoop policy is applied to your new volume.
Finally, get your registry address from the Quobyte UI (you'll need it later):
Step 4: Prepare the Hadoop Install
Before we can start with the actual installation, we have to prepare the cluster by installing Java, creating a hadoop user, and enabling SSH login with a key from the master node. Pick any of the machines to be your master node, then execute the following steps:
Install the OpenJDK on all nodes, e.g. on CentOS you just run
sudo yum install -y java-1.8.0-openjdk-devel
Install the Quobyte hadoop driver on all nodes:
sudo yum install -y quobyte-hadoop
Create the hadoop user on each node; this user will be used to install and run Apache Hadoop and YARN:
sudo useradd hadoop
YARN uses SSH to control the worker nodes. Create an SSH key for the hadoop user on the master node:
sudo -u hadoop ssh-keygen
Add content on all nodes
Add the contents of /home/hadoop/.ssh/id_rsa.pub to /home/hadoop/.ssh/authorized_keys on all nodes.
Make sure not to overwrite the file. If you have to create the .ssh directory or the authorized_keys file, set the permissions correctly:
sudo -u hadoop bash
mkdir /home/hadoop/.ssh/
chmod 0700 /home/hadoop/.ssh
echo "" >> /home/hadoop/.ssh/authorized_keys
chmod 0600 /home/hadoop/.ssh/authorized_keys
SSH from your master node to all other nodes (including the master node itself) to verify that you can log in without a password.
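This check can be scripted from the master node. A minimal sketch, assuming a workers.txt file with one hostname per line (localhost here stands in for your real nodes); BatchMode makes ssh fail instead of prompting when key-based login is not set up:

```shell
# Probe passwordless SSH as the hadoop user to every listed node.
# "localhost" is a placeholder -- put your real node names in workers.txt.
printf 'localhost\n' > workers.txt
while read -r h; do
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@${h}" true 2>/dev/null; then
    echo "ok: ${h}"
  else
    echo "FAILED: ${h}"
  fi
done < workers.txt > ssh-check.log
cat ssh-check.log
```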
Step 5: Install Hadoop
Finally, we can install Hadoop itself:
Log into the master node as user hadoop
Download hadoop on the master (pick a mirror that works for you here)
wget https://apache-mirror.tld/apache/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extract the tar archive
tar xzf hadoop-3.3.1.tar.gz
Rename the directory to hadoop
mv hadoop-3.3.1 hadoop
Add hadoop binaries to path for convenience. Add the following line to /home/hadoop/.profile:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
Add the following to your .bashrc file:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
Don't forget to run
. ~/.bashrc
to update the PATH variable in your current bash shell.
Find your path
Find the path for your JAVA_HOME, e.g. with
alternatives --display java
Take the path which says "link currently points to" and remove the trailing bin/java.
Configure the JAVA_HOME in /home/hadoop/hadoop/etc/hadoop/hadoop-env.sh. Add a line like this, replacing the path with your JDK:
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre"
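Alternatively, if java is already on the PATH, you can derive the value by resolving the symlink chain and stripping the trailing bin/java. This is a common shell idiom, shown here as a sketch:

```shell
# Derive JAVA_HOME from the java binary on the PATH, if present.
if command -v java > /dev/null 2>&1; then
  java_home="$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")"
  echo "JAVA_HOME=${java_home}"
else
  echo "java not found on PATH"
fi
```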
Step 6: Configure Hadoop and YARN
Next, we have to configure Hadoop to use Quobyte as the default file system and to use YARN as the scheduler. Execute the following steps on the master node.
Log into the master node as user hadoop
Copy the Quobyte driver JAR into the Hadoop directory
cp /opt/quobyte/hadoop/quobyte_hadoop_plugin.jar /home/hadoop/hadoop/share/hadoop/hdfs/
Configure Quobyte as the file system.
Some applications, like YARN, use the AbstractFileSystem interface; Quobyte provides both drivers. Edit /home/hadoop/hadoop/etc/hadoop/core-site.xml and add the following to the <configuration> section of the file:
<property>
  <name>fs.defaultFS</name>
  <value>quobyte:///</value>
</property>
<property>
  <name>fs.quobyte.impl</name>
  <value>com.quobyte.hadoop.interfaces.QuobyteFileSystemAdapter</value>
</property>
<property>
  <name>fs.AbstractFileSystem.quobyte.impl</name>
  <value>com.quobyte.hadoop.interfaces.QuobyteAbstractFileSystemAdapter</value>
</property>
<property>
  <name>com.quobyte.hadoop.backend</name>
  <value>JNI</value>
</property>
<property>
  <name>com.quobyte.hadoop.registry</name>
  <!-- Replace this with your registry's address -->
  <value>qns-id.myquobyte.net</value>
</property>
<property>
  <name>com.quobyte.hadoop.volume</name>
  <!-- Replace demo-1 with your Quobyte volume name -->
  <value>demo-1</value>
</property>
Configure YARN as the default scheduler.
Add the following to the <configuration> section in /home/hadoop/hadoop/etc/hadoop/mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
Add to the configuration section
Configure YARN in /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and add the following to the <configuration> section:
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <!-- Replace with your master node address/IP -->
  <value>Insert Your Master Node's IP address</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Workers IPs
Put the list of IPs of your workers in /home/hadoop/hadoop/etc/hadoop/workers.
Tar up the hadoop directory, copy it to the workers, and extract it there:
cd /home/hadoop
tar czf hadoop-configured.tgz hadoop
for h in `cat /home/hadoop/hadoop/etc/hadoop/workers`
do
  scp hadoop-configured.tgz $h:
  ssh $h "tar xzf hadoop-configured.tgz"
done
Step 7: Test the Quobyte HDFS Driver
Check your Quobyte driver
Time to check that your Quobyte driver is working properly and is able to contact your Quobyte cluster. Execute the following command on the master node and check that it doesn't report any errors. You'll see some logging output from the Quobyte driver in the console.
hdfs dfs -mkdir -p /testdir
Step 8: Start YARN and run a Test
The last step is to start YARN and submit a mapreduce job to the scheduler.
Start the YARN scheduler on the master node. The master will ssh into the worker nodes and start the YARN processes there.
start-yarn.sh
Ensure that all workers are ready by running this command on the master node:
yarn node -list
Log into any node with a Quobyte client, e.g. the master or any worker. Run the following commands to download sample data to use for our demo. This will download the text of three books:
cd /quobyte/demo-1
mkdir demo-data
cd demo-data
wget -O two-cities.txt https://www.gutenberg.org/files/98/98-0.txt
wget -O metamorphosis.txt https://www.gutenberg.org/files/5200/5200-0.txt
wget -O peter-pan.txt https://www.gutenberg.org/files/16/16-0.txt
Submit the mapreduce job to YARN from the master node. This job will count the occurrence of each word in the books:
The output will be on the volume in /quobyte/demo-1/wordcount-result
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
  wordcount "/demo-data/*" wordcount-result
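What the job computes is easy to picture: conceptually, wordcount does the same thing as this local shell pipeline, shown here on a tiny inline sample rather than the downloaded books:

```shell
# Local illustration of what the wordcount job computes:
# split the text into words, then count occurrences of each word.
printf 'the cat and the hat\n' > sample.txt
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -rn > counts.txt
cat counts.txt
```

Running this prints "the" with a count of 2 and the other words with a count of 1, which is exactly the per-word tally the MapReduce job writes to its output directory.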
You can now run distributed hadoop jobs directly on Quobyte and benefit from the easy data sharing between Hadoop, S3 and the Linux world.