Apache Hadoop is an implementation of MapReduce, and Apache YARN is the job scheduler for Hadoop. YARN takes care of running your Hadoop jobs in parallel on all of your worker nodes.
You can run the Quobyte services on the same machines as the Hadoop workers (hyperconverged) or on dedicated machines. Both scenarios have advantages; see our blog post "Dedicated storage nodes or not?"
Step 1: Prerequisites
We assume that you have a cluster with three or more nodes:
- Each node has an IP address; if you work in the cloud, use the private IP.
- You can resolve the hostname of each node, e.g. via DNS or by creating a hosts file on each node (DNS is preferred in a distributed system, since it avoids creating and managing hosts files on every node).
- You can SSH into each node and run sudo (or log in as root) to perform the actual installation.
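These prerequisites are easy to script. Here is a minimal sketch that checks hostname resolution for every node; the node name is a placeholder (localhost) that you would replace with your real node names, and you can extend the loop with an SSH probe for the login requirement:

```shell
# Sanity check: verify that every cluster node's hostname resolves.
# "localhost" is a placeholder -- list your real node names here.
nodes="localhost"
unresolved=""
for node in $nodes; do
  if getent hosts "$node" > /dev/null; then
    echo "resolves: $node"
  else
    unresolved="$unresolved $node"
  fi
done
[ -z "$unresolved" ] || echo "UNRESOLVED:$unresolved"
```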
There will be two roles in the cluster:
- Master node: runs the YARN ResourceManager. Since we replaced HDFS with Quobyte, there is no need for a NameNode here. The master node can also be a worker node.
- Worker nodes: run the NodeManager to execute the jobs. Again, no DataNode is required since we use Quobyte instead of HDFS.
Please remember that YARN doesn't come with any security enabled by default. You should make sure that your cluster is not accessible from the outside world.
Step 2: Install and Configure Quobyte
Quobyte cluster installation
If you don't have a Quobyte cluster installed, see our installation quick start here and the recommended hardware here. You can use the installer for both deployment modes: dedicated storage servers or hyperconverged.
Next, make sure to install the Quobyte client on all your Hadoop nodes. If you are installing a new Quobyte cluster, you can provide a list of client nodes to the Quobyte installer: create a file called clients.txt with one IP/hostname per line, and specify this file when the installer asks for the list of clients.
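For example (the worker names below are placeholders for your own nodes):

```shell
# Create clients.txt with one IP/hostname per line.
# worker-1..worker-3 are placeholder names -- substitute your nodes.
cat > clients.txt <<'EOF'
worker-1
worker-2
worker-3
EOF
wc -l clients.txt
```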
Use the Quobyte installer
If you already have a running Quobyte cluster you can use the Quobyte installer, too. Run
./install_quobyte add-client --qns-id qns-id user@hostname
for each Hadoop worker node. The installer tells you the exact command to add clients, including your qns-id.
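If you have many workers, you can drive the installer from a small loop. This is only a sketch: the hostnames, the user, and the qns-id are placeholders, and the loop writes the commands into a helper script so you can review them before running anything:

```shell
# Generate one add-client command per host; review, then run the script.
# Hostnames, "user", and "qns-id" below are placeholders.
printf 'worker-1\nworker-2\nworker-3\n' > clients.txt
: > add-clients.sh
while read -r host; do
  echo "./install_quobyte add-client --qns-id qns-id user@${host}" >> add-clients.sh
done < clients.txt
cat add-clients.sh
```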
Step 3: Quobyte Policies for Hadoop
Policy setup
Once your Quobyte cluster is up and running, the devices have been formatted, and the clients are installed, you can set the policy for your Hadoop volumes. The Hadoop policy is designed for any big data application that benefits from data locality, including Apache Spark and HBase. The choice of policy depends on whether you run hyperconverged (worker nodes are also Quobyte storage servers) or with dedicated storage servers.
Policy for Hyperconverged Setups
- Go to the Quobyte Webconsole and select the "Policy Rules" tab. Select "Create rule with predefined policy..." from the dropdown.
- Give your policy a name and add a restriction in the Policy Scope section to limit this policy to volumes that have a hadoop label.
- Then pick the "Hadoop/Mapreduce on HDDs" policy from the policy definition dropdown. This policy uses erasure coding to create redundancy for the files and optimizes placement for Hadoop: stripes are placed locally on the machine that creates the data, and all data is stored on HDDs only. Alternatively, you can use the "Hadoop/Mapreduce on SSDs" policy if you have flash.
- Finally, click Create.
Policy for Dedicated Storage Servers
With dedicated storage servers we don't have the option of storing data locally, as Hadoop normally does, so we simply use a policy that stores data with erasure coding on HDDs. If you want to use SSDs instead, pick the "Automatic EC Redundancy for Files on SSDs" policy.
Create a Quobyte Volume
Next, we have to create a volume to store the data to process and the results. Go to the Volumes tab and select "Create Volume…" from the Volumes dropdown:
Choose a name for your new volume and don't forget to add a hadoop label (an empty value is OK) so the Hadoop policy is applied to your new volume.
Finally, get your registry address from the Quobyte UI (you'll need it later):
Step 4: Prepare the Hadoop Install
Before we can start with the actual installation, we have to prepare the cluster by installing Java, creating a hadoop user, and enabling SSH login with a key from the master node. Pick any of the machines to be your master node, then execute the following steps:
Install the OpenJDK on all nodes, e.g. on CentOS you just run
sudo yum install -y java-1.8.0-openjdk-devel
Install the Quobyte hadoop driver on all nodes:
sudo yum install -y quobyte-hadoop
Create the hadoop user on each node; this user will be used to install and run Apache Hadoop and YARN:
sudo useradd hadoop
YARN uses SSH to control the worker nodes. Create an SSH key for the hadoop user on the master node:
sudo -u hadoop ssh-keygen
Add content on all nodes
Add the contents of /home/hadoop/.ssh/id_rsa.pub to /home/hadoop/.ssh/authorized_keys on all nodes.
Make sure not to overwrite the file. If you have to create the .ssh directory or the authorized_keys file, set the permissions correctly:
sudo -u hadoop bash
mkdir /home/hadoop/.ssh/
chmod 0700 /home/hadoop/.ssh
echo "" >> /home/hadoop/.ssh/authorized_keys
chmod 0600 /home/hadoop/.ssh/authorized_keys
SSH from your master node to all other nodes (including the master node itself) to verify that you can log in without a password.
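This check can be scripted from the master node. A minimal sketch, assuming a workers.txt file with one hostname per line (localhost here stands in for your real nodes); BatchMode makes ssh fail instead of prompting when key-based login is not set up:

```shell
# Probe passwordless SSH as the hadoop user to every listed node.
# "localhost" is a placeholder -- put your real node names in workers.txt.
printf 'localhost\n' > workers.txt
while read -r h; do
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@${h}" true 2>/dev/null; then
    echo "ok: ${h}"
  else
    echo "FAILED: ${h}"
  fi
done < workers.txt > ssh-check.log
cat ssh-check.log
```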
Step 5: Install Hadoop
Finally, we can install Hadoop itself:
Log into the master node as user hadoop
Download hadoop on the master (pick a mirror that works for you here)
wget https://apache-mirror.tld/apache/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extract the tar archive
tar xzf hadoop-3.3.1.tar.gz
Rename the directory to hadoop
mv hadoop-3.3.1 hadoop
Add hadoop binaries to path for convenience. Add the following line to /home/hadoop/.profile:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
Add the following to your .bashrc file:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
Don't forget to run
. ~/.bashrc
to update the PATH variable in your current bash shell.
Find your path
Find the path for your JAVA_HOME, e.g. with
alternatives --display java
Take the path which says "link currently points to" and remove the trailing bin/java.
Configure the JAVA_HOME in /home/hadoop/hadoop/etc/hadoop/hadoop-env.sh. Add a line like this, replacing the path with your JDK:
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre"
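Alternatively, if java is already on the PATH, you can derive the value by resolving the symlink chain and stripping the trailing bin/java. This is a common shell idiom, shown here as a sketch:

```shell
# Derive JAVA_HOME from the java binary on the PATH, if present.
if command -v java > /dev/null 2>&1; then
  java_home="$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")"
  echo "JAVA_HOME=${java_home}"
else
  echo "java not found on PATH"
fi
```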
Step 6: Configure Hadoop and YARN
Next, we have to configure Hadoop to use Quobyte as the default file system and to use YARN as the scheduler. Execute the following steps on the master node.
Log into the master node as user hadoop
Copy the Quobyte driver JAR into the Hadoop directory
cp /opt/quobyte/hadoop/quobyte_hadoop_plugin.jar /home/hadoop/hadoop/share/hadoop/hdfs/
Configure Quobyte as the file system.
Some applications, like YARN, use the AbstractFileSystem interface; Quobyte provides both drivers. Edit /home/hadoop/hadoop/etc/hadoop/core-site.xml and add the following to the <configuration> section of the file:
<property>
  <name>fs.defaultFS</name>
  <value>quobyte:///</value>
</property>
<property>
  <name>fs.quobyte.impl</name>
  <value>com.quobyte.hadoop.interfaces.QuobyteFileSystemAdapter</value>
</property>
<property>
  <name>fs.AbstractFileSystem.quobyte.impl</name>
  <value>com.quobyte.hadoop.interfaces.QuobyteAbstractFileSystemAdapter</value>
</property>
<property>
  <name>com.quobyte.hadoop.backend</name>
  <value>JNI</value>
</property>
<property>
  <name>com.quobyte.hadoop.registry</name>
  <!-- Replace this with your registry's address -->
  <value>qns-id.myquobyte.net</value>
</property>
<property>
  <name>com.quobyte.hadoop.volume</name>
  <!-- Replace demo-1 with your Quobyte volume name -->
  <value>demo-1</value>
</property>
Configure YARN as the default scheduler.
Add the following to the <configuration> section in /home/hadoop/hadoop/etc/hadoop/mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
Add to the configuration section
Configure YARN in /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and add the following to the <configuration> section:
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <!-- Replace with your master node address/IP -->
  <value>Insert Your Master Node's IP address</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Workers IPs
Put the list of IPs of your workers in /home/hadoop/hadoop/etc/hadoop/workers.
Tar up the hadoop directory, copy it to the workers, and extract it there:
cd /home/hadoop
tar czf hadoop-configured.tgz hadoop
for h in `cat /home/hadoop/hadoop/etc/hadoop/workers`
do
  scp hadoop-configured.tgz $h:
  ssh $h "tar xzf hadoop-configured.tgz"
done
Step 7: Test the Quobyte HDFS Driver
Check your Quobyte driver
Time to check that your Quobyte driver is working properly and is able to contact your Quobyte cluster. Execute the following command on the master node and check that it doesn't report any errors. You'll see some logging output from the Quobyte driver in the console.
hdfs dfs -mkdir -p /testdir
Step 8: Start YARN and run a Test
The last step is to start YARN and submit a mapreduce job to the scheduler.
Start the YARN scheduler on the master node. The master will ssh into the worker nodes and start the YARN processes there.
start-yarn.sh
Ensure that all workers are ready by running this command on the master node:
yarn node -list
Log into any node with a Quobyte client, e.g. the master or any worker. Run the following commands to download sample data to use for our demo. This will download the text of three books:
cd /quobyte/demo-1
mkdir demo-data
cd demo-data
wget -O two-cities.txt https://www.gutenberg.org/files/98/98-0.txt
wget -O metamorphosis.txt https://www.gutenberg.org/files/5200/5200-0.txt
wget -O peter-pan.txt https://www.gutenberg.org/files/16/16-0.txt
Submit the mapreduce job to YARN from the master node. This job will count the occurrence of each word in the books:
The output will be on the volume in /quobyte/demo-1/wordcount-result
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
  wordcount "/demo-data/*" wordcount-result
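What the job computes is easy to picture: conceptually, wordcount does the same thing as this local shell pipeline, shown here on a tiny inline sample rather than the downloaded books:

```shell
# Local illustration of what the wordcount job computes:
# split the text into words, then count occurrences of each word.
printf 'the cat and the hat\n' > sample.txt
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -rn > counts.txt
cat counts.txt
```

Running this prints "the" with a count of 2 and the other words with a count of 1, which is exactly the per-word tally the MapReduce job writes to its output directory.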
You can now run distributed hadoop jobs directly on Quobyte and benefit from the easy data sharing between Hadoop, S3 and the Linux world.