First Hadoop Eclipse Program WordCount Example
Recommendation
First, a Hadoop 2.6 setup must be in place and started; if not, follow my previous post.
Eclipse or Spring Tool Suite must be installed.
Download the Hadoop Eclipse plugin from GitHub:
https://github.com/winghc/hadoop2x-eclipse-plugin
Steps:
Copy the Hadoop plugin JAR to the Eclipse plugins folder.
Note: choose the plugin according to your Hadoop version.
Now start Eclipse:
Create a new project of type Map/Reduce Project.
The Hadoop library path is your Hadoop home folder location.
After the project is created, create a new class named WordCount in the source folder.
Then copy and paste the code below:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) pairs.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Now set up the DFS location in Eclipse.
Right-click in the Map/Reduce Locations view and choose New Hadoop Location.
The DFS port must match the port configured in core-site.xml; in my case it is 9000.
The new location should now appear under DFS Locations, as in the snapshot below.
Now create the input directory using HDFS,
and create the output location the same way, in the same place, with the same command.
Now upload a sample test file into the input directory by right-clicking on it and using the Upload file option.
In my case the sample file is wcsample.txt:
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
Welcome to the world of Haddop
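Equivalently, you can create the input directory and upload the sample file from the command line (a quick sketch; the HDFS path /input is only an example):

$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put wcsample.txt /input/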
After the upload, right-click WordCount.java, open Run Configurations, and pass command-line arguments as shown below.
Then run the job, go to the output location, and view the results:
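For example, the two program arguments are the HDFS input path and an output path for the results (illustrative values, assuming the core-site.xml port 9000 used above):

hdfs://localhost:9000/input hdfs://localhost:9000/output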
Haddop 10
Welcome 10
of 10
the 10
to 10
world 10
Enjoy 🙂
Hadoop 2.6 setup single node
Apache Hadoop 2.6 brings significant improvements over the previous stable 2.x.y releases. This version has many improvements in HDFS and MapReduce. This how-to guide will help you install Hadoop 2.6 on CentOS/RHEL 7/6/5 and Ubuntu systems. This article does not cover the overall configuration of Hadoop; it covers only the basic configuration required to start working with Hadoop.

Step 1: Installing Java
Java is the primary requirement for running Hadoop on any system, so make sure Java is installed using the following command.
# java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
If you don't have Java installed on your system, use one of the following links to install it first.
Install Java 8 on CentOS/RHEL 7/6/5
Install Java 8 on Ubuntu
Step 2: Creating Hadoop User
We recommend creating a normal (not root) account for working with Hadoop, so create a system account using the following commands.
# useradd hadoop
# passwd hadoop
After creating the account, you also need to set up key-based SSH to its own account. To do this, execute the following commands.
# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Let's verify key-based login. The command below should not ask for a password, but the first time it will prompt to add the host's RSA key to the list of known hosts.
$ ssh localhost
$ exit
Step 3. Downloading Hadoop 2.6.0
Now download the Hadoop 2.6.0 archive using the commands below. You can also select an alternate download mirror for faster downloads.
$ cd ~
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0 hadoop
Step 4. Configure Hadoop Pseudo-Distributed Mode
4.1. Setup Environment Variables
First we need to set the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply the changes to the current running environment:
$ source ~/.bashrc
Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable:
export JAVA_HOME=/opt/jdk1.8.0_31/
4.2. Edit Configuration Files
Hadoop has many configuration files, which need to be configured according to the requirements of your Hadoop infrastructure. Let's start with the configuration for a basic single-node Hadoop cluster. First navigate to the location below:
$ cd $HADOOP_HOME/etc/hadoop
Edit core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Edit mapred-site.xml (if it does not exist yet, copy mapred-site.xml.template to mapred-site.xml first)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
4.3. Format Namenode
Now format the NameNode using the following command; make sure the output reports that the storage directory has been successfully formatted.
$ hdfs namenode -format
Sample output:
15/02/04 09:58:43 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svr1.tecadmin.net/192.168.1.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.0
...
...
15/02/04 09:58:57 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
15/02/04 09:58:57 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/02/04 09:58:57 INFO util.ExitUtil: Exiting with status 0
15/02/04 09:58:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at svr1.tecadmin.net/192.168.1.133
************************************************************/
Step 5. Start Hadoop Cluster
Let's start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to your Hadoop sbin directory and execute the scripts one by one.
$ cd $HADOOP_HOME/sbin/
Now run the start-dfs.sh script.
$ start-dfs.sh
Sample output:
15/02/04 10:00:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-svr1.tecadmin.net.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-svr1.tecadmin.net.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 3c:c4:f6:f1:72:d9:84:f9:71:73:4a:0d:55:2c:f9:43.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-svr1.tecadmin.net.out
15/02/04 10:01:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Now run the start-yarn.sh script.
$ start-yarn.sh
Sample output:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-svr1.tecadmin.net.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-svr1.tecadmin.net.out
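Optionally, you can verify that all five daemons came up using the jps tool (a quick sanity check; the process IDs will differ on your machine):

$ jps
2451 NameNode
2588 DataNode
2763 SecondaryNameNode
2911 ResourceManager
3042 NodeManager
3187 Jps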
Step 6. Access Hadoop Services in Browser
The Hadoop NameNode web UI starts on port 50070 by default. Access your server on port 50070 in your favorite web browser.
http://localhost:50070/dfshealth.html#tab-overview
http://localhost:8088/cluster
Enjoy 🙂
Time complexity Understanding
Posted by vlakhma in Hadoop, Uncategorized on May 21, 2014
1. Introduction
In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the length of the string representing the input.
2. Big O notation
The time complexity of an algorithm is commonly expressed using big O notation, which excludes coefficients and lower order terms. When expressed this way, the time complexity is said to be described asymptotically, i.e., as the input size goes to infinity.
For example, if the time required by an algorithm on all inputs of size n is at most 5n³ + 3n, the asymptotic time complexity is O(n³). More on that later.
Few more Examples:
- 1 = O(n)
- n = O(n²)
- log(n) = O(n)
- 2n + 1 = O(n)
3. O(1) Constant Time:
An algorithm is said to run in constant time if it requires the same amount of time regardless of the input size.
Examples:
- array: accessing any element
- fixed-size stack: push and pop methods
- fixed-size queue: enqueue and dequeue methods
4. O(n) Linear Time
An algorithm is said to run in linear time if its time execution is directly proportional to the input size, i.e. time grows linearly as input size increases.
Consider the following example: below I am linearly searching for an element, which has a time complexity of O(n).
int find = 66;
var numbers = new int[] { 33, 435, 36, 37, 43, 45, 66, 656, 2232 };
// Linear search: in the worst case every element is inspected, hence O(n).
for (int i = 0; i < numbers.Length; i++)
{
    if (find == numbers[i])
    {
        return;
    }
}
More Examples:
- Array: linear search, traversal, finding the minimum, etc.
- ArrayList: contains method
- Queue: contains method
5. O(log n) Logarithmic Time:
An algorithm is said to run in logarithmic time if its time execution is proportional to the logarithm of the input size.
Example: Binary Search
Recall the “twenty questions” game – the task is to guess the value of a hidden number in an interval. Each time you make a guess, you are told whether your guess is too high or too low. The twenty questions game implies a strategy that uses each guess to halve the interval size. This is an example of the general problem-solving method known as binary search.
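To make this concrete, here is a small iterative binary search in Java (my own illustrative example; the array values are arbitrary). Each comparison halves the remaining interval, so at most about log2(n) + 1 steps are needed:

// Binary search on a sorted array: each comparison halves the remaining
// interval, so the running time is O(log n).
public class BinarySearchDemo {
    static int binarySearch(int[] sorted, int key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;      // middle of the current interval
            if (sorted[mid] == key) {
                return mid;                    // found: return the index
            } else if (sorted[mid] < key) {
                lo = mid + 1;                  // discard the lower half
            } else {
                hi = mid - 1;                  // discard the upper half
            }
        }
        return -1;                             // not found
    }

    public static void main(String[] args) {
        int[] numbers = { 33, 36, 37, 43, 45, 66, 435, 656, 2232 };
        System.out.println(binarySearch(numbers, 66));   // prints 5
    }
}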
6. O(n²) Quadratic Time
An algorithm is said to run in quadratic time if its time execution is proportional to the square of the input size.
Examples: bubble sort, insertion sort, selection sort, and traversing a 2-D matrix (nested loops over the input).
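For instance, a textbook bubble sort makes two nested passes over the array, which is on the order of n² comparisons (a minimal illustrative sketch in Java):

public class BubbleSortDemo {
    // Bubble sort: the nested loops perform roughly n*(n-1)/2 comparisons
    // for an array of length n, i.e. O(n^2).
    static void bubbleSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            for (int j = 0; j < a.length - 1 - i; j++) {
                if (a[j] > a[j + 1]) {
                    int tmp = a[j];            // swap adjacent out-of-order elements
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] a = { 5, 1, 4, 2, 8 };
        bubbleSort(a);
        System.out.println(java.util.Arrays.toString(a));   // [1, 2, 4, 5, 8]
    }
}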
Cassandra Setup on a Local System
Posted by vlakhma in Hadoop, Uncategorized on April 23, 2014
Download Cassandra from:
http://cassandra.apache.org/
After extracting Cassandra to some folder (on my Windows box I placed it directly in D:\cassandra), the only file you need to edit is conf/storage-conf.xml. While Cassandra is engineered to run on a large number of machines in a network, we start it here as a single node with the default parameter set, so most of the settings are fine for now.
If you are not on a Unix-like system, you need to update the folders where Cassandra is supposed to store its data. If you are using Windows (like me), find the following lines in conf/storage-conf.xml and change the paths to something sensible:
<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
<DataFileDirectories>
  <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
</DataFileDirectories>
<CalloutLocation>/var/lib/cassandra/callouts</CalloutLocation>
<BootstrapFileDirectory>/var/lib/cassandra/bootstrap</BootstrapFileDirectory>
<StagingFileDirectory>/var/lib/cassandra/staging</StagingFileDirectory>
For example, my settings:
<CommitLogDirectory>D:/cassandra/data/commitlog</CommitLogDirectory>
<DataFileDirectories>
  <DataFileDirectory>D:/cassandra/data/data</DataFileDirectory>
</DataFileDirectories>
<CalloutLocation>D:/cassandra/data/callouts</CalloutLocation>
<BootstrapFileDirectory>D:/cassandra/data/bootstrap</BootstrapFileDirectory>
<StagingFileDirectory>D:/cassandra/data/staging</StagingFileDirectory>
Let's take Cassandra for a spin and check if it starts up correctly. For Mac OS, Linux, etc. users, simply change to the bin directory of Cassandra and run ./cassandra. As an aside for the impatient, I start Cassandra with sudo to avoid trouble with Cassandra's system.log.
Windows users who use the command line (meaning not Cygwin), however, cannot start it just like that. The cassandra.bat didn't work for me on my Vista box when executed with bin as the current working directory (probably due to the CASSANDRA_HOME environment variable that gets incorrectly set in the batch file). BUT it works perfectly if you call bin\cassandra.bat from Cassandra's main directory above bin. So if you are on Windows, change to the directory where you extracted Cassandra and execute bin\cassandra.bat.
Once everything is set up correctly:
Go to your CASSANDRA_HOME/bin folder and type sudo ./cassandra in your terminal;
then you will see processing output like the following:
vishal@vishal-Compaq-610 ~/cassendra/apache-cassandra-1.2.1 $ cd bin
vishal@vishal-Compaq-610 ~/cassendra/apache-cassandra-1.2.1/bin $ sudo ./cassandra
xss = -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1024M -Xmx1024M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss180k
vishal@vishal-Compaq-610 ~/cassendra/apache-cassandra-1.2.1/bin $ INFO 23:14:52,960 Logging initialized
INFO 23:14:53,001 JVM vendor/version: Java HotSpot(TM) Server VM/1.7.0_04
INFO 23:14:53,001 Heap size: 1052770304/1052770304
INFO 23:14:53,002 Classpath: ./../conf:./../build/classes/main:./../build/classes/thrift:./../lib/antlr-3.2.jar:./../lib/apache-cassandra-1.2.1.jar:./../lib/apache-cassandra-clientutil-1.2.1.jar:./../lib/apache-cassandra-thrift-1.2.1.jar:./../lib/avro-1.4.0-fixes.jar:./../lib/avro-1.4.0-sources-fixes.jar:./../lib/commons-cli-1.1.jar:./../lib/commons-codec-1.2.jar:./../lib/commons-lang-2.6.jar:./../lib/compress-lzf-0.8.4.jar:./../lib/concurrentlinkedhashmap-lru-1.3.jar:./../lib/guava-13.0.1.jar:./../lib/high-scale-lib-1.1.2.jar:./../lib/jackson-core-asl-1.9.2.jar:./../lib/jackson-mapper-asl-1.9.2.jar:./../lib/jamm-0.2.5.jar:./../lib/jline-1.0.jar:./../lib/json-simple-1.1.jar:./../lib/libthrift-0.7.0.jar:./../lib/log4j-1.2.16.jar:./../lib/metrics-core-2.0.3.jar:./../lib/netty-3.5.9.Final.jar:./../lib/servlet-api-2.5-20081211.jar:./../lib/slf4j-api-1.7.2.jar:./../lib/slf4j-log4j12-1.7.2.jar:./../lib/snakeyaml-1.6.jar:./../lib/snappy-java-1.0.4.1.jar:./../lib/snaptree-0.1.jar:./../lib/jamm-0.2.5.jar
INFO 23:14:53,005 JNA not found. Native methods will be disabled.
INFO 23:14:53,038 Loading settings from file:/home/vishal/cassendra/apache-cassandra-1.2.1/conf/cassandra.yaml
INFO 23:14:53,728 32bit JVM detected. It is recommended to run Cassandra on a 64bit JVM for better performance.
INFO 23:14:53,728 DiskAccessMode ‘auto’ determined to be standard, indexAccessMode is standard
INFO 23:14:53,728 disk_failure_policy is stop
INFO 23:14:53,736 Global memtable threshold is enabled at 334MB
INFO 23:14:54,586 Initializing key cache with capacity of 50 MBs.
INFO 23:14:54,601 Scheduling key cache save to each 14400 seconds (going to save all keys).
INFO 23:14:54,603 Initializing row cache with capacity of 0 MBs and provider org.apache.cassandra.cache.SerializingCacheProvider
INFO 23:14:54,613 Scheduling row cache save to each 0 seconds (going to save all keys).
INFO 23:14:54,811 Opening /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ib-1 (260 bytes)
INFO 23:14:54,814 Opening /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ib-2 (260 bytes)
INFO 23:14:54,885 Opening /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ib-3 (259 bytes)
INFO 23:14:54,954 Opening /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ib-3 (4428 bytes)
INFO 23:14:54,955 Opening /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ib-1 (4429 bytes)
INFO 23:14:54,984 Opening /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ib-2 (4428 bytes)
INFO 23:14:55,021 Opening /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-1 (3747 bytes)
INFO 23:14:55,021 Opening /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-3 (3750 bytes)
INFO 23:14:55,024 Opening /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-2 (3749 bytes)
INFO 23:14:55,065 Opening /var/lib/cassandra/data/system/local/system-local-ib-18 (433 bytes)
INFO 23:14:55,068 Opening /var/lib/cassandra/data/system/local/system-local-ib-17 (109 bytes)
INFO 23:14:55,080 Opening /var/lib/cassandra/data/system/local/system-local-ib-16 (120 bytes)
INFO 23:14:55,649 Opening /var/lib/cassandra/data/system_auth/users/system_auth-users-ib-1 (72 bytes)
INFO 23:14:55,723 completed pre-loading (3 keys) key cache.
INFO 23:14:55,732 Replaying /var/lib/cassandra/commitlog/CommitLog-2-1362197776686.log, /var/lib/cassandra/commitlog/CommitLog-2-1362197776687.log
INFO 23:14:55,773 Replaying /var/lib/cassandra/commitlog/CommitLog-2-1362197776686.log
INFO 23:14:55,928 Finished reading /var/lib/cassandra/commitlog/CommitLog-2-1362197776686.log
INFO 23:14:55,929 Replaying /var/lib/cassandra/commitlog/CommitLog-2-1362197776687.log
INFO 23:14:55,936 Finished reading /var/lib/cassandra/commitlog/CommitLog-2-1362197776687.log
INFO 23:14:55,945 Enqueuing flush of Memtable-local@26768625(52/52 serialized/live bytes, 2 ops)
INFO 23:14:55,978 Writing Memtable-local@26768625(52/52 serialized/live bytes, 2 ops)
INFO 23:14:56,016 Enqueuing flush of Memtable-schema_keyspaces@9678250(389/389 serialized/live bytes, 12 ops)
INFO 23:14:56,025 Enqueuing flush of Memtable-schema_columns@24029314(21317/21317 serialized/live bytes, 328 ops)
INFO 23:14:56,026 Enqueuing flush of Memtable-schema_columnfamilies@33105121(20754/20754 serialized/live bytes, 344 ops)
INFO 23:14:56,169 Completed flushing /var/lib/cassandra/data/system/local/system-local-ib-19-Data.db (84 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=142)
INFO 23:14:56,184 Writing Memtable-schema_keyspaces@9678250(389/389 serialized/live bytes, 12 ops)
INFO 23:14:56,354 Completed flushing /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ib-4-Data.db (261 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=142)
INFO 23:14:56,358 Writing Memtable-schema_columns@24029314(21317/21317 serialized/live bytes, 328 ops)
INFO 23:14:56,480 Completed flushing /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-4-Data.db (3757 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=142)
INFO 23:14:56,480 Writing Memtable-schema_columnfamilies@33105121(20754/20754 serialized/live bytes, 344 ops)
INFO 23:14:56,598 Completed flushing /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ib-4-Data.db (4426 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=142)
INFO 23:14:56,599 Log replay complete, 18 replayed mutations
INFO 23:14:56,874 Cassandra version: 1.2.1
INFO 23:14:56,877 Thrift API version: 19.35.0
INFO 23:14:56,877 CQL supported versions: 2.0.0,3.0.1 (default: 3.0.1)
INFO 23:14:56,934 Loading persisted ring state
INFO 23:14:56,939 Starting up server gossip
INFO 23:14:56,964 Enqueuing flush of Memtable-local@15217892(251/251 serialized/live bytes, 9 ops)
INFO 23:14:56,965 Writing Memtable-local@15217892(251/251 serialized/live bytes, 9 ops)
INFO 23:14:57,100 Completed flushing /var/lib/cassandra/data/system/local/system-local-ib-20-Data.db (239 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=48998)
INFO 23:14:57,116 Compacting [SSTableReader(path=’/var/lib/cassandra/data/system/local/system-local-ib-16-Data.db’), SSTableReader(path=’/var/lib/cassandra/data/system/local/system-local-ib-17-Data.db’), SSTableReader(path=’/var/lib/cassandra/data/system/local/system-local-ib-19-Data.db’), SSTableReader(path=’/var/lib/cassandra/data/system/local/system-local-ib-18-Data.db’), SSTableReader(path=’/var/lib/cassandra/data/system/local/system-local-ib-20-Data.db’)]
INFO 23:14:57,212 Starting Messaging Service on port 7000
INFO 23:14:57,259 Using saved token [6549584977218811532]
INFO 23:14:57,261 Enqueuing flush of Memtable-local@29628776(84/84 serialized/live bytes, 4 ops)
INFO 23:14:57,261 Writing Memtable-local@29628776(84/84 serialized/live bytes, 4 ops)
INFO 23:14:57,386 Completed flushing /var/lib/cassandra/data/system/local/system-local-ib-21-Data.db (120 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=49273)
INFO 23:14:57,392 Enqueuing flush of Memtable-local@31749884(50/50 serialized/live bytes, 2 ops)
INFO 23:14:57,393 Writing Memtable-local@31749884(50/50 serialized/live bytes, 2 ops)
INFO 23:14:57,511 Completed flushing /var/lib/cassandra/data/system/local/system-local-ib-22-Data.db (109 bytes) for commitlog position ReplayPosition(segmentId=1362851095627, position=49447)
INFO 23:14:57,543 Node localhost/127.0.0.1 state jump to normal
INFO 23:14:57,545 Startup completed! Now serving reads.
After successful startup, open a new terminal or a new tab,
and run ./cassandra-cli from your Cassandra home bin folder; you will see:
Connected to: “Test Cluster” on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.1
Type ‘help;’ or ‘?’ for help.
Type ‘quit;’ or ‘exit;’ to quit.
[default@unknown]
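From this prompt you can create a test keyspace and column family to confirm that the node accepts schema changes (a quick sketch; the names DemoKS and Users are just examples):

[default@unknown] create keyspace DemoKS
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};
[default@unknown] use DemoKS;
[default@DemoKS] create column family Users with comparator = UTF8Type;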
Your Cassandra setup and startup are done 🙂
Enjoy!
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
In this tutorial I will describe the required steps for setting up a distributed, multi-node Apache Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
In a previous tutorial, I described how to setup up a Hadoop single-node cluster on an Ubuntu box. The main goal of this tutorial is to get a more sophisticated Hadoop installation up and running, namely building a multi-node cluster using two Ubuntu boxes.
This tutorial has been tested with the following software versions:
- Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)
- Hadoop 1.0.3, released May 2012

Tutorial approach and structure
From two single-node clusters to a multi-node cluster – We will build a multi-node cluster using two Ubuntu boxes in this tutorial. In my humble opinion, the best way to do this for starters is to install, configure and test a “local” Hadoop setup for each of the two Ubuntu boxes, and in a second step to “merge” these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master (but also act as a slave with regard to data storage and processing), and the other box will become only a slave. It’s much easier to track down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on each machine.

Let’s get started!
Prerequisites
Configuring single-node clusters first
The tutorial approach outlined above means that you should now read my previous tutorial on how to set up a Hadoop single-node cluster and follow the steps described there to build a single-node Hadoop cluster on each of the two Ubuntu boxes. It is recommended that you use the same settings (e.g., installation locations and paths) on both machines, otherwise you might run into problems later when we migrate the two machines to the final multi-node cluster setup.
Just keep in mind when setting up the single-node clusters that we will later connect and “merge” the two machines, so pick reasonable network settings etc. now for a smooth transition later.
Done? Let’s continue then!
Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one Ubuntu box the “master” (which will also act as a slave) and the other Ubuntu box a “slave”.
Shutdown each single-node cluster with bin/stop-all.sh before continuing if you haven’t done so already.
Networking
This should hardly come as a surprise, but for the sake of completeness I have to point out that both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network with regard to hardware and software configuration, for example connecting both machines via a single hub or switch and configuring the network interfaces to use a common network such as 192.168.0.x/24.
To make it simple, we will assign the IP address 192.168.0.1 to the master machine and 192.168.0.2 to the slave machine. Update /etc/hosts on both machines with the following lines:
192.168.0.1    master
192.168.0.2    slave
SSH access
The hduser user on the master (aka hduser@master) must be able to connect a) to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and b) to the hduser user account on the slave (aka hduser@slave) via a password-less SSH login. If you followed my single-node cluster tutorial, you just have to add the hduser@master’s public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in this user’s $HOME/.ssh/authorized_keys). You can do this manually or use the following SSH command:
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
This command will prompt you for the login password for user hduser on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.
The final step is to test the SSH setup by connecting with user hduser from the master to the user account hduser on the slave. The step is also needed to save slave’s host key fingerprint to the hduser@master’s known_hosts file.
So, connecting from master to master…
hduser@master:~$ ssh master
(host key confirmation and login output omitted)
…and from master to slave.
hduser@master:~$ ssh slave
(host key confirmation and login output omitted)
Hadoop
Cluster Overview (aka the goal)
The next sections will describe how to configure one Ubuntu box as a master node and the other Ubuntu box as a slave node. The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.

The master node will run the “master” daemons for each layer: NameNode for the HDFS storage layer, and JobTracker for the MapReduce processing layer. Both machines will run the “slave” daemons: DataNode for the HDFS layer, and TaskTracker for MapReduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.
Masters vs. Slaves
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the actual “master nodes”. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or “worker nodes”.
Configuration
conf/masters (master only)
Despite its name, the conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster. In our case, this is just the master machine. The primary NameNode and the JobTracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively (the primary NameNode and the JobTracker will be started on the same machine if you run bin/start-all.sh).
Here are more details regarding the conf/masters file:
The secondary NameNode merges the fsimage and the edits log files periodically and keeps the edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file.
Again, the machine on which bin/start-dfs.sh is run will become the primary NameNode.
On master, update conf/masters so that it looks like this:
master
conf/slaves (master only)
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
On master, update conf/slaves so that it looks like this:
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.
master
slave
anotherslave01
anotherslave02
anotherslave03
conf/*-site.xml (all machines)
You must change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.
First, we have to change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.</description>
</property>
Second, we have to change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
Third, we change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of available slave nodes (more precisely, the number of DataNodes), you will start seeing a lot of “(Zero targets found, forbidden1.size=1)” type errors in the log files.
The default value of dfs.replication is 3. However, we have only two nodes available, so we set dfs.replication to 2.
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.</description>
</property>
Additional Settings
There are some other configuration options worth studying. The following information is taken from the Hadoop API Overview.
In file conf/mapred-site.xml:
- mapred.local.dir: Determines where temporary MapReduce data is written. It also may be a list of directories.
- mapred.map.tasks: As a rule of thumb, use 10x the number of slaves (i.e., number of TaskTrackers).
- mapred.reduce.tasks: As a rule of thumb, use num_tasktrackers * num_reduce_slots_per_tasktracker * 0.99. If num_tasktrackers is small (as in the case of this tutorial), use (num_tasktrackers - 1) * num_reduce_slots_per_tasktracker.
Formatting the HDFS filesystem via the NameNode
Before we start our new multi-node cluster, we must format Hadoop’s distributed filesystem (HDFS) via the NameNode. You need to do this the first time you set up an Hadoop cluster.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the NameNode), run the command
hduser@master:/usr/local/hadoop$ bin/hadoop namenode -format
(format output omitted)
Background: The HDFS name table is stored on the NameNode’s (here: master) local filesystem in the directory specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information for the DataNodes.
Starting the multi-node cluster
Starting the cluster is performed in two steps.
- We begin with starting the HDFS daemons: the NameNode daemon is started on master, and DataNode daemons are started on all slaves (here: master and slave).
- Then we start the MapReduce daemons: the JobTracker is started on master, and TaskTracker daemons are started on all slaves (here: master and slave).
HDFS daemons
Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.
In our case, we will run bin/start-dfs.sh on master:
hduser@master:/usr/local/hadoop$ bin/start-dfs.sh
(startup output omitted)
On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-datanode-slave.log.
Example output:
(DataNode log output omitted)
As you can see in slave’s output above, it will automatically format its storage directory (specified by the dfs.data.dir parameter) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master…
(jps output listing NameNode, SecondaryNameNode and DataNode)
(the process IDs don’t matter of course)
…and the following on slave.
(jps output listing DataNode)
MapReduce daemons
Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master:
hduser@master:/usr/local/hadoop$ bin/start-mapred.sh
(startup output omitted)
On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-tasktracker-slave.log. Example output:
(TaskTracker log output omitted)
At this point, the following Java processes should run on master…
(jps output listing NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker)
(the process IDs don’t matter of course)
…and the following on slave.
(jps output listing DataNode and TaskTracker)
Stopping the multi-node cluster
Like starting the cluster, stopping it is done in two steps. The workflow however is the opposite of starting.
- We begin with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped on all slaves (here: master and slave).
- Then we stop the HDFS daemons: the NameNode daemon is stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).
MapReduce daemons
Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
hduser@master:/usr/local/hadoop$ bin/stop-mapred.sh
(shutdown output omitted)
At this point, the following Java processes should run on master…
(jps output listing NameNode, SecondaryNameNode and DataNode)
…and the following on slave.
(jps output listing DataNode)
HDFS daemons
Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
hduser@master:/usr/local/hadoop$ bin/stop-dfs.sh
(shutdown output omitted)
(again, the output above might suggest that the NameNode was running and stopped on slave, but you can be assured that the NameNode ran on master)
At this point, the only following Java processes should run on master…
(jps output with no Hadoop daemons left running)
…and the following on slave.
(jps output with no Hadoop daemons left running)
Running a MapReduce job
Just follow the steps described in the section Running a MapReduce job of the single-node cluster tutorial.
I recommend however that you use a larger set of input data so that Hadoop will start several Map and Reduce tasks, and in particular, on both master and slave. After all this installation and configuration work, we want to see the job processed by all machines in the cluster, don’t we?
Here’s the example input data I have used for the multi-node cluster setup described in this tutorial. I added four more Project Gutenberg etexts to the initial three documents mentioned in the single-node cluster tutorial. All etexts should be in plain text us-ascii encoding.
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
- The Art of War by 6th cent. B.C. Sunzi
- The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
- The Devil’s Dictionary by Ambrose Bierce
- Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3
Download these etexts, copy them to HDFS, run the WordCount example MapReduce job on master, and retrieve the job result from HDFS to your local filesystem.
Here’s the example output on master… after executing the MapReduce job…
(console output of the WordCount job omitted)
…and the logging output on slave for its DataNode daemon…
(DataNode log output omitted)
…and on slave for its TaskTracker daemon.
(TaskTracker log output omitted)
If you want to inspect the job’s output data, you need to retrieve the job results from HDFS to your local file system (see the instructions in the single-node cluster tutorial).
Caveats
java.io.IOException: Incompatible namespaceIDs
If you observe the error “java.io.IOException: Incompatible namespaceIDs” in the logs of a DataNode (logs/hadoop-hduser-datanode-<hostname>.log), chances are you are affected by issue HDFS-107 (formerly known as HADOOP-1212).
The full error looked like this on my machines:
(full error log omitted; the key line is the java.io.IOException: Incompatible namespaceIDs message)
There are basically two solutions to fix this error as I will describe below.
Solution 1: Start from scratch
This step fixes the problem at the cost of erasing all existing data in the cluster’s HDFS file system.
- Stop the full cluster, i.e. both MapReduce and HDFS layers.
- Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /app/hadoop/tmp/dfs/data.
- Reformat the NameNode. WARNING: all HDFS data is lost during this process!
- Restart the cluster.
If deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be OK during the initial setup/testing), you might give the second approach a try.
Solution 2: Manually update the namespaceID of problematic DataNodes
Big thanks to Jared Stehler for the following suggestion. This workaround is “minimally invasive” as you only have to edit a single file on the problematic DataNodes:
- Stop the problematic DataNode(s).
- Edit the value of namespaceID in ${dfs.data.dir}/current/VERSION to match the corresponding value of the current NameNode in ${dfs.name.dir}/current/VERSION.
- Restart the fixed DataNode(s).
If you followed the instructions in my tutorials, the full paths of the relevant files are:
- NameNode: /app/hadoop/tmp/dfs/name/current/VERSION
- DataNode: /app/hadoop/tmp/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir in this tutorial to /app/hadoop/tmp).
If you wonder what the contents of VERSION look like, here's one of mine:
(VERSION file contents omitted; the line of interest is the one starting with namespaceID=)
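Putting Solution 2 together, here is a minimal sketch of the manual fix using the default paths above (the sed command is illustrative; you can just as well edit the file by hand):

# 1. Stop the problematic DataNode(s).
# 2. Read the namespaceID recorded by the NameNode:
grep namespaceID /app/hadoop/tmp/dfs/name/current/VERSION
# 3. Write that value into the DataNode's VERSION file
#    (replace NAMENODE_ID with the value printed above):
sed -i 's/^namespaceID=.*/namespaceID=NAMENODE_ID/' /app/hadoop/tmp/dfs/data/current/VERSION
# 4. Restart the fixed DataNode(s).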
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Posted by vlakhma in Hadoop, Uncategorized on April 16, 2014
In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
- Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)
- Hadoop 1.0.3, released May 2012

Prerequisites
Sun Java 6
Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using Java 1.6 (aka Java 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.
(commands for installing the Sun Java 6 JDK omitted)
The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).
After installation, make a quick check whether Sun’s JDK is correctly set up:
user@ubuntu:~$ java -version
(version output omitted)
Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
This will add the user hduser and the group hadoop to your local machine.
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several online guides available.
First, we have to generate an SSH key for the hduser user.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
(key generation output omitted)
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hduser@ubuntu:~$ ssh localhost
(host key confirmation and login output omitted)
If the SSH connect should fail, these general tips might help:
- Enable debugging with ssh -vvv localhost and investigate the error in detail.
- Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
Alternative
You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Hadoop
Installation
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
(Just to give you the idea, YMMV – personally, I create a symlink from hadoop-1.0.3 to hadoop.)
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
You can repeat this exercise also for other users who want to use Hadoop.
Excursus: Hadoop Distributed File System (HDFS)
Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
The following picture gives an overview of the most important HDFS components.

Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information of what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Note: If you are on a Mac with OS X 10.7 you can use the following line to set up JAVA_HOME in conf/hadoop-env.sh.
export JAVA_HOME=`/usr/libexec/java_home`
conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.
You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>
In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>
See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.
Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
(NameNode format output omitted)
Starting your single-node cluster
Run the command:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
The output will look like this:
(startup output omitted)
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.
(jps output listing NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker and Jps)
You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~$ sudo netstat -plten | grep java
(output listing the ports Hadoop is listening on omitted)
If there are any errors, examine the log files in the logs/ directory.
Stopping your single-node cluster
Run the command
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
to stop all the daemons running on your machine.
Example output:
(shutdown output omitted)
Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information of what happens behind the scenes is available at the Hadoop Wiki.
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as text files in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
(listing of the three downloaded ebook text files omitted)
Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
(directory listings omitted)
Run the MapReduce job
Now, we actually run the WordCount example job.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output. Depending on how the wildcard resolves in your setup, the command may fail with an exception like:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
        at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Example output of the previous command in the console:
(console output of the WordCount job omitted)
Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
(directory listing omitted)
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
(head output omitted)
Note that in this specific output the quote signs (“) enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
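As a small aside, if you want the word counts ordered by frequency you can sort the merged file locally (illustrative; the file name matches the getmerge target used above):

$ sort -k 2,2 -n -r /tmp/gutenberg-output/gutenberg-output | head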
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.

JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the ‘‘local machine’s’’ Hadoop log files (the machine on which the web UI is running on).
By default, it’s available at http://localhost:50030/.
TaskTracker Web Interface (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access to the ‘‘local machine’s’’ Hadoop log files.
By default, it’s available at http://localhost:50060/.
Java Api
Posted by vlakhma in Uncategorized on December 15, 2016
The Challenge: Develop a Spring Boot (https://spring.io/guides/gs/spring-boot/) server-side API that satisfies the following acceptance criteria:
- A Retail Manager (using a RESTful client, e.g. Chrome's Postman) can post a JSON payload to a shops endpoint containing details of the shop they want to add: shopName, shopAddress.number, shopAddress.postCode.
- The server-side API uses the Google Maps Geocoding API to retrieve the longitude and latitude for the provided shopAddress.
- The server-side API then stores details of the shop in an in-memory array of shops. Details should contain: shopName, shopAddress.number, shopAddress.postCode, shopLongitude, shopLatitude.
- A Customer (using a RESTful client, e.g. Chrome's Postman) can get a JSON payload from the shops endpoint containing details of the shop nearest to them: shopName, shopAddress.number, shopAddress.postCode, shopLongitude, shopLatitude.
- The server-side API accepts two URL parameters for the request: customerLongitude and customerLatitude.
- The server-side API determines the nearest shop using the Customer's longitude and latitude.
Our Guidance: We wouldn't ask anyone to do a coding puzzle that we haven't done ourselves, so the people looking at your code understand the problem we're asking to be solved. It took us around 4-6 hours of effort when doing this, so please use this as an indicator of the effort required. We are keen to see how much you think is enough, and how much would go into a Minimum Viable Product. As a guide, elegant and simple wins over feature-rich every time. Do you test-drive your code? This is something we value. We want to see that you are familiar with and able to create a build process for your code that means it is easy to build, test and run. Any indicator of design (DDD, or design patterns) would make us smile, as well as BDD and TDD approaches to dev. We currently use Gradle as our build runner. Please provide us with build, test and run instructions as well as the codebase, preferably via git (e.g. GitHub) showing multiple commits, so we can see the evolution of your codebase. Please document the server-side API so we understand how to use it. We also consider the extensibility of the code produced. Well-factored code should be relatively easily extended.
Some questions we'll probably ask: How would you make your server-side API production ready? How would you expand this solution, given a longer development period? How would you integrate this solution into an existing collection of
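For reference, here is a minimal sketch of the nearest-shop selection in plain Java (no Spring wiring or Google Maps call; the Shop class, its field names and the haversine helper are my own illustrative choices, not part of the challenge text):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ShopStore {

    public static class Shop {
        public String shopName;
        public String number;      // shopAddress.number
        public String postCode;    // shopAddress.postCode
        public double longitude;   // filled in from the geocoding lookup
        public double latitude;

        public Shop(String shopName, String number, String postCode,
                    double longitude, double latitude) {
            this.shopName = shopName;
            this.number = number;
            this.postCode = postCode;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    // In-memory "array" of shops, as the challenge asks for.
    private final List<Shop> shops = new ArrayList<>();

    public void addShop(Shop shop) {
        shops.add(shop);
    }

    // Great-circle (haversine) distance in kilometres between two points.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }

    // Nearest shop to the customer's position, or empty if no shops are stored.
    public Optional<Shop> nearestShop(double customerLatitude, double customerLongitude) {
        return shops.stream().min(Comparator.comparingDouble(
                (Shop s) -> distanceKm(customerLatitude, customerLongitude, s.latitude, s.longitude)));
    }
}

The REST layer would then simply delegate to addShop for the POST and to nearestShop for the GET with the customerLatitude/customerLongitude parameters.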




