Big Data Analytics (2017 Regulation): Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS):


 The Hadoop Distributed File System (HDFS) provides distributed storage for Hadoop. It is based on the Google File System (GFS, also called GoogleFS) and is designed to run on commodity hardware. HDFS has a master-slave topology.
 It is Java-based software that provides features such as scalability, high availability, fault tolerance and cost effectiveness. It provides robust distributed data storage for Hadoop, and many other software frameworks can be deployed over it.
 Big Data files are divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes, while the metadata is stored on the master (see the client sketch after these points).
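As an illustration of this master-slave storage model, here is a minimal client sketch using the HDFS Java API; the path /user/demo/sample.txt and the class name HdfsReadWrite are placeholders, not part of the original notes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt");   // placeholder path

            // Write: the NameNode decides which DataNodes receive the blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read: the client asks the NameNode for block locations,
            // then streams the data directly from the DataNodes.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }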

There are three major components of Hadoop HDFS: the NameNode, the DataNodes and the Secondary NameNode. The functions of the first two are described below.


The various functions of the NameNode are as follows:

 NameNode runs on the master machine.
 It is responsible for maintaining, monitoring and managing the DataNodes.
 NameNode assigns tasks to the slave nodes.
 It records the metadata of files, such as the location of blocks, file size, permissions, hierarchy etc. (see the metadata sketch after these points).
 NameNode captures all the changes to the metadata, such as deletion, creation and renaming of files, in edit logs.
 It regularly receives heartbeats and block reports from the DataNodes.
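As an illustration, the sketch below asks the NameNode, via the HDFS Java API, for the metadata it keeps about one file; the path and the class name ShowFileMetadata are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowFileMetadata {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // getFileStatus() is answered from the NameNode's metadata, without touching the DataNodes.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // placeholder
            System.out.println("size        = " + status.getLen());
            System.out.println("permissions = " + status.getPermission());
            System.out.println("owner       = " + status.getOwner());
            System.out.println("block size  = " + status.getBlockSize());
            System.out.println("replication = " + status.getReplication());
            fs.close();
        }
    }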

The various functions of the DataNode are as follows:

 DataNode runs on the slave machines.
 DataNode is responsible for storing the actual data in HDFS.
 DataNode also performs read and write operations as per the clients' requests.
 DataNode does the groundwork of creating, replicating and deleting blocks on the command of the NameNode (the sketch after these points shows which DataNodes hold a file's blocks).
 Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting its health.
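To see how a file's blocks are spread across the DataNodes, a client can ask for the block locations. Below is a minimal sketch under the same assumptions (placeholder path and class name).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // placeholder

            // Each BlockLocation lists the DataNodes that hold a replica of that block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }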

MapReduce:
 MapReduce is the data processing component of Hadoop.
 The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to supply the business logic in the form MapReduce expects; everything else is taken care of by the framework.

MapReduce works in two phases:

Map Phase – This phase takes input as key-value pairs and produces output as key-value pairs. You can write custom business logic in this phase. The map phase processes the data and passes it on to the next phase.

Reduce Phase – The MapReduce framework sorts the key-value pairs before giving the data to this phase. This phase applies summary (aggregation) type calculations to the key-value pairs.
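As an illustration of the two phases, here is a minimal word-count sketch using the Hadoop MapReduce Java API; the class names WordCountPhases, TokenizerMapper and IntSumReducer are illustrative choices, not part of the original notes.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountPhases {

        // Map phase: takes (offset, line) pairs and emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);          // emit (word, 1)
                    }
                }
            }
        }

        // Reduce phase: receives (word, [1, 1, ...]) after sorting/merging and aggregates.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // emit (word, total count)
            }
        }
    }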

The overall MapReduce data flow is as follows:
 The client specifies the input file for the Map function; the framework splits it into tuples (a driver sketch that wires this pipeline together follows after this list).
 The Map function defines the key and value from the input file. The output of the Map function is this key-value pair.
 The MapReduce framework sorts the key-value pairs from the Map function.
 The framework merges the tuples having the same key together.
 The reducers get these merged key-value pairs as input.
 The reducer applies aggregate functions on the key-value pairs.
 The output from the reducer is written to HDFS.
 The MapReduce framework takes care of failures: it recovers data from another node if one node goes down.
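A minimal driver that wires the mapper and reducer above into a job might look as follows; the class name WordCountDriver is another illustrative choice, and the input/output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // The client specifies the input file; the framework splits it for the mappers.
            FileInputFormat.addInputPath(job, new Path(args[0]));

            job.setMapperClass(WordCountPhases.TokenizerMapper.class);
            job.setReducerClass(WordCountPhases.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The reducer output is written back to HDFS.
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }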

Working of MapReduce:

The whole process goes through four phases of execution, namely splitting, mapping, shuffling and reducing.
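As a small illustrative trace (assuming the word-count job sketched earlier and two lines of made-up input), the four phases behave roughly like this:

    Input:      "car river car"  /  "deer river"
    Splitting:  split 1 = "car river car",  split 2 = "deer river"
    Mapping:    split 1 -> (car,1) (river,1) (car,1);  split 2 -> (deer,1) (river,1)
    Shuffling:  (car,[1,1])  (deer,[1])  (river,[1,1])
    Reducing:   (car,2)  (deer,1)  (river,2)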

YARN:
 YARN is short for Yet Another Resource Negotiator. It is like the operating system of Hadoop, as it monitors and manages the cluster's resources.
 YARN allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run on and process data stored in HDFS.

YARN has two main components:

Node Manager: It is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster.
It monitors the resource usage (CPU, memory etc.) of the local node and reports it to the Resource Manager.

Resource Manager: It runs on the master machine and is responsible for tracking the resources in the cluster and scheduling tasks such as MapReduce jobs. It has two components: the Scheduler and the ApplicationsManager.
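As a small illustration of this relationship, the sketch below uses the YARN client Java API to ask the Resource Manager for the per-node reports it collects from the Node Managers; the class name ListNodeManagers is an illustrative choice.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());   // picks up yarn-site.xml from the classpath
            yarnClient.start();

            // The Resource Manager keeps a report for every Node Manager in the cluster.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + "  used=" + node.getUsed()
                        + "  capacity=" + node.getCapability());
            }
            yarnClient.stop();
        }
    }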

The ApplicationMaster (one per application) has two functions:


 Negotiating resources from the Resource Manager.
 Working with the NodeManager to monitor and execute the sub-tasks.

The functions of the Resource Scheduler are as follows:


 It allocates resources to the various running applications, but it does not monitor the status of an application; so, in the event of a task failure, it does not restart the task.
 There is another concept called a Container. It is nothing but a fraction of a NodeManager's capacity, i.e. CPU, memory, disk, network etc. (the configuration sketch after this list shows how that capacity is set).
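As an illustration, a Node Manager's total capacity and the allowed container sizes are set in yarn-site.xml; the values below are examples only, not recommendations from the original notes.

    <configuration>
      <!-- Total capacity this Node Manager offers to containers -->
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
      </property>
      <!-- Smallest and largest container the Scheduler will allocate -->
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
      </property>
    </configuration>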
