Big Data Analytics (2017 Regulation): Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS):


 The Hadoop Distributed File System (HDFS) provides distributed storage for Hadoop. It is based on the Google File System (GFS, also called GoogleFS) and is designed to run on commodity hardware. HDFS has a master-slave topology.
 It is Java-based software that provides features such as scalability, high availability, fault tolerance and cost effectiveness. It provides robust distributed data storage for Hadoop, and many other software frameworks can be deployed over it.
 Big Data files are divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes, while the metadata is stored on the master (see the client sketch after these points).
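As an illustration of this master-slave storage model, here is a minimal client sketch using the HDFS Java API; the path /user/demo/sample.txt and the class name HdfsReadWrite are placeholders, not part of the original notes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt");   // placeholder path

            // Write: the NameNode decides which DataNodes receive the blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read: the client asks the NameNode for block locations,
            // then streams the data directly from the DataNodes.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }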

There are three major components of Hadoop HDFS: the NameNode, the DataNodes and the Secondary NameNode. The functions of the first two are described below.


The various functions of the NameNode are as follows:

 NameNode runs on the master machine.
 It is responsible for maintaining, monitoring and managing the DataNodes.
 NameNode assigns tasks to the slave nodes.
 It records the metadata of files, such as the location of blocks, file size, permissions, hierarchy etc. (see the metadata sketch after these points).
 NameNode captures all the changes to the metadata, such as deletion, creation and renaming of files, in edit logs.
 It regularly receives heartbeats and block reports from the DataNodes.
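As an illustration, the sketch below asks the NameNode, via the HDFS Java API, for the metadata it keeps about one file; the path and the class name ShowFileMetadata are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowFileMetadata {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // getFileStatus() is answered from the NameNode's metadata, without touching the DataNodes.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // placeholder
            System.out.println("size        = " + status.getLen());
            System.out.println("permissions = " + status.getPermission());
            System.out.println("owner       = " + status.getOwner());
            System.out.println("block size  = " + status.getBlockSize());
            System.out.println("replication = " + status.getReplication());
            fs.close();
        }
    }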

The various functions of the DataNode are as follows:

 DataNode runs on the slave machines.
 DataNode is responsible for storing the actual data in HDFS.
 DataNode also performs read and write operations as per the clients' requests.
 DataNode does the groundwork of creating, replicating and deleting blocks on the command of the NameNode (the sketch after these points shows which DataNodes hold a file's blocks).
 Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting its health.
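To see how a file's blocks are spread across the DataNodes, a client can ask for the block locations. Below is a minimal sketch under the same assumptions (placeholder path and class name).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // placeholder

            // Each BlockLocation lists the DataNodes that hold a replica of that block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }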

MapReduce:
 MapReduce is the data processing component of Hadoop.
 The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to supply the business logic in the form MapReduce expects; everything else is taken care of by the framework.

MapReduce works in two phases:

Map Phase – This phase takes input as key-value pairs and produces output as key-value pairs. You can write custom business logic in this phase. The map phase processes the data and passes it on to the next phase.

Reduce Phase – The MapReduce framework sorts the key-value pairs before giving the data to this phase. This phase applies summary (aggregation) type calculations to the key-value pairs.
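As an illustration of the two phases, here is a minimal word-count sketch using the Hadoop MapReduce Java API; the class names WordCountPhases, TokenizerMapper and IntSumReducer are illustrative choices, not part of the original notes.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountPhases {

        // Map phase: takes (offset, line) pairs and emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);          // emit (word, 1)
                    }
                }
            }
        }

        // Reduce phase: receives (word, [1, 1, ...]) after sorting/merging and aggregates.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // emit (word, total count)
            }
        }
    }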

The overall MapReduce data flow is as follows:
 The client specifies the input file for the Map function; the framework splits it into tuples (a driver sketch that wires this pipeline together follows after this list).
 The Map function defines the key and value from the input file. The output of the Map function is this key-value pair.
 The MapReduce framework sorts the key-value pairs from the Map function.
 The framework merges the tuples having the same key together.
 The reducers get these merged key-value pairs as input.
 The reducer applies aggregate functions on the key-value pairs.
 The output from the reducer is written to HDFS.
 The MapReduce framework takes care of failures: it recovers data from another node if one node goes down.
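A minimal driver that wires the mapper and reducer above into a job might look as follows; the class name WordCountDriver is another illustrative choice, and the input/output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // The client specifies the input file; the framework splits it for the mappers.
            FileInputFormat.addInputPath(job, new Path(args[0]));

            job.setMapperClass(WordCountPhases.TokenizerMapper.class);
            job.setReducerClass(WordCountPhases.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The reducer output is written back to HDFS.
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }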

Working of MapReduce:

The whole process goes through four phases of execution, namely splitting, mapping, shuffling and reducing.
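As a small illustrative trace (assuming the word-count job sketched earlier and two lines of made-up input), the four phases behave roughly like this:

    Input:      "car river car"  /  "deer river"
    Splitting:  split 1 = "car river car",  split 2 = "deer river"
    Mapping:    split 1 -> (car,1) (river,1) (car,1);  split 2 -> (deer,1) (river,1)
    Shuffling:  (car,[1,1])  (deer,[1])  (river,[1,1])
    Reducing:   (car,2)  (deer,1)  (river,2)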

YARN:
 YARN is short for Yet Another Resource Negotiator. It is like the operating system of Hadoop, as it monitors and manages the cluster's resources.
 YARN allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run on and process data stored in HDFS.

YARN has two main components:

Node Manager: It is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster.
It monitors the resource usage (CPU, memory etc.) of the local node and reports it to the Resource Manager.

Resource Manager: It runs on the master machine and is responsible for tracking the resources in the cluster and scheduling tasks such as MapReduce jobs. It has two components: the Scheduler and the ApplicationsManager.
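As a small illustration of this relationship, the sketch below uses the YARN client Java API to ask the Resource Manager for the per-node reports it collects from the Node Managers; the class name ListNodeManagers is an illustrative choice.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());   // picks up yarn-site.xml from the classpath
            yarnClient.start();

            // The Resource Manager keeps a report for every Node Manager in the cluster.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + "  used=" + node.getUsed()
                        + "  capacity=" + node.getCapability());
            }
            yarnClient.stop();
        }
    }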

The ApplicationMaster (one per application) has two functions:


 Negotiating resources from the Resource Manager.
 Working with the NodeManager to monitor and execute the sub-tasks.

The functions of the Resource Scheduler are as follows:


 It allocates resources to the various running applications, but it does not monitor the status of an application; so, in the event of a task failure, it does not restart the task.
 There is another concept called a Container. It is nothing but a fraction of a NodeManager's capacity, i.e. CPU, memory, disk, network etc. (the configuration sketch after this list shows how that capacity is set).
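As an illustration, a Node Manager's total capacity and the allowed container sizes are set in yarn-site.xml; the values below are examples only, not recommendations from the original notes.

    <configuration>
      <!-- Total capacity this Node Manager offers to containers -->
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
      </property>
      <!-- Smallest and largest container the Scheduler will allocate -->
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
      </property>
    </configuration>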
