Big Data Analytics (2017 Regulation) : Hadoop Distributed File System (HDFS)
MapReduce:
MapReduce is the data processing component of Hadoop.
The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to supply the business logic in the form MapReduce expects; the framework takes care of everything else.
Map Phase – This phase takes key-value pairs as input and produces key-value pairs as output. You can write custom business logic in this phase. The map phase processes the data and passes it to the next phase.
Reduce Phase – The MapReduce framework sorts the key-value pairs before handing them to this phase. This phase applies summary (aggregation) calculations to the key-value pairs.
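As a concrete illustration, here is a minimal word-count sketch of the two phases written against the Hadoop Java MapReduce API; the class names and the tokenization logic are chosen for this example, not taken from any particular distribution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every input line, emit a (word, 1) key-value pair.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // the custom business logic lives here
        }
    }
}

// Reduce phase: the framework has already sorted and grouped the pairs by key,
// so the reducer only applies the aggregation (here, a sum) for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}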
MapReduce workflow:
The client specifies the input file for the Map function; the framework splits it into tuples (key-value pairs).
The Map function derives a key and a value from the input file. The output of the map function is this set of key-value pairs.
The MapReduce framework sorts the key-value pairs produced by the map function.
The framework merges the tuples having the same key together.
The reducers get these merged key-value pairs as input.
The reducer applies aggregate functions to the key-value pairs.
The output from the reducer is written to HDFS.
The MapReduce framework takes care of failures: if a node goes down, it recovers the data from another node.
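A driver sketch that ties these steps together is shown below, assuming the WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths are passed as command-line arguments, while the split, the sort/merge of intermediate pairs, and the write of reducer output to HDFS are handled by the framework.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The client specifies the input file; the framework splits it for the mappers.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Map function that defines the key and value emitted for each input record.
        job.setMapperClass(WordCountMapper.class);

        // The framework sorts and merges pairs with the same key before they reach the reducer.
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The reducer output is written back to HDFS at this path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}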
Working of MapReduce:
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.
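For example, suppose a hypothetical input file contains the two lines "cat dog" and "dog cat cat". Splitting divides the file into input splits, one per mapper; mapping emits (cat,1), (dog,1) from the first split and (dog,1), (cat,1), (cat,1) from the second; shuffling sorts the pairs and groups them by key into (cat,[1,1,1]) and (dog,[1,1]); and reducing sums each group to produce the final output (cat,3) and (dog,2).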
YARN:
YARN is short for Yet Another Resource Negotiator. It is like the operating system of Hadoop, as it monitors and manages the cluster's resources.
YARN allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
Node Manager: It is YARN's per-node agent and takes care of an individual compute node in a Hadoop cluster. It monitors the resource usage (CPU, memory, etc.) of the local node and reports it to the Resource Manager.
Resource Manager: It runs on the master machine and is responsible for tracking the resources in the cluster and scheduling applications such as MapReduce jobs. It has two components: the Scheduler and the Application Manager.
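To make this concrete, a minimal sketch of a yarn-site.xml configuration is shown below; the hostname and resource values are illustrative assumptions, not recommended settings. The first property tells every Node Manager where to find the Resource Manager, and the other two declare how much memory and how many CPU cores the local Node Manager may offer to the cluster.

<configuration>
  <!-- Node Managers report to the Resource Manager running on the master machine (hypothetical hostname). -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
  </property>
  <!-- Memory (in MB) this Node Manager can allocate to containers on the local node. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <!-- Number of CPU vcores this Node Manager can allocate to containers. -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
</configuration>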