CHAPTER 03: Big Data Technology Landscape


Agenda
NoSQL (Not Only SQL)
• The term NoSQL was coined by Carlo Strozzi in 1998. He
used it to name his open-source, lightweight database,
which did not have an SQL interface.
• The growth of NoSQL was driven by the needs of Web 2.0
companies such as Facebook, Google, and Amazon.com.
• Most NoSQL databases offer "eventual consistency", in
which database changes are propagated to all nodes
"eventually" (typically within milliseconds).
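Eventual consistency can be illustrated with a small simulation. This is a minimal sketch, not how any particular NoSQL product works: the class names `Replica` and `EventuallyConsistentStore` and the explicit `propagate()` step are assumptions made for illustration. A write is acknowledged by one node, and the other nodes see it only after propagation.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical name)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on one node and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # changes not yet propagated to every node

    def write(self, key, value):
        # The write is acknowledged after reaching the first replica only.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, including one that is still stale.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Deliver every queued change to the remaining replicas.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica.data[key] = value


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42", replica_index=2)  # read before propagation
store.propagate()
fresh = store.read("cart:42", replica_index=2)  # read after propagation
```

Here `stale` is `None` because the third node has not yet received the change, while `fresh` holds the written value once propagation has run.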
Types of databases
Column oriented database
A column store database can also be referred to as a:

• Column database
• Column family database
• Column oriented database
• Wide column store database
• Wide column store
• Columnar database
• Columnar store
Where is NoSQL used?
• Big data (SD + USD + SSD, i.e. structured, unstructured, and
semi-structured data with the 3Vs), such as IoT data.
• Real-time (time-based data) web applications.
• Log data storage and analysis.
• Social media data storage and analysis.

Example:
– Amazon's Shopping Cart (Amazon Dynamo is used)
What is NoSQL?
• NoSQL refers to a general class of storage engines that store
data in a non-relational format.  
• A NoSQL database is a non-relational (no tables), open-source,
distributed database used for dealing with big data (SSD,
USD and SD).
• NoSQL encompasses a wide range of technologies and
architectures that address scalability and big data problems.
• NoSQL databases can store non-relational data on a super
large scale (horizontal scaling), and can solve problems
regular databases can't handle: indexing the entire
Internet, predicting subscriber behavior, or targeting ads on
a platform such as Facebook, etc. 
What is NoSQL?
• NoSQL databases especially target large sets of
distributed data.
• NoSQL databases are sometimes referred to as cloud
databases, non-relational databases, or Big Data
databases.
• NoSQL databases have become the first alternative to
relational databases, with high performance, horizontal
scalability, availability, and fault tolerance being the
key deciding factors.
NoSQL features
1. Non-relational: NoSQL databases do not adhere to the
relational data model.
2. Distributed: Data is distributed across several nodes
in a cluster of low-cost commodity hardware.
3. No support for ACID properties (Atomicity,
Consistency, Isolation, and Durability): They follow
the trade-offs of the CAP theorem instead.
4. No fixed table schema: Support for flexible schema
i.e. no mandate for the data to strictly adhere to any
schema structure at the time of storage.
Types of NoSQL databases
1. Key-Value store
2. Document oriented store
3. Column oriented store
4. Graph oriented database
Key-Value store/databases
• These databases are designed for storing data in a
schema-less way.
• In a key-value store, all of the data consists of an
indexed key and a value, hence the name.
• It maintains a big hash table of keys and values.
• Examples:
– DynamoDB (Amazon), Azure Table Storage (ATS), Riak,
BerkeleyDB.
– Use cases: shopping carts, web user data analysis (Amazon
and LinkedIn).
Key-Value pair Example
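A key-value store can be sketched as a thin wrapper around a hash table. This is an illustrative sketch only: the `KeyValueStore` class and the key names are hypothetical and do not represent the API of DynamoDB, Riak, or any other product.

```python
class KeyValueStore:
    """Toy key-value store backed by a single hash table (a dict)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # The store does not inspect the value; it is an opaque blob.
        self._table[key] = value

    def get(self, key, default=None):
        # Lookup is by exact key only; there is no query on the value.
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
store.put("cart:alice", {"items": ["laptop", "mouse"]})
store.put("session:9f3", "logged-in")

alice_cart = store.get("cart:alice")
missing = store.get("cart:bob")
```

Values of different shapes (a dictionary and a string here) sit side by side because the store imposes no schema on them.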
Document oriented store
• A document database is a type of non-relational database
that is designed to store semi-structured data as
documents, typically in JSON or XML format.
• Documents are grouped into "collections," each of which is
similar to a table in a relational database.
• A document database is used for storing, retrieving, and
managing semi-structured data.
• The data model in a document database is not structured in
a table format of rows and columns. The schema can vary,
providing more flexibility for data modeling.
• Examples: MongoDB, MarkLogic, Apache CouchDB, etc.
Example
{
"FirstName": "Bob",
"Address": "5 Oak St.",
"Hobby": "sailing"
}
Applications of Document Databases
• Web analytics
• User preferences data
• Tweets
• Comments
• Sensor data from mobile devices
• Log files
• Real-time analytics
• Various other data from Internet of Things
• Product catalogs and so on.
Document Databases are used in:

• LinkedIn
• Dropbox Mailbox
Column oriented store
• Column store (also known as a wide-column store):
instead of storing data tables as rows of data,
these databases store them as sections of
columns of data.
• Wide-column stores offer very high performance
and a highly scalable architecture.
– Examples include Cassandra (Facebook), HBase,
BigTable (Google), and Hypertable.
Example
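The difference between row-wise and column-wise layout can be shown with plain Python structures. This is a simplified sketch with made-up sample data; real wide-column stores such as Cassandra or HBase use column families and on-disk formats that are not modeled here.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Asha", "city": "Pune", "age": 31},
    {"id": 2, "name": "Ben", "city": "Oslo", "age": 27},
    {"id": 3, "name": "Chen", "city": "Pune", "age": 44},
]


def to_columns(records):
    """Regroup the same data so that each field is stored contiguously."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field touches only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])
```

After the conversion, `columns["age"]` is the list `[31, 27, 44]`, so the average is computed without reading names or cities.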
Column oriented store
Column databases are used in:
• Google Earth, Maps
• The New York Times
• eBay
• Twitter
• Facebook
• Netflix (Streaming service: TV, News, Movies, etc.)
• Sensor feeds
• Web user actions analysis.
Google Earth
• Google Earth is a computer program that
renders a 3D representation of Earth based on
satellite imagery. The program maps the Earth
by superimposing satellite images, aerial
photography, and GIS data onto a 3D globe,
allowing users to see cities and landscapes from
various angles.
Graph oriented database
• A graph database, also called a graph-oriented
database, is a type of NoSQL database that uses
graph theory to store, map and query relationships.
• A graph database is an online database management
system with Create, Read, Update and Delete (CRUD)
operations working on a graph data model.
• Examples : Neo4j, Titan, Polyglot, HyperGraphDB,
InfiniteGraph
• Graph databases are used in social networks and, at
Walmart, for upsell, cross-sell, and recommendation.
Graph oriented database
• A graph database is essentially a collection of nodes and
edges.
• Each node represents an entity (such as a person or
business) and each edge represents a connection or
relationship between two nodes.
• Every node in a graph database is defined by a 
unique identifier, a set of outgoing edges and/or
incoming edges and a set of properties expressed as 
key/value pairs.
• Each edge is defined by a unique identifier, a starting-
place and/or ending-place node and a set of properties. 
Example
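The node and edge definitions above can be sketched with plain dictionaries. This is an illustrative sketch, not the data model of Neo4j or any other product; the identifiers (`n1`, `e1`), the `FRIEND` edge type, and the helper functions are assumptions made for the example.

```python
# Each node: a unique identifier mapped to properties (key/value pairs).
nodes = {
    "n1": {"name": "Alice"},
    "n2": {"name": "Bob"},
    "n3": {"name": "Carol"},
}

# Each edge: a unique identifier, a start node, an end node, and properties.
edges = [
    {"id": "e1", "start": "n1", "end": "n2", "properties": {"type": "FRIEND"}},
    {"id": "e2", "start": "n2", "end": "n3", "properties": {"type": "FRIEND"}},
]


def outgoing(node_id, edge_type):
    """Return the end nodes of outgoing edges of the given type."""
    return [
        edge["end"]
        for edge in edges
        if edge["start"] == node_id and edge["properties"]["type"] == edge_type
    ]


def friends_of_friends(node_id):
    """Follow FRIEND edges two hops out, skipping direct friends."""
    direct = set(outgoing(node_id, "FRIEND"))
    found = set()
    for friend in direct:
        for candidate in outgoing(friend, "FRIEND"):
            if candidate != node_id and candidate not in direct:
                found.add(candidate)
    return sorted(found)


suggestions = [nodes[node_id]["name"] for node_id in friends_of_friends("n1")]
```

Walking two hops from Alice reaches Carol through Bob, which is the kind of relationship query behind a simple recommendation.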
Why NoSQL
1. It has a scale-out architecture, i.e. it consists of
multiple low-cost computer servers and storage
components that are configured to create a
storage pool or to increase computing power
(horizontal scalability).
2. It can house large volumes of SD, USD and SSD.
3. Supports dynamic schema: Database allows
insertion of data without a pre-defined schema. It
also facilitates application changes in real time.
Why NoSQL
4. Auto-sharding: It automatically spreads data across
an arbitrary number of servers and balances the data
and query load across the available servers. It also
supports self-healing.
5. Replication: It offers good support for replication,
which in turn provides high availability, fault
tolerance, and disaster recovery.
6. Supports large numbers of concurrent users (tens of
thousands, perhaps millions).
Why NoSQL?
7. In recent times we can easily capture and access
data from various sources, like Facebook, Google,
Twitter, Amazon, etc.
8. User’s personal information, geographic location
data, user generated content, social graphs and
machine logging data are some of the examples
where data is increasing rapidly.
9. Relational databases are not well suited to processing
such large volumes of data.
Advantages of NoSQL
1. Support elastic scaling:
a) Cluster scale: It allows distribution of database
across 100+ nodes often in multiple data centers.
b) Performance scale: It sustains 100,000+
database reads and writes per second.
c) Data scale: It supports housing of 1 billion+
documents in the database.
2. Doesn’t require a pre-defined schema: Data need not
adhere to any pre-defined schema, so the schema stays
flexible.
Example (MongoDB)
1. {_id: 101, "Book Name": "Fundamentals of business
analytics", "Author Name": "Seema Acharya",
"Publisher": "Wiley India"}
2. {_id: 102, "Book name": "Big data and analytics"}

• These are stored as Key-Value pairs
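The flexible schema shown by the two documents can be sketched with a list of dictionaries. This is a sketch only, not the MongoDB driver API: `collection`, `insert`, and `find` are hypothetical helpers written for this example.

```python
collection = []  # stands in for a document collection


def insert(document):
    # No schema check: any set of fields is accepted.
    collection.append(document)


def find(criteria):
    """Return documents whose fields match every key/value in criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


insert({
    "_id": 101,
    "Book Name": "Fundamentals of business analytics",
    "Author Name": "Seema Acharya",
    "Publisher": "Wiley India",
})
insert({"_id": 102, "Book name": "Big data and analytics"})

wiley_books = find({"Publisher": "Wiley India"})
```

The second document has no author or publisher, yet it is stored in the same collection; a query on `Publisher` simply does not match it.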


Example (MongoDB)
{
ISBN: 9780992461225,
title: "JavaScript: Novice to Ninja",
author: "Darren Jones",
year: 2014,
format: "ebook",
price: 29.00,
description: "Learn JavaScript from scratch!",
rating: "5/5",
review: [
{ name: "A Reader", text: "The best JavaScript book I've ever read." },
{ name: "JS Expert", text: "Recommended to novice and expert developers alike." }
]
}
Note: MongoDB adds an ID (the _id field) automatically.
Object ID in MongoDB
Example (MongoDB)
Advantages of NoSQL
3. Cheap and easy to implement: Delivers the benefits of
scale, high availability, and fault tolerance at low cost.
4. Relaxes the data consistency requirements: In terms of
the CAP theorem, it adopts the AP (availability and
partition tolerance) combination.
5. Data can be replicated on multiple nodes and can be
distributed:
a) Sharding: Automatically spread data across an arbitrary
number of servers, without requiring the application to
even be aware of the composition of the server pool.
b) Replication: Multiple copies of the data across the
cluster.
Difference between Scale-up and Scale-out
NoSQL is Shared Nothing.
Sharding
• Sharding is a type of database partitioning that splits
very large databases into smaller, faster, more easily
managed parts called data shards.
• The word shard means a small part of a whole.
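One common way to decide which shard holds a record is to hash its key. The sketch below assumes simple modulo hashing over a fixed number of shards; the `ShardedStore` class is hypothetical, and real systems often use range-based or consistent hashing schemes instead.

```python
import hashlib


class ShardedStore:
    """Toy store that spreads keys across several in-memory shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash ensures the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # The caller never names a shard; routing happens inside the store.
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
for user_id in range(1000):
    store.put(f"user:{user_id}", {"id": user_id})

shard_sizes = [len(shard) for shard in store.shards]
```

The application calls only `put` and `get`, so it does not need to know how many shards exist or which one holds a given key.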
What we miss with NoSQL
• Does not support joins.
• Does not support GROUP BY.
• Does not support ACID properties.
• Does not have a standard SQL interface, but supports
product-specific query languages such as MQL (MongoDB)
and CQL (Cassandra).
• Does not support easy integration with other
applications that support SQL.
Use of NoSQL in industry
• NoSQL databases are used in varied industries to
support analysis for applications such as:
– Web user data analysis
– Log analysis
– Sensor feed analysis
– Making recommendations for upsell and cross-sell,
etc.
NoSQL Vendors
NewSQL
• It is a class of modern relational database management
systems that seek to provide the same scalable
performance of NoSQL systems for online transaction
processing (OLTP) while still maintaining the ACID
guarantees of a traditional database system. 
• NewSQL is a type of database language that incorporates
and builds on the concepts and principles of Structured
Query Language (SQL) and NoSQL languages.
• By combining the reliability of SQL with the speed and
performance of NoSQL, NewSQL provides improved
functionality and services. 
NewSQL
NewSQL Examples
Characteristics of NewSQL
• Based on shared nothing architecture.
• Supports SQL interface for application interaction.
• Supports ACID properties.
• Supports scale out/horizontal scalability.
SQL Vs. NoSQL Vs. NewSQL
Feature                      | SQL                           | NoSQL                          | NewSQL
Adherence to ACID properties | Yes                           | No                             | Yes
OLTP/OLAP                    | Yes                           | No                             | Yes
Schema rigidity              | Yes                           | No                             | Maybe
Adherence to data model      | Adherence to relational model |                                |
Data format flexibility      | No                            | Yes                            | Maybe
Scalability                  | Scale up (vertical scaling)   | Scale out (horizontal scaling) | Scale out
Distributed computing        | Yes                           | Yes                            | Yes
Community support            | Huge                          | Growing                        | Slowly growing
Hadoop
• Hadoop is an open-source project of the Apache
Software Foundation.
• It is a framework written in Java and developed by
Doug Cutting in 2005. He named it Hadoop after his
son’s toy elephant.
• Apache Hadoop is a collection of open-source software
utilities that facilitate using a network of many
computers to solve problems involving massive amounts
of data and computation. It provides a software
framework for distributed storage and processing of big
data using the MapReduce programming model.
Hadoop
• Hadoop is an open-source framework that allows you to
store and process big data in a distributed environment
across clusters of computers using simple programming
models.
Features of Hadoop
1. It is optimized to handle massive quantities of SD, SSD
and USD using commodity hardware (inexpensive
computers).
2. It has shared nothing architecture.
3. It replicates its data across multiple computers.
4. It is designed for high throughput rather than low latency
or fast response time, because handling massive quantities
of data is a batch operation.
5. It complements OLTP and OLAP.
6. It is not good for non-parallel tasks or for processing small
files (it works best with huge data files and data sets).
Key advantages of Hadoop
1. Stores data in its native format (HDFS).
2. Scalable: stores and distributes very large data sets across clusters.
3. Cost effective: reduced cost/TB of storage and
processing.
4. Resilient to failure : Fault tolerant due to data
replication on multiple nodes in the cluster.
5. Flexibility: supports any data (SD, SSD and USD) analysis
such as email conversations, social media data analysis,
click-stream data analysis, log analysis, data mining,
market campaign analysis, etc.
6. Fast: follows the move-code-to-data paradigm [process migration].
Versions of Hadoop

YARN -> Yet Another Resource Negotiator


Hadoop 1.0 parts
1. Data storage framework (HDFS):
– General purpose file system.
– It is schema-less and stores data files of any format.
– Stores data files as close to their original format as
possible, which provides the needed flexibility and agility.
2. Data processing framework:
– MapReduce model (Google’s popular model)
– Uses two functions: Map and Reduce functions to
process the data.
Data processing framework
• The “Mapper” takes in a set of key-value pairs and generates
intermediate data, which is another list of key-value pairs.
• The “Reducer” acts on the intermediate data and produces
the output data.
• The two functions work in isolation from one another,
thus enabling the processing to be highly distributed in a
highly-parallel, fault tolerant and scalable way.
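The Map and Reduce steps can be sketched in a single process with the classic word-count task. This is an illustration of the programming model only, not Hadoop's Java API: the function names and the in-memory `shuffle` step are assumptions, and a real cluster would run mappers and reducers on different machines.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce: combine all values for one key into a single output pair."""
    return key, sum(values)


def map_reduce(lines):
    intermediate = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = map_reduce(["big data big clusters", "data nodes"])
```

Each call to `mapper` depends only on its own line and each call to `reducer` only on its own key, which is why the two phases can be spread across many machines.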
Limitations of Hadoop 1.0
1. Requires expertise in MapReduce programming and
Java.
2. It supports only batch processing.
3. It is tightly coupled with MapReduce, and hence all
data for analysis has to be transformed into a MapReduce
structure.
Hadoop 2.0
• Apache Hadoop 2 (Hadoop 2.0) is the second iteration of
the Hadoop framework for distributed data processing. 
• Hadoop 2 adds support for running non-batch
applications through the introduction of YARN, a
redesigned cluster resource manager that eliminates
Hadoop's sole reliance on the MapReduce programming
model.
YARN
• The YARN framework is responsible for cluster resource
management.
• Cluster resource management means managing the
resources of the Hadoop clusters. Resources here means
memory, CPU, etc.
• YARN took over the task of cluster management from
MapReduce, so MapReduce is streamlined to perform
only data processing, which it does best.
• Any application capable of dividing itself into parallel
tasks is supported by YARN.
YARN
• YARN is the brain of Hadoop Ecosystem. It performs all
the processing activities by allocating resources and
scheduling tasks.
• YARN co-ordinates the allocation of subtasks of the
submitted applications, thus enhancing the flexibility,
scalability and efficiency of the applications.
Overview of Hadoop Ecosystem
1. HDFS: It stores different types of large data sets (i.e.
structured, unstructured, and semi-structured data) as close
to their original form as possible.
2. HBase (Hadoop’s database): HBase is an open-source, non-relational,
distributed database. In other words, it is a
NoSQL database.
3. Hive: Facebook created Hive for people who are fluent in
SQL, so that they feel at home while working in a Hadoop
Ecosystem. Hive is a data warehousing component that
reads, writes, and manages large data sets in a distributed
environment (Hadoop cluster) using an SQL-like interface.
(Hive + SQL = HQL)
Overview of Hadoop Ecosystem
4. Pig: It provides a platform for building data flows for ETL
(Extract, Transform and Load) and for processing and
analyzing huge data sets.
– It is also known as a data flow language.
– Pig has two parts: Pig Latin, the language, and the Pig
runtime, the execution environment. The relationship is
similar to that of Java and the JVM.
– 10 lines of Pig Latin = approx. 200 lines of MapReduce Java
code.
5. ZooKeeper: It is the coordinator of any Hadoop job that
combines various services in the Hadoop Ecosystem for
distributed applications.
Overview of Hadoop Ecosystem
6. Oozie: It is the clock and alarm (scheduler) service inside
the Hadoop Ecosystem.
– It schedules Hadoop jobs and binds them together as one
logical unit of work.
7. Mahout: It provides an environment for creating
scalable machine learning applications.
8. Flume/Chukwa: These help in storing unstructured
and semi-structured data into HDFS.
– They are data collection systems.
Overview of Hadoop Ecosystem
9. Sqoop: It imports and exports structured data between an
RDBMS or enterprise data warehouse and HDFS.
10. Ambari: It aims at making the Hadoop ecosystem more
manageable.
– It is a web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters.
Hadoop distributions
• Open source Apache project (free download).
• The core aspects of Hadoop include:
– Hadoop Common
– HDFS
– Hadoop YARN
– Hadoop MapReduce
Hadoop versus SQL

Hadoop                        | RDBMS
Scale out                     | Scale up
Key-value pair                | Record
MapReduce (functional style)  | SQL (declarative)
De-normalized                 | Normalized
All varieties of data         | Structured data
OLAP/batch/analytical queries | OLTP/real-time queries
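The contrast between the functional MapReduce style and declarative SQL can be seen by computing the same totals both ways. This is a small sketch using Python's built-in `sqlite3` module as a stand-in for an RDBMS; the `sales` data and all names are made up for illustration.

```python
import sqlite3
from functools import reduce

sales = [("north", 120), ("south", 80), ("north", 30), ("south", 20)]


# Functional style: spell out HOW to produce key-value pairs and fold them.
def combine(totals, pair):
    region, amount = pair
    totals[region] = totals.get(region, 0) + amount
    return totals


pairs = map(lambda record: (record[0], record[1]), sales)
functional_totals = reduce(combine, pairs, {})

# Declarative style: state WHAT result is wanted and let the engine plan it.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
connection.executemany("INSERT INTO sales VALUES (?, ?)", sales)
query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
declarative_totals = dict(connection.execute(query).fetchall())
connection.close()
```

Both approaches arrive at the same per-region totals; the difference lies in who decides the execution steps, the programmer or the query engine.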
Integrated Hadoop systems

• EMC Greenplum.
• Oracle Big data Appliance
• Microsoft Big data solutions
• IBM InfoSphere
• HP Big data solutions, etc.
Cloud-based Hadoop Solutions

• Amazon Web services (AWS)


• Google BigQuery
AWS

• Amazon Web Services offers reliable, scalable,
and inexpensive cloud computing services.
• Free to join, pay only for what you use.
• Amazon Web Services (AWS) is a subsidiary
of Amazon.com that provides on-demand cloud
computing platforms to individuals, companies
and governments, on a paid subscription basis
with a free-tier option available for 12 months.
Google BigQuery
• Google BigQuery is a cloud-based big data
analytics web service for processing very large
read-only data sets.
• BigQuery was designed for analyzing data on
the order of billions of rows, using a SQL-like
syntax. It runs on the Google Cloud Storage
infrastructure.
