CHAPTER 03: Big Data Technology Landscape
Agenda
NoSQL (Not Only SQL)
• The term NoSQL was coined by Carlo Strozzi in the year
1998. He used this term to name his Open Source, Light
Weight, Database which did not have an SQL interface.
• The NoSQL movement was driven by the needs of Web 2.0
companies such as Facebook, Google, and Amazon.com.
• Most NoSQL databases offer "eventual consistency", in
which database changes are propagated to all nodes
"eventually" (typically within milliseconds).
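The eventual-consistency idea above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `Replica` and `Cluster` classes, the `propagate` step, and the key `cart:42` are hypothetical names invented for illustration, and real databases propagate changes over a network with far more elaborate protocols.

```python
class Replica:
    """One node holding its own copy of the data (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.pending = []  # updates received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def apply_pending(self):
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()


class Cluster:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, key, value):
        # The first replica applies the write at once; the others only queue it.
        first, *others = self.replicas
        first.data[key] = value
        for replica in others:
            replica.receive(key, value)

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Stands in for the background replication that happens "eventually".
        for replica in self.replicas:
            replica.apply_pending()


cluster = Cluster(3)
cluster.write("cart:42", ["book"])
stale = cluster.read("cart:42", replica_index=2)  # change not propagated yet
cluster.propagate()
fresh = cluster.read("cart:42", replica_index=2)  # all replicas now agree
```

A read that hits a lagging replica may briefly return old data; once propagation completes, every replica returns the same value.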
Types of databases
Column oriented database
A column store database can also be referred to as a:
• Column database
• Column family database
• Column oriented database
• Wide column store database
• Wide column store
• Columnar database
• Columnar store
Where is NoSQL used?
• Big data: structured (SD), unstructured (USD), and semi-structured (SSD) data with the 3 Vs (volume, velocity, variety), including IoT data.
• Real-time (time-based data) web applications.
• Log data storage and analysis
• Social media data storage and analysis.
Example:
– Amazon's Shopping Cart (Amazon Dynamo is used)
What is NoSQL?
• NoSQL refers to a general class of storage engines that store
data in a non-relational format.
• A NoSQL database is a non-relational (no tables), open-source,
distributed database used for dealing with big data (SSD,
USD and SD).
• NoSQL encompasses a wide range of technologies and
architectures that address scalability and big data problems.
• NoSQL databases can store non-relational data on a super
large scale (horizontal scaling), and can solve problems
regular databases can't handle: indexing the entire
Internet, predicting subscriber behavior, or targeting ads on
a platform such as Facebook, etc.
What is NoSQL?
• NoSQL databases especially target large sets of
distributed data.
• NoSQL databases are sometimes referred to as cloud
databases, non-relational databases, or Big Data
databases.
• NoSQL databases have become the first alternative to
relational databases, with high performance, horizontal
scalability, availability, and fault tolerance being
the key deciding factors.
NoSQL features
1. NoSQL databases are non-relational: They do not adhere
to the relational data model.
2. Distributed: Data is distributed across several nodes
in a cluster of low-cost commodity hardware.
3. No full support for ACID properties (Atomicity,
Consistency, Isolation, and Durability): Their design
trade-offs are governed by the CAP theorem.
4. No fixed table schema: Support for flexible schema
i.e. no mandate for the data to strictly adhere to any
schema structure at the time of storage.
Types of NoSQL databases
1. Key-Value store
2. Document oriented store
3. Column oriented store
4. Graph oriented database
Key-Value store/databases
• These databases are designed for storing data in a
schema-less way.
• In a key-value store, all of the data consists of an
indexed key and a value, hence the name.
• It maintains a big hash table of keys and values.
• Examples:
– DynamoDB (Amazon), Azure Table Storage (ATS), Riak,
BerkeleyDB.
• Use cases:
– Shopping carts, web user data analysis (Amazon and
LinkedIn)
Key-Value pair Example
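A minimal sketch of the key-value model, assuming a single in-memory hash table. The `KeyValueStore` class and the key naming scheme (`user:1001:cart`) are hypothetical and do not reflect the API of any particular product.

```python
class KeyValueStore:
    """Toy key-value store backed by one hash table."""

    def __init__(self):
        self._table = {}  # a Python dict is a hash table

    def put(self, key, value):
        # The value is opaque to the store: no schema is checked.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
store.put("user:1001:cart", ["laptop", "mouse"])
store.put("user:1001:name", "Bob")

cart = store.get("user:1001:cart")
missing = store.get("user:9999:cart")  # unknown key falls back to the default
store.delete("user:1001:name")
```

All access goes through the key; the store never inspects what the value contains, which is what makes the model schema-less.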
Document oriented store
• A document database is a type of non-relational database
that is designed to store semi-structured data as
documents, typically in JSON or XML format.
• Documents are grouped into "collections," which are similar
to tables in a relational database.
• A document database is used for storing, retrieving, and
managing semi-structured data.
• The data model in a document database is not structured in
a table format of rows and columns. The schema can vary,
providing more flexibility for data modeling.
• Examples : MongoDB, MarkLogic, Apache CouchDB, etc
Example
{
"FirstName": "Bob",
"Address": "5 Oak St.",
"Hobby": "sailing"
}
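The sketch below shows how such documents might sit together in one collection even when their fields differ. It uses plain Python dictionaries rather than a real document database; the second document, the `people` list, and the `find` helper are hypothetical and added only for illustration.

```python
import json

# A "collection" modeled as a list of documents (dicts).
people = []

# The document from the example above.
people.append({"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
# A hypothetical second document with a different set of fields.
people.append({"FirstName": "Alice", "Email": "alice@example.com"})


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


sailors = find(people, Hobby="sailing")

# Documents are typically exchanged as JSON text.
serialized = json.dumps(people[0])
restored = json.loads(serialized)
```

No schema is declared up front: a document without a `Hobby` field is simply skipped by the query instead of causing an error.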
Applications of Document Databases
• Web analytics
• User preferences data
• Tweets
• Comments
• Sensor data from mobile devices
• Log files
• Real-time analytics
• Various other data from Internet of Things
• Product catalogs and so on.
Document Databases are used in:
• LinkedIn
• Dropbox Mailbox
Column oriented store
• Column stores (also known as wide-column stores)
are designed for storing data tables as sections
of columns of data, rather than as rows of data.
• Wide-column stores offer very high performance
and a highly scalable architecture.
– Examples include: Cassandra (Facebook), HBase,
BigTable (Google) and HyperTable.
Example
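As a simplified illustration of storing a table "as sections of columns", the sketch below converts a row layout into a column layout. The sample records and the `to_columns` helper are hypothetical, and every row is assumed to have the same fields. Real wide-column stores such as Cassandra and HBase group columns into column families and differ considerably in detail.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "name": "Bob", "age": 30},
    {"id": 2, "name": "Alice", "age": 25},
    {"id": 3, "name": "Carol", "age": 41},
]


def to_columns(records):
    """Regroup the records so that each field's values are stored together."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field touches only that column's values.
total_age = sum(columns["age"])
```

Reading a single field across many records is the access pattern that a column layout favors, since the other fields never need to be loaded.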
Column oriented store
Column databases are used in:
• Google Earth, Maps
• The New York Times
• eBay
• Twitter
• Facebook
• Netflix (Streaming service: TV, News, Movies, etc.)
• Sensor feeds
• Web user actions analysis.
Google Earth
• Google Earth is a computer program that
renders a 3D representation of Earth based on
satellite imagery. The program maps the Earth
by superimposing satellite images, aerial
photography, and GIS data onto a 3D globe,
allowing users to see cities and landscapes from
various angles.
Graph oriented database
• A graph database, also called a graph-oriented
database, is a type of NoSQL database that uses
graph theory to store, map and query relationships.
• A graph database is an online database management
system with Create, Read, Update and Delete (CRUD)
operations working on a graph data model.
• Examples : Neo4j, Titan, Polyglot, HyperGraphDB,
InfiniteGraph
• Graph databases are used in social networks and in retail
(e.g. Walmart) for upsell, cross-sell, and recommendations.
Graph oriented database
• A graph database is essentially a collection of nodes and
edges.
• Each node represents an entity (such as a person or
business) and each edge represents a connection or
relationship between two nodes.
• Every node in a graph database is defined by a
unique identifier, a set of outgoing edges and/or
incoming edges and a set of properties expressed as
key/value pairs.
• Each edge is defined by a unique identifier, a start
node, an end node, and a set of properties.
Example
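The node-and-edge model described above can be sketched with two dictionaries. The `Graph` class, the identifiers (`p1`, `e1`), and the property names are hypothetical; graph databases such as Neo4j expose their own query languages and storage formats.

```python
class Graph:
    """Toy property graph: nodes and edges each carry an id and properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties (key/value pairs)
        self.edges = {}  # edge id -> (start node id, end node id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, edge_id, start, end, **props):
        self.edges[edge_id] = (start, end, props)

    def outgoing(self, node_id, label=None):
        """Return ids of nodes reached by edges leaving node_id."""
        return [end for start, end, props in self.edges.values()
                if start == node_id
                and (label is None or props.get("label") == label)]


g = Graph()
g.add_node("p1", name="Alice", kind="person")
g.add_node("p2", name="Bob", kind="person")
g.add_node("b1", name="Acme", kind="business")
g.add_edge("e1", "p1", "p2", label="FRIEND_OF")
g.add_edge("e2", "p1", "b1", label="WORKS_AT")

friends = g.outgoing("p1", label="FRIEND_OF")
```

Following edges from a node is the core operation; recommendation and social-network queries are built from repeated traversals of this kind.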
Why NoSQL
1. It has a scale-out architecture, i.e. it consists of
multiple low-cost servers and storage components
that are configured to create a storage pool or to
increase computing power (horizontal scalability).
2. It can house large volumes of SD, USD and SSD.
3. Supports dynamic schema: Database allows
insertion of data without a pre-defined schema. It
also facilitates application changes in real time.
Why NoSQL
4. Auto-sharding: It automatically spreads data across
an arbitrary number of servers. It balances the data
and query load across the available servers. It also
supports self-healing.
5. Replication: It offers good support for replication,
which in turn supports high availability, fault
tolerance, and disaster recovery.
6. Supports large numbers of concurrent users (tens of
thousands, perhaps millions)
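The auto-sharding idea in item 4 can be sketched with simple hash-based placement. This is an assumption-laden illustration: the `shard_for` function and `ShardedStore` class are hypothetical, each shard is just an in-memory dictionary standing in for a server, and production systems usually rely on consistent hashing or range partitioning so that adding a server does not reshuffle most keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    def __init__(self, num_shards):
        # One dict per shard; each would be a separate server in practice.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for i in range(1000):
    store.put(f"user:{i}", i)

# Number of keys that landed on each shard.
sizes = [len(shard) for shard in store.shards]
```

Because the shard is computed from the key alone, any client can locate a record without consulting a central directory.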
Why NoSQL?
7. In recent times we can easily capture and access
data from various sources, like Facebook, Google,
Twitter, Amazon, etc.
8. User’s personal information, geographic location
data, user generated content, social graphs and
machine logging data are some of the examples
where data is increasing rapidly.
9. Traditional relational databases are not well suited
to processing such large volumes of data.
Advantages of NoSQL
1. Support elastic scaling:
a) Cluster scale: It allows distribution of database
across 100+ nodes often in multiple data centers.
b) Performance scale: It sustains 100,000+
database reads and writes per second.
c) Data scale: It supports housing of 1 billion+
documents in the database.
2. Doesn't require a pre-defined schema: Data need not
adhere to any pre-defined schema, so the schema
stays flexible.
Example (MongoDB)
1. {_id: 101, "Book Name": "Fundamentals of business
analytics", "Author Name": "Seema Acharya",
"Publisher": "Wiley India"}
2. {_id: 102, "Book Name": "Big data and analytics"}
Hadoop                          | RDBMS
Scale out                       | Scale up
Key-value pair                  | Record
MapReduce (functional style)    | SQL (declarative)
De-normalized                   | Normalized
All varieties of data           | Structured data
OLAP/batch/analytical queries   | OLTP/real-time queries
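To make the "MapReduce (functional style)" entry concrete, here is a word count written as map, shuffle, and reduce steps in plain Python. It runs in a single process and does not use Hadoop; the function names and the sample input are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


lines = ["big data big", "data store"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(pairs).items())
```

The equivalent declarative SQL would be a single `SELECT word, COUNT(*) ... GROUP BY word`, whereas here the grouping and aggregation are spelled out as functions over key-value pairs.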
Integrated Hadoop systems
• EMC Greenplum.
• Oracle Big data Appliance
• Microsoft Big data solutions
• IBM InfoSphere
• HP Big data solutions, etc.
Cloud-based Hadoop Solutions