CHAPTER TWO: DATA SCIENCE
BY ABDULAZIZ OUMER
An Overview of Data Science
• Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and
systems to extract knowledge and insights from
structured, semi-structured and unstructured data.
• Data science is much more than simply analysing
data.
• It offers a range of roles and requires a range of
skills.
DATA vs. INFORMATION
DATA
• A representation of facts, concepts, or instructions
in a formalized manner, suitable for communication,
interpretation, or processing by humans or electronic
machines.
• Unprocessed facts and figures
• Represented with the help of characters such as
letters of the alphabet (A-Z, a-z), digits (0-9) or special
characters (+, -, /, *, =, etc.).
INFORMATION
• Is the processed data on which decisions and actions
are based.
• It is data that has been processed into a form that is
meaningful to the recipient.
• Is interpreted data; created from organized,
structured, and processed data in a particular
context.
Data Processing Cycle
• Is the re-structuring or re-ordering of
data by people or machines to
increase its usefulness and add
value for a particular purpose.
• Consists of the following basic steps:
-Input
-Processing
-Output
❖ INPUT − in this step, the input data is prepared in some
convenient form for processing
• The form will depend on the processing machine
❖ PROCESSING − in this step, the input data is changed to
produce data in a more useful form
❖ OUTPUT − at this stage, the result of the preceding
processing step is collected
• The particular form of the output data depends on the
use of the data; a minimal sketch of the whole cycle is given below
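To make the cycle concrete, here is a minimal sketch in Python; the sample values and the summing step are illustrative assumptions, not part of the original material.

```python
# Minimal sketch of the input -> processing -> output cycle.
# The raw data and the conversion applied are illustrative assumptions.

def read_input():
    # INPUT: prepare the raw facts and figures in a convenient form
    return ["12", "7", "25", "3"]

def process(raw_values):
    # PROCESSING: change the input data into a more useful form,
    # here by converting strings to integers and computing a total
    numbers = [int(v) for v in raw_values]
    return {"count": len(numbers), "total": sum(numbers)}

def write_output(result):
    # OUTPUT: collect and present the result of the processing step
    print(f"count={result['count']}, total={result['total']}")

if __name__ == "__main__":
    write_output(process(read_input()))
```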
Data types and their representation
❖ Data types from a computer programming perspective
• Integers (int): used to store whole numbers,
mathematically known as integers
• Booleans (bool): used to represent values restricted to one
of two states: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real
numbers
• Alphanumeric strings (string): used to store a
combination of characters and numbers
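A small hedged illustration of these types in Python follows (Python has no separate char type, so a one-character string stands in for it; the variable names and values are made up).

```python
# Illustrative values for the data types listed above (Python syntax).
age = 27                  # integer (int): whole number
is_enrolled = True        # boolean (bool): one of two values, True or False
grade = "A"               # character: Python has no char type, so a 1-char string is used
gpa = 3.75                # floating-point number (float): real number
student_id = "DS-2021"    # alphanumeric string (str): mix of characters and digits

print(type(age), type(is_enrolled), type(grade), type(gpa), type(student_id))
```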
❖Data types from Data Analytics perspective
❑Structured Data
• Is data that adheres to a pre-defined data
model and is therefore straightforward to
analyse
• Conforms to a tabular format
• Common examples of structured data are
Excel files or SQL databases.
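As a hedged illustration, such structured data can be held in a table; the sketch below uses pandas (an assumed library choice) with made-up column names and values.

```python
import pandas as pd

# Structured data: every record conforms to the same tabular schema.
students = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Abebe", "Sara", "Kebede"],
        "gpa": [3.4, 3.9, 2.8],
    }
)
print(students)
```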
❑ Semi-structured Data
• Is a form of structured data that does not conform with the
formal structure of data models
• Contains tags or other markers to separate semantic elements
and enforce hierarchies of records and fields within the data
• Also known as a self-describing structure
• Examples of semi-structured data include JSON and XML
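A minimal sketch of semi-structured data follows, using a small JSON document parsed with Python's standard json module; the field names and values are illustrative, and note how the second record omits a field without breaking the format.

```python
import json

# Semi-structured data: tags/keys describe each record,
# but records are not forced into one rigid table schema.
doc = """
{
  "id": 1,
  "name": "Sara",
  "courses": [
    {"code": "DS101", "grade": "A"},
    {"code": "CS102"}
  ]
}
"""
record = json.loads(doc)
print(record["name"], [c["code"] for c in record["courses"]])
```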
❑ Unstructured Data
• Is information that either does not have a predefined data
model or is not organized in a pre-defined manner.
• Is typically text-heavy but may contain data such as dates,
numbers, and facts as well
• Difficult to understand as it has irregularities and ambiguity
• Common examples of unstructured data include audio and video
files or NoSQL databases.
❑Metadata
• Metadata is data about data.
• Provides additional information about a specific set
of data.
• For example, in a set of photographs, metadata could
describe when and where the photos were taken.
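A small hedged illustration in Python: the dictionary below holds metadata describing a hypothetical photo file; the file name and values are made up.

```python
# Metadata: data about data. The photo file itself is the data;
# the dictionary describes when and where it was taken.
photo_metadata = {
    "file": "IMG_0042.jpg",          # hypothetical file name
    "taken_at": "2021-05-14 09:30",  # when the photo was taken
    "location": "Addis Ababa",       # where the photo was taken
    "camera": "Phone camera",
}
print(photo_metadata["taken_at"], photo_metadata["location"])
```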
DATA VALUE CHAIN
• Is introduced to describe the information flow
within a big data system as a series of steps needed
to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key
high-level activities:
-Data Acquisition
-Data Analysis
-Data Curation
-Data Storage
-Data Usage
Data Acquisition
❖Clustered Computing
• Clustering is a Machine Learning technique
that involves the grouping of data points
• Given a set of data points, we can use a
clustering algorithm to classify each data
point into a specific group (a minimal sketch is given below)
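A minimal sketch of that idea, assuming scikit-learn's KMeans and a handful of made-up two-dimensional points (neither is prescribed by the slides):

```python
from sklearn.cluster import KMeans
import numpy as np

# A few 2-D data points; the values are made up for illustration.
points = np.array([[1.0, 1.1], [0.9, 0.8], [1.2, 1.0],
                   [8.0, 8.2], [7.8, 8.1], [8.3, 7.9]])

# Group the points into two clusters and report each point's group.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster numbering may differ)
```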
Big data clustering software combines
the resources of many smaller machines,
seeking to provide a number of benefits:
• Resource Pooling
• High Availability
• Easy Scalability
Hadoop and its Ecosystem
• Hadoop is an open-source framework
intended to make interaction with big data
easier
• Allows the distributed processing of large
datasets across clusters of computers using
simple programming models
The four key characteristics of
Hadoop are:
• Economical: Its systems are highly economical as
ordinary computers can be used for data processing
• Reliable: It is reliable as it stores copies of the data on
different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and
vertically; a few extra nodes help in scaling up the
framework.
• Flexible: It is flexible, and you can store as much
structured and unstructured data as you need and
decide how to use it later.
Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
It is continuously growing to meet the needs of Big Data. It
comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLlib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• ZooKeeper: Managing the cluster
• Oozie: Job Scheduling
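To make the MapReduce idea concrete, here is a hedged word-count sketch in plain Python that imitates the map and reduce phases; it is a conceptual stand-in, not the actual Hadoop MapReduce API, and the input lines are made up.

```python
from collections import defaultdict

# Tiny stand-in for lines of a large input file.
lines = ["big data needs big storage", "hadoop processes big data"]

# MAP phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# SHUFFLE + REDUCE phase: group the pairs by word and sum the counts.
counts = defaultdict(int)
for word, count in mapped:
    counts[word] += count

print(dict(counts))  # e.g. {'big': 3, 'data': 2, ...}
```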
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• First stage of Big Data processing
• Data is ingested or transferred to Hadoop
from various sources such as relational
databases, systems, or local files
• Sqoop transfers data from RDBMS to HDFS,
whereas Flume transfers event data.
2. Processing the data in storage
• In the second stage, the data is stored and
processed
• The data is stored in the distributed file
system HDFS and in the NoSQL distributed database HBase
• Spark and MapReduce perform the data
processing, as sketched below.
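As a hedged sketch of this stage, the example below counts words with PySpark (Spark's Python API); the HDFS input and output paths are made-up placeholders.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session on the cluster.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# The HDFS path below is a hypothetical placeholder.
lines = spark.sparkContext.textFile("hdfs:///data/input/sample.txt")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

counts.saveAsTextFile("hdfs:///data/output/word_counts")  # hypothetical output path
spark.stop()
```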
3. Computing and analysing data
• The third stage is to analyse the data
• Here, the data is analysed by processing
frameworks such as Pig, Hive, and Impala
• Pig converts the data using map and
reduce operations and then analyses it.
• Hive is also based on map and reduce
programming and is most suitable for
structured data.
4. Visualizing the results
• The fourth stage is Access
• Performed by tools such as Hue and
Cloudera Search
• In this stage, the analysed data can
be accessed by users.
THANK YOU