Advanced Data Management Techniques


UNIT

01 Introduction to Data Management Techniques

Names of Sub-Units

Introduction to Data Management, Types of Data, Overview of Data Models, Introduction to OLTP,
Dimensional Modeling Life Cycle, New Trends in Computing, Motivation and Need for Advanced Data
Management Techniques.

Overview

The unit begins by explaining types of data. Further, the unit provides an overview of data models.
The unit also familiarises you with the importance of OLTP and its limitations. Towards the end, you
will be acquainted with dimensional modelling life cycle and motivation and need for advanced data
management techniques.

Learning Objectives

In this unit, you will learn to:


 Explain the types of data
 Discuss the concept of data models
 Describe the significance of OLTP in data management
 State the importance of dimensional modelling life cycle
 Explain the motivation and need for advanced data management

Learning Outcomes

At the end of this unit, you will be able to:


 Examine the types of data
 Evaluate the different types of data models
 Examine how data modelling principles are used
 Assess the need for advanced data management

1.1 INTRODUCTION
Data management is the process of absorbing, storing, organising and preserving an organisation’s
data. Effective data management is a critical component of establishing IT systems that operate business
applications and offer analytical information to enable corporate executives, business managers and
other end-users to drive operational decision-making and strategic planning.

The data management process consists of a number of tasks that work together to ensure that the data
in business systems is correct, available and accessible. The majority of the needed work is done by IT
and data management teams, but business users often engage in various portions of the process to
ensure that the data fulfils their needs.

Data management has become increasingly important as firms face a growing number of regulatory
compliance obligations. Furthermore, businesses are gathering ever-increasing amounts of data and
a broader range of data types, both of which are hallmarks of the big data platforms that many have
adopted. Without proper data management, such settings may become cumbersome and difficult to
navigate.

1.2 TYPES OF DATA


Everyone, be it individuals or organisations, is surrounded by data. Organisations have data related
to customers, shareholders, suppliers, employees, products and so on. On the other hand, individuals
have a lot of personal data in their computer systems. Data provides organisations with actionable
insights that can assist in refining existing campaigns, managing new goods, or performing new
experiments. We are approaching a digital era in which a large amount of data will be generated
on a daily basis. For example, a firm like Flipkart generates more than 2 TB of data every day.

Since data is so crucial in our lives, it is critical that it is correctly stored and processed without any
errors. When dealing with datasets, the type of data plays an essential role in determining which pre-
processing approach would work best for a certain set to obtain the best results. In other words, data
type is basically a trait associated with a piece of data that instructs a computer system on how to
interpret its value. By understanding data types, one can ensure that data is collected in the desired
format and the value of each property is as expected.


There are broadly two types of data, as shown in Figure 1:

Figure 1: Data Types (qualitative data is divided into nominal and ordinal; quantitative data is divided into discrete and continuous)


Let us discuss these data types as follows:
 Qualitative data: Qualitative or categorical data describes the object under consideration using a
finite set of discrete classes. It means that this type of data cannot be counted or measured easily
using numbers and is therefore divided into categories. The gender of a person (male, female, or
others) is a good example of this data type.
These are usually extracted from audio, images, or text medium. Another example can be of a
smartphone brand that provides information about the current rating, the colour of the phone,
category of the phone, and so on. There are two subcategories under qualitative data:
 Nominal: These are values that do not have a natural ordering. For example, it is impossible
to say that ‘Red’ is greater than ‘Blue.’ A person’s gender is another example, since we cannot rank
male, female, or others. Mobile phone categories, whether midrange, budget segment, or luxury
smartphone, are likewise of the nominal data type.
 Ordinal: These values have a natural ordering while remaining within their class of values.
When it comes to clothing brands, we can easily classify them according to their size tag in the
order of small, medium and big. The grading method used to mark candidates in an exam may
also be thought of as an ordinal data type, with A+ clearly superior to a B grade.
These categories assist us in determining which encoding approach is appropriate for which type of
data. Because machine learning models are mathematical in nature, qualitative values cannot be
handled directly by the models and must first be translated into numerical form. For nominal data
types, where there is no ordering among the categories, one-hot encoding may be used; for ordinal
data types, label encoding, which is a form of integer encoding, can be used.
 Quantitative data: This data type attempts to quantify things by taking into account numerical
values that make it countable in nature. The price of a smartphone, the discount offered, the number
of ratings on a product, the frequency of a smartphone’s processor, or the RAM of that particular
phone are all examples of quantitative data types.
The important thing to remember is that a feature can have an endless number of values. For
example, the price of a smartphone can range from ‘x’ amount to any value, and it can be further
subdivided into fractional amounts.


The two classifications that best define them are:


 Discrete: This category includes numerical values that are either integers or whole numbers.
The number of speakers in a phone, cameras, processing cores, and the number of SIM cards
supported are all instances of discrete data.
 Continuous: Fractional numbers are treated as continuous values. These might include the
processors’ working frequency, the phone’s Android version, Wi-Fi frequency, temperature, and
so on.

Another way of categorising data is the way it is arranged. Under this category, data is classified as
structured, semi-structured and unstructured. Let us understand them in detail.

1.2.1 Structured
Structured data is arranged into a prepared repository, usually a database, so that its pieces may be
addressed for more efficient processing and analysis. In a database, for example, each field is distinct,
and its data may be accessed alone or in conjunction with data from other fields, in a number of
combinations. The database’s strength is its capacity to make data comprehensive enough to offer
relevant information. SQL (Structured Query Language) is a database query language that allows a
database administrator to interact with the database.
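For instance, the sketch below (illustrative only, assuming a hypothetical CUSTOMERS table) shows how individual fields in a structured table can be queried on their own or in combination:
-- Hypothetical table used purely for illustration
CREATE TABLE CUSTOMERS (
    CUSTOMER_ID INTEGER PRIMARY KEY,
    NAME        VARCHAR(50),
    CITY        VARCHAR(30)
);
-- A single field accessed on its own
SELECT CITY FROM CUSTOMERS;
-- Fields combined and filtered in a single query
SELECT NAME, CITY FROM CUSTOMERS WHERE CITY = 'DELHI';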
Structured data is distinguished from unstructured and semi-structured data. The three types of data
may be thought of as being on a scale with unstructured data being the least formatted and structured
data being the most formatted. As data becomes more organised, it becomes more receptive to
processing.

1.2.2 Semi-structured
Semi-structured data is arranged into a specific repository, such as a database, but includes associated
information, such as metadata, that makes it easier to process than raw data. In other words, semi-
structured data falls somewhere in between, which means it is neither fully structured nor unstructured.
It is not structured in a way that allows for advanced access and analysis; nonetheless, it may include
information connected with it, such as metadata tagging that allows components contained to be
addressed. A Word document, for example, is commonly assumed to be unstructured data. However, metadata tags
can be added in the form of keywords that reflect the document’s content; as a result, the document can
be easily found if those keywords are searched for. The data is then semi-structured. Nonetheless, the
document lacks the database’s deep arrangement and hence falls short of being completely organised
data.

1.2.3 Unstructured Data


Unstructured data is not arranged into a predefined format, which makes it harder to access and process. In
actuality, only a small percentage of data is entirely unstructured. Even papers and photographs, which
are commonly regarded as unstructured data, are organised to some level. Unstructured data does not
adhere to traditional data models, making it challenging to store and manage in a traditional relational
database.
Although unstructured data has an underlying structure, it lacks a specified data model or schema. It
might be either textual or non-textual. It can be created by either humans or machines. Text is one of
the most popular forms of unstructured data. Unstructured text is created and collected in a variety of


formats, including Word documents, email messages, PowerPoint presentations, survey replies, contact
centre transcripts, and blog and social media postings. Images, audio, and video files are examples of
unstructured data. Machine data is another type of unstructured data that is rapidly developing in many
organisations. For example, log files from websites, servers, networks, and apps, particularly mobile
applications, provide data related to activity and performance. Furthermore, businesses are increasingly
capturing and analysing data from sensors on industrial equipment and other IoT-connected devices.

1.3 OVERVIEW OF DATA MODELS


A data model is a representation of the logical interrelationships and data flow between various
data items in the information domain. It also describes how data is stored and accessed. Data models
assist in describing what data is needed and in what format it should be utilised for various business
operations. A data model might be either tangible or abstract in nature. It is made up of the following
major components:
 Data types
 Data items
 Data sources
 Event sources
 Links

Data models include vital information for organisations because they define the relationships between
database tables, foreign keys, and the events involved. The three fundamental data model styles are as
follows:
 Conceptual data model: A conceptual data model is a logical representation of database ideas
and their connections. The goal of developing a conceptual data model is to define entities, their
properties, and their connections. There is very little information about the actual database structure
at this level of data modelling. A conceptual data model is often created by business stakeholders
and data architects.
 Logical data model: The logical data model is used to specify the structure of data pieces as well as
their connections. The elements of the conceptual data model are supplemented by the logical data
model. The benefit of adopting a logical data model is that it serves as a basis for the Physical model.
The modelling framework, on the other hand, remains general.
No primary or secondary keys are declared at this level of data modelling. You must validate and
adjust the relationship details that were defined earlier at this data modelling level.
 Physical data model: A physical data model specifies how the data model is implemented in a
database. It provides database abstraction and aids in the creation of the schema, owing to
the abundance of meta-data that a physical data model provides. By modelling database column
keys, constraints, indexes, triggers, and other RDBMS characteristics, the physical data model aids
in visualising database structure, as illustrated in the sketch that follows this list.
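The sketch referred to above is a minimal, illustrative physical data model expressed in SQL; the table names, the foreign key and the index are assumptions made only for this example, but they show the kind of RDBMS-level detail that belongs to this stage of modelling:
-- Physical model: concrete columns, keys, constraints and an index
CREATE TABLE DEPARTMENT (
    DNO   INTEGER PRIMARY KEY,
    DNAME VARCHAR(30) NOT NULL
);
CREATE TABLE EMPLOYEE (
    SSN    VARCHAR(20) PRIMARY KEY,
    FNAME  VARCHAR(20),
    SALARY INTEGER CHECK (SALARY >= 0),
    DNO    INTEGER REFERENCES DEPARTMENT (DNO)  -- foreign key to DEPARTMENT
);
-- Index added purely to speed up lookups by department
CREATE INDEX EMP_DNO_IDX ON EMPLOYEE (DNO);
A conceptual or logical model would capture only the entities EMPLOYEE and DEPARTMENT and their relationship; the data types, keys and index above appear only at the physical level.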

1.4 INTRODUCTION TO OLTP


Online Analytical Processing (OLAP) refers to a type of software that allows users to examine data from
various database systems at the same time. It is a tool that allows analysts to extract and examine
data from different perspectives. Analysts must regularly organise, aggregate and combine data. These
OLAP processes in data mining need a lot of resources. Data may be pre-calculated and pre-aggregated
using OLAP, making analysis quicker.


OLAP software is used to do multidimensional analysis on huge amounts of data from a data warehouse,
data mart, or other unified, centralised data repository at fast speeds. Most corporate data has
several dimensions, that is, multiple categories into which the data is divided for display, monitoring
or analysis. Sales data, for example, may include various aspects linked to place (region, nation, state/
province, shop), time (year, month, week, day), product (clothing, men/women/children, brand, kind),
and other factors. However, data sets in a data warehouse are kept in tables, each of which may organise
data into just two of these dimensions at a time. OLAP takes data from various relational data sets and
reorganises it into a multidimensional structure for quick processing and in-depth analysis.
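As a simple illustration of aggregating along such dimensions, the hedged sketch below assumes a hypothetical SALES table with REGION, SALE_YEAR and AMOUNT columns; an OLAP engine would typically pre-compute many such dimension combinations rather than evaluating each one on demand:
-- Total sales by the place and time dimensions
SELECT REGION, SALE_YEAR, SUM(AMOUNT) AS TOTAL_SALES
FROM SALES
GROUP BY REGION, SALE_YEAR;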

1.4.1 Need for OLTP


An Online Transaction Processing (OLTP) system logs and stores data related to transactions in a
database. Individual database entries consist of numerous fields or columns used in each transaction.
Banking and credit card activity are two examples. As OLTP databases are read, written, and
updated often, the emphasis in OLTP is on quick processing. If a transaction fails, built-in system logic
guarantees that the data is not corrupted.
For data mining, analytics and business intelligence initiatives, OLAP applies complicated queries to
massive volumes of historical data gathered from OLTP databases and other sources. The emphasis of
OLAP is on response time to these sophisticated queries. Each query comprises one or more columns of
data collected from a large number of rows. Year-over-year financial performance or marketing lead
generation patterns are two examples. OLAP databases and data warehouses enable analysts and
decision-makers to transform data into information by allowing them to employ bespoke reporting
tools. The failure of a query in OLAP does not stop or delay transaction processing for customers, but it
might delay or damage the accuracy of business intelligence insights.

1.4.2 Advantages of OLTP


The main advantage of OLTP is that it can handle many transaction requests simultaneously and can
reliably back up and resume operation if part of the system fails. The following are some other advantages
of the OLTP system:
 Provides an accurate revenue and cost prediction
 Provides timely update of all transactions
 Simplifies transactions on behalf of customers
 Supports larger databases
 Performs data partitioning for data manipulation
 Utilises consistency and concurrency to complete activities that assure its better availability

1.4.3 Limitations of OLTP


The following are the limitations of the OLTP system:
 If the OLTP system has hardware issues, online transactions are adversely impacted.
 OLTP systems enable several users to view and modify the same data at the same time, which
can occasionally lead to inconsistencies if concurrency is not handled carefully.
 If the server lags for a few seconds, a significant number of transactions may be impacted.
 In order to manage inventories, OLTP systems necessitate a large number of employees working in groups.


 OLTP lacks suitable means for sending merchandise to customers.


 OLTP increases the database’s vulnerability to hackers and attackers.
 In B2B transactions, there is a risk that both buyers and suppliers will miss out on the system’s
efficiency benefits.
 A server failure may result in the loss of substantial volumes of data from the database.

1.5 DIMENSIONAL MODELLING LIFE CYCLE


Dimensional Modelling (DM) is a data structure approach designed specifically for data storage in a
data warehouse. The goal of dimensional modelling is to improve the database for quicker data retrieval.
Ralph Kimball created the Dimensional Modelling idea, which comprises “fact” and “dimension” tables.
A data warehouse dimensional model is intended to read, summarise, and analyse numeric information
such as values, balances, counts, weights, and so on. Relational models, on the other hand, are ideal for
adding, updating, and deleting data in a real-time online transaction processing system.
These dimensional and relational models each have their own approach to storing data, which offers
distinct advantages. In the relational model, for example, normalisation and ER models eliminate data
redundancy. A dimensional model in a data warehouse, on the other hand, organises data in such a way
that it is easy to access information and produce reports.
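A minimal sketch of such a dimensional (star schema) design in SQL is shown below; the DIM_DATE, DIM_PRODUCT and FACT_SALES tables are assumptions made for illustration, with the fact table holding the numeric measures and foreign keys to the dimension tables:
-- Dimension tables: descriptive attributes used for slicing and filtering
CREATE TABLE DIM_DATE (
    DATE_KEY      INTEGER PRIMARY KEY,
    CALENDAR_DATE DATE,
    MONTH_NAME    VARCHAR(10),
    YEAR_NO       INTEGER
);
CREATE TABLE DIM_PRODUCT (
    PRODUCT_KEY  INTEGER PRIMARY KEY,
    PRODUCT_NAME VARCHAR(50),
    BRAND        VARCHAR(30)
);
-- Fact table: numeric measures plus foreign keys to the dimensions
CREATE TABLE FACT_SALES (
    DATE_KEY    INTEGER REFERENCES DIM_DATE (DATE_KEY),
    PRODUCT_KEY INTEGER REFERENCES DIM_PRODUCT (PRODUCT_KEY),
    QUANTITY    INTEGER,
    AMOUNT      DECIMAL(10, 2)
);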

1.6 NEW TRENDS IN COMPUTING


Businesses that acquire troves of fresh data (and likely save a large quantity of old data) have challenges
in organising and utilising it to their advantage. Much of this is due to constantly evolving and growing
technologies, which may quickly become burdensome for firms attempting to be more effective and
discrete in their data use. Keeping abreast with current trends and knowing the best solutions is quite
challenging for organisations. The following are current data management trends:
 The Cloud: Cloud computing continues to offer several benefits to organisations. The use of a multi-
cloud topology is gaining traction. As the use of Kubernetes, an open-source platform that automates the
deployment and management of cloud native applications, grows, apps may become incredibly
mobile. Due to this mobility, apps are going to the public cloud and being pulled back into a private
cloud in a two-way street.
Security is becoming more stringent as businesses such as VMware expand their capabilities for
securing an organisation’s infrastructure. Machine learning is being infused into practically every
area of IT architecture. Improved insights are proven to be a game-changer for businesses trying to
understand their infrastructures.
 Automated data handling: Vendors are integrating machine learning and AI engines to make
self-configuring and self-tuning processes more common. This significantly reduces manual data
handling in organisations. Many manual procedures are automated by these processes, allowing
people with fewer technical abilities to be more independent while accessing data.
 Infrastructure for machine learning: Machine Learning (ML), like AI, has a reputation for being too
complex for most enterprises, but the payback is practically priceless. Machine learning-powered
automated solutions may be quite useful in data management since they extract information and
combine it into easily digestible visual representations. ML solutions are now being included
in the majority of cloud-based SaaS products or supplied as a stand-alone cloud-based architecture
that is simple to set up and use.


 Graph databases: Patterns and correlations may be defined in graph databases using mathematical
structures that reflect pairwise relationships between items. Graph databases outperform typical
relational databases when dealing with huge datasets. Graph databases are made up of nodes or
data items that are linked together. These connections or commonalities endure between nodes.
This can be useful for merchants in identifying consumer behaviour and purchasing trends. Graph
databases behave differently than relational databases.
 Data protection: Data security is usually a top trend in data management, as organisations
emphasise data integrity and reduce risks of data breaches and loss. Secure data solutions, whether
on-premises or in hybrid multi-cloud settings, enable you to get improved visibility and insights to
analyse and address risks. They also implement real-time controls and conform to a plethora of
compliance mandates such as GDPR, PCI, HIPAA, and SOX. Security infrastructure based on SIEM,
SOAR, and SASE produces highly automated detection and response systems that provide security
professionals with the most recent breakthroughs in corporate defence.
 Artificial Intelligence (AI): Artificial intelligence is sometimes referred to as limited or weak AI since
it is meant to accomplish a small or particular purpose. Google search, picture recognition software,
personal assistants like Siri and Alexa, and self-driving cars are all instances of limited AI. AI reliably
and without tiredness executes frequent, high-volume computerised tasks. AI searches for structure
and regularities in data to learn a skill, transforming the algorithm into a classifier or predictor. As
a result, just as the algorithm can teach itself how to play chess, it can also educate itself on what
product to promote next on the Internet. When presented with fresh data, the models evolve over
time.
 Persistent Memory (PMEM): Businesses expect higher performance when apps based on machine
learning or artificial intelligence (AI) operate in real-time. Keeping more data near to the processor
and in a permanent state gives crucial benefits such as higher throughput and lower latency.
PMEM, which was created in collaboration between Intel and Micron, allows Intel-based servers to
extend their memory footprint up to 4 TB in capacity. This delivers a significant speed improvement
for in-memory databases or locally saved information. PMEM storage medium outperforms NVMe
devices in terms of speed.
 Natural Language Processing: The interaction between human language and technology is
known as natural language processing (NLP). Voice-activated technologies such as Alexa, Siri, and
Google Home are well-known instances of natural language processing. NLP acts as a front-end
to an Artificial Intelligence backend, converting speech queries into actionable results. As NLP
advances, so does the gap in voice search data. Businesses must get on board with voice searches,
which were estimated to account for 50% of all searches by 2020. NLP assists in the processing and
collection of voice-based data, and it may also function internally for individuals who require data
access via voice requests.
 Augmented analytics: Augmented analytics augments how individuals explore and analyse data in
analytics and BI systems by utilising enabling technologies like machine learning and AI to aid with
data preparation, insight production and insight explanation. Because the analysis is automated
and can be configured to run continually, the heavy lifting of manually sorting through enormous
volumes of complicated data (due to a lack of skills or time restrictions) is considerably minimised.
Augmented data preparation combines data from numerous sources considerably more quickly.
 Embedded analytics: Those who understand the realm of analytics are also aware of how time-
consuming and ineffective data translation can be. Embedded analytics is a digital workplace
feature that allows data analysis to take place within a user’s natural workflow rather than
switching to another application. Furthermore, embedded analytics is typically used to optimise
certain marketing campaigns, sales lead conversions, inventory demand planning and financial


budgeting. Embedded analytics may help an organisation to decrease the amount of time analytics
professionals spend transforming data into readable insights for everyone. That means they can
do what they do best: evaluate data to generate answers and plans. It even enables employees to
engage with data and create simple visuals for better decision-making.

1.7 MOTIVATION AND NEED FOR ADVANCED DATA MANAGEMENT TECHNIQUES


Advanced data management has long been at the heart of effective database and information systems.
Recent trends such as big data and cloud computing have increased the demand for sophisticated and
adaptable data storage and processing solutions.
The idea of advanced data management has emerged in recent decades, with a concentration on
data structures and query languages. It covers a wide range of data models and the fundamentals of
organising, processing, storing, and querying data using these models.
NoSQL and NewSQL databases are a new breed of databases that should be targeted. Advanced database
management systems also enable emerging data management trends driven by application
requirements, such as advanced analytics, stream processing systems, and main memory data
processing.
The majority of cutting-edge database research is aimed at:
 Unstructured and semi-structured data
 Online analytical processing (OLAP)
 Data and computational resources that are distributed
 Decentralised control over the spread of resources
 Real-time dynamic response to external events
 Incorporation into the physical reality in which the organisation exists

These capabilities are considered advanced in relation to DBMSs. Many complex database management
systems are available as cloud services, although others remain available as software solutions, the
most prominent of which is MongoDB. Navicat for MongoDB is a wonderful tool to use if you want to
understand more about it. It supports database objects including Collections, Views, Functions, Indexes,
GridFS, and MapReduce. An object designer is also available for creating, modifying and designing
database objects.

1.8 CONCLUSION

 Data management is the process of absorbing, storing, organising and preserving an organisation’s
data.
 The data management process consists of a number of tasks that work together to ensure that the
data in business systems is correct, available and accessible.
 When dealing with datasets, the type of data plays an essential role in determining which
preprocessing approach would work best for a certain set to obtain the best results.


 Structured data is arranged into a prepared repository, usually a database, so that its pieces may be
addressed for more efficient processing and analysis
 Semi-structured data is arranged into a specific repository, such as a database, but includes
associated information, such as metadata, that makes it easier to process than raw data.
 Unstructured data is not arranged into a predefined format, which makes it harder to access and process
 A data model is a representation of the logical interrelationships and data flow between various
data items in the information domain.
 Data models include vital information for organisations because they define the relationships
between database tables, foreign keys, and the events involved.
 Online Analytical Processing (OLAP) refers to a type of software that allows users to examine data
from various database systems at the same time.
 Dimensional Modelling (DM) is a data structure approach designed specifically for data storage in
a data warehouse.

1.9 GLOSSARY

 Database: A set of data that is organised to allow users to find and retrieve it quickly and easily
 DBMS: A computerised database management system
 OLAP: A powerful analysis tool used for forecasting, statistical computations and aggregations
 Data management: It is the process of absorbing, storing, organising and preserving an
organisation’s data.
 Structured data: It is arranged into a prepared repository, usually a database, so that its pieces may
be addressed for more efficient processing and analysis
 Data model: A representation of the logical interrelationships and data flow between various data
items in the information domain.
 Dimensional Modelling (DM): It is a data structure approach designed specifically for data storage
in a data warehouse.

1.10 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Explain quantitative data in detail.
2. Distinguish between structured, semi-structured and unstructured data.
3. Discuss the importance of OLAP.
4. What do you mean by artificial intelligence?
5. Discuss embedded analytics.


1.11 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

B. Hints for Essay Type Questions


1. Quantitative data attempts to quantify things by taking into account numerical values that make it
countable in nature. Refer to Section Types of Data
2. Structured data is arranged into a prepared repository; semi-structured data is arranged into a
specific repository, such as a database, but includes associated information; and unstructured data
is not arranged into a predefined format, which makes it harder to access and process. Refer to Section
Types of Data
3. Online Analytical Processing (OLAP) refers to a type of software that allows users to examine data
from various database systems at the same time. Refer to Section Introduction to OLTP
4. Artificial intelligence is sometimes referred to as limited or weak AI since it is meant to accomplish
a small or particular purpose. Refer to Section New Trends in Computing
5. Embedded analytics is a digital workplace feature that allows data analysis to take place within a
user’s natural workflow rather than switching to another application. Refer to Section New Trends
in Computing

1.12 POST-UNIT READING MATERIAL

 https://pdfs.semanticscholar.org/d127/c1bf1fb31fe33f054f3bd2dc4f44c0615987.pdf

1.13 TOPICS FOR DISCUSSION FORUMS

 Using various sources, find the applications of OLAP in the real-world and discuss it with your
classmates.

UNIT

02 Advanced SQL

Names of Sub-Units

Complex Retrieval Queries using the Group By Clause, Recursive Queries, Nested Queries, Specifying
Constraints as Assertions, Event Condition Action (ECA) model (Triggers) in SQL, Creating and
working with Views in SQL, Database Programming, Embedded SQL, Dynamic SQL, SQLJ, and Stored
Procedures in SQL.

Overview
The unit begins by discussing the concept of complex retrieval queries with the help of the Group By
clause. Next, the unit discusses the concept of recursive queries, nested queries, and specifying constraints
as assertions. Further, the unit discusses the Event Condition Action (ECA) model in SQL. Next, the unit
discusses the process to create views in SQL and database programming. The unit also discusses the
embedded SQL, dynamic SQL, and SQLJ. Towards the end, the unit discusses the stored procedures in
SQL.

Learning Objectives

In this unit, you will learn to:


 Describe the concept of complex retrieval queries with the help of the Group By clause
 Become familiar with the concept of recursive queries, nested queries, and specifying constraints as
assertions
 Define the Event Condition Action (ECA) model, views in SQL and database programming
 Become familiar with embedded SQL, dynamic SQL, and SQLJ
 Explain the concept of stored procedures in SQL

Learning Outcomes

At the end of this unit, you will be able to:


 Examine the concept of complex retrieval queries with the help of the Group By clause
 Evaluate recursive queries, nested queries, and constraints specified as assertions
 Analyse the Event Condition Action (ECA) model and views in SQL
 Assess the significance of embedded SQL and dynamic SQL
 Examine the concept of stored procedures in SQL

Pre-Unit Preparatory Material

 http://tinman.cs.gsu.edu/~raj/books/Oracle9-chapter-6.pdf

2.1 INTRODUCTION
Databases are becoming more interconnected, and this has produced a demand for a standard language
that can be utilized in a variety of computing contexts. SQL has proven to be the standard language
because it allows programmers or developers to learn a single set of instructions and use them to
generate, retrieve, edit, or transfer data regardless of whether they are working on a PC, workstation,
mini, or mainframe.
SQL is an abbreviation for Structured Query Language. SQL is used in a database management system
to build, delete, and edit databases and database objects, as well as to save, retrieve, and update data
in a database. SQL is a language that is used to create, access, and manipulate database management
systems. SQL is compatible with all current relational database management systems, including SQL
Server, Oracle, and MySQL.

2.2 COMPLEX RETRIEVAL QUERIES USING GROUP BY


The GROUP BY clause is a SQL statement that is used to group rows with similar values. In the SELECT
query, the GROUP BY clause is utilized. It is optionally used in combination with aggregate functions to
generate database summary reports.
GROUP BY queries are known as grouped queries since they only return a single row for each grouped
item.
The syntax of the SQL GROUP BY clause is as follows:
SELECT statements... GROUP BY column_name1[, column_name2,...] [HAVING
condition];

2.2.1 Grouping using Multiple Columns


Assume we want to acquire a list of movie category ids and the years in which they were released. The
SQL command for this query is as follows:
SELECT `category_id`, `year_released` FROM `movies`;


The output of the given query is as follows:

category_id year_released
1 2011
2 2008
NULL 2008
NULL 2010
8 2007
6 2007
6 2007
8 2005
NULL 2012
7 1920

The given result has several duplicates. Let us run the same query using GROUP BY:
SELECT `category_id`, `year_released` FROM `movies` GROUP BY `category_id`, `year_released`;
When we run the above script in MySQL workbench against the myflixdb database, we get the results
displayed below:

category_id year_released
NULL 2008
NULL 2010
NULL 2012
1 2011
2 2008
6 2007
7 1920
8 2005
8 2007

In our previous example, the GROUP BY clause uses both the category id and the year released to identify
unique data.
If a row has the same category id but a different year of release, it is considered unique. If the category
id and year released are the same for more than one row, it is deemed a duplication, and only one row
is displayed.


2.2.2 Grouping and Aggregate Functions


Assume we want to know the total number of males and females in our database. To do this, we may use
the script described below:
SELECT `gender`, COUNT(`membership_number`) FROM `members` GROUP BY `gender`;
When we run the above script in MySQL workbench against the myflixdb database, we get the following
results:

Gender COUNT(`membership_number`)
Female 3
Male 5

The results above are grouped by each unique gender value, and the number of rows in each group
is tallied using the COUNT aggregate function.
The SQL GROUP BY Clause is used to group rows that have the same values. The SQL SELECT statement
is used in conjunction with the GROUP BY Clause. The GROUP BY clause’s SELECT statement can only
contain column names, aggregation functions, constants, and expressions. The SQL Having Clause is
used to limit the results of the GROUP BY clause. The MYSQL GROUP BY Clause is used to aggregate data
from several records and return records that are defined by one or more columns.
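For example, the hedged sketch below restricts the earlier gender count to groups with more than three members; for the sample data shown above, only the Male group would be returned:
SELECT `gender`, COUNT(`membership_number`) AS member_count
FROM `members`
GROUP BY `gender`
HAVING COUNT(`membership_number`) > 3;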

2.3 RECURSIVE QUERIES


A recursive query is a method of querying data hierarchies such as an organizational structure, bill-of-
materials, or document hierarchy.
Recursion is usually defined by three steps:
1. Initialisation
2. Recursion, or the repeated application of the logic through the hierarchy
3. Termination
A recursive query, similarly, includes three execution phases:
1. Create an initial result set.
2. Recurse based on the existing result set.
3. Run a final query to return the final result set.

A recursive query may be specified as follows:


1. Using the WITH RECURSIVE clause before a query
2. Using the RECURSIVE clause in a CREATE VIEW statement to create a view

2.3.1 Using the WITH RECURSIVE Clause


Consider the employee table below:
CREATE TABLE employee
(employee_number INTEGER,
manager_employee_number INTEGER,
last_name CHAR (20),
first_name VARCHAR (30));


The created table depicts an organisational structure with a hierarchical structure of employee-
manager data. Figure 1 displays the employee table in a hierarchical order:

Figure 1: Employee Table in Hierarchical Order (employee 801 is at the root; employees 1003, 1019 and 1016 report to 801; employees 1008, 1006, 1014 and 1011 report to 1019; employees 1010, 1001 and 1004 report to 1003; and employees 1012, 1002 and 1015 report to 1004)


The recursive query below returns the employee numbers of all workers who report directly or indirectly
to the manager with employee_number 801:
WITH RECURSIVE temp_table (employee_number) AS
( SELECT root.employee_number
  FROM employee root
  WHERE root.manager_employee_number = 801
UNION ALL
  SELECT indirect.employee_number
  FROM temp_table direct, employee indirect
  WHERE direct.employee_number = indirect.manager_employee_number
)
SELECT * FROM temp_table ORDER BY employee_number;
In the example, temp_table is a temporary named result set that may be referenced in the recursive
statement's FROM clause.
The non-recursive, or seed, statement creates the first result set in temp_table, which contains the
workers who report directly to the manager with employee_number 801:
SELECT root.employee_number
FROM employee root
WHERE root.manager_employee_number = 801
Recursion occurs by connecting each employee in temp_table with workers who report to the employees
in temp_table. The results are added to temp_table using UNION ALL.
SELECT indirect.employee_number
FROM temp_table direct, employee indirect
WHERE direct.employee_number = indirect.manager_employee_number
When no new rows are added to temp_table, the recursion terminates.


The final query, which is not part of the recursive WITH clause, retrieves employee data from temp_table:
SELECT * FROM temp_table ORDER BY employee_number;
The recursive query yielded the following results:
employee_number

1001
1002
1003
1004
1006
1008
1010
1011
1012
1014
1015
1016
1019

2.3.2 Using the RECURSIVE Clause in a CREATE VIEW Statement to Create a View
Using the RECURSIVE clause to create a view is analogous to using the WITH RECURSIVE clause to
precede a query.
Take a look at the employee table from Using the WITH RECURSIVE Clause. The following statement
generates the hierarchy_801 view, which uses a recursive query to retrieve the employee numbers of all
employees who report to the manager with employee number 801 directly or indirectly:
CREATE RECURSIVE VIEW hierarchy_801 (employee_number) AS
( SELECT root.employee_number
  FROM employee root
  WHERE root.manager_employee_number = 801
UNION ALL
  SELECT indirect.employee_number
  FROM hierarchy_801 direct, employee indirect
  WHERE direct.employee_number = indirect.manager_employee_number
);
Use the SELECT command on the view to get the employee details:
SELECT * FROM hierarchy_801 ORDER BY employee_number;
The result of the given SQL query is as follows:
employee_number

1001
1002
1003
1004
1006
1008
1010
1011
1012


1014
1015
1016
1019

2.4 NESTED QUERIES


In a nested query, a query is written within another query. The inner query’s result is used in the
execution of the outer query. To understand nested queries, we will utilize the STUDENT, COURSE,
and STUDENT_COURSE tables.
Table 1 shows the STUDENT table:

Table 1: STUDENT Table

S_ID S_NAME S_ADDRESS S_PHONE S_AGE


S1 AMAN DELHI 9455123451 18
S2 RAMESH GURGAON 9652431543 18
S3 ANIKETT ROHTAK 9156253131 20
S4 ANNU DELHI 9156768971 18

Table 2 shows the COURSE table:

Table 2: COURSE Table

C_ID C_NAME
C1 DSA
C2 Programming
C3 DBMS

Table 3 shows the STUDENT_COURSE table:

Table 3: STUDENT_COURSE Table

S_ID C_ID
S1 C1
S1 C3
S2 C1
S3 C2
S4 C2
S4 C3

There are different kinds of nested queries, which are as follows:


 Nested independent queries: In independent nested queries, execution begins with
the innermost query and progresses to the outermost query. The inner query is executed
independently of the outer query, but the inner query’s result is used in the execution of the outer
query. In constructing independent nested queries, operators such as IN, NOT IN, ANY, ALL, and
so on are employed.


 IN: If we want to find out which S_IDs are enrolled in C_NAME ‘DSA’ or ‘DBMS’, we can use an
independent nested query and the IN operator. We can locate C_IDs for C_NAME ‘DSA’ or ‘DBMS’ in
the COURSE table, and we can utilize these C_IDs to get S_IDs from the STUDENT_COURSE table. The
steps to utilize these C_IDs to get S_IDs are as follows:
STEP 1: Determine C_ID for C_NAME = “DSA” or “DBMS”
Select C_ID from COURSE where C_NAME = 'DSA' or C_NAME = 'DBMS'
STEP 2: Using the C_ID from step 1 to obtain the S_ID
Select S_ID from STUDENT_COURSE where C_ID IN (SELECT C_ID from
COURSE where C_NAME = 'DSA' or C_NAME='DBMS');
The inner query will yield a set with members C1 and C3, while the outer query will return S_IDs
whose C_ID matches any member of the set (C1 and C3 in this case). As a result, it will return S1, S2,
and S4.
If we wish to find the names of STUDENTS who have enrolled in ‘DSA’ or ‘DBMS,’ we may do it as
follows:
Select S_NAME from STUDENT where S_ID IN (Select S_ID from STUDENT_
COURSE where C_ID IN (SELECT C_ID from COURSE where C_NAME='DSA'
or C_NAME='DBMS'));
 NOT IN: If we wish to locate the S_IDs of STUDENTS who have not enrolled in ‘DSA’ or ‘DBMS’, we may
do it as follows:
SELECT S_ID FROM STUDENT WHERE S_ID NOT IN (SELECT S_ID FROM STUDENT_COURSE
WHERE C_ID IN (SELECT C_ID FROM COURSE WHERE C_NAME='DSA' OR C_NAME='DBMS'));
The innermost query will return a set containing C1 and C3 elements. The second inner query will
yield S_IDs whose C_ID is equal to any member of the set (in this example, C1 and C3), which are S1,
S2, and S4. The outermost query will yield S_IDs that are not members of the set (S1, S2, and S4). As
a result, it will return S3.
 Correlated nested queries: In correlated nested queries, the result of the inner query is
determined by the row presently being processed in the outer query. For example, if we want to find out the
S_NAME of STUDENTS who are enrolled in C_ID ‘C1’, we may use a correlated nested query such as:
Select S_NAME from STUDENT S where exists (select * from
STUDENT_COURSE SC where S.S_ID=SC.S_ID and SC.C_ID='C1');
For each row of STUDENT S, it will look for rows in STUDENT_COURSE where S.S_ID = SC.S_ID and
SC.C_ID='C1'. If there is at least one entry in STUDENT_COURSE SC with C_ID='C1' for an S_ID
from STUDENT S, the inner query will return true, and the associated S_NAME will be returned as
output.

2.5 SPECIFYING CONSTRAINTS AS ASSERTIONS


In SQL, users can use the DDL’s CREATE ASSERTION statement to create more general constraints,
such as attribute level or table level check constraints, that do not fall into any of the categories. Each
assertion is given a constraint name and a condition, similar to the WHERE-clause in a SQL query.
For example, to express the constraint that “an employee’s remuneration should not be greater than
the salary of the manager of the department for which the employee works,” we may use SQL to create
the following statement:
CREATE ASSERTION SALARY_CONSTRAINT
CHECK (NOT EXISTS (SELECT *
FROM EMPLOYEE E, EMPLOYEE M, DEPARTMENT D
WHERE E.SALARY > M.SALARY AND E.DNO = D.DNUMBER
AND D.MGRSSN = M.SSN
));
The constraint name SALARY_CONSTRAINT is followed by the keyword CHECK, which is followed by a
condition in parentheses that must hold true on every database state. The constraint
name can later be used to refer to the constraint, as well as to edit or remove it. The DBMS is in charge
of ensuring that the condition is not violated. The constraint is violated if any tuples in the database
cause the condition of an ASSERTION statement to evaluate to FALSE. If the constraint is violated, the
operation is aborted by the DBMS.

2.6 EVENT CONDITION ACTION (ECA) MODEL (TRIGGERS) IN SQL


A trigger is a type of stored procedure that executes in reaction to particular table operations such as
insertion, deletion, or updating of data. It is often helpful to indicate the sort of action to be taken when
certain events occur and certain conditions are met. Consider the following scenario, in which the sales
manager must get mail whenever a high-priority order is received. In this situation, a trigger may be
created to identify the priority of the received order and send an email to the sales manager (a sketch
of such a trigger appears after the list below).
The trigger is based on the ECA model, which stands for Event-Condition-Action. The description of
Event, Condition and Actions is as follows:
 Event: Events cause the rules to be activated. These can be INSERT, UPDATE, or DELETE
commands in a database. Changing an employee’s manager is an event that can set off a trigger.
 Condition: The condition that decides whether the rule action is to be carried out. After the triggering
event, an optional condition may be examined. If no condition is given, the action is carried out
when the event happens. If a condition is supplied, it is first evaluated, and the rule action is only
done if it evaluates to true.
 Action: The action is often a set of SQL statements. It might alternatively be a database transaction
or external software that will be performed automatically. For example, when a specific event
happens, a stored process is executed.
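A minimal sketch of such a trigger in Oracle-style PL/SQL appears below; the ORDERS and ORDER_ALERTS tables and their columns are assumptions made only for this example, and the action records an alert row that a mailing job could pick up rather than sending the mail itself:
CREATE OR REPLACE TRIGGER notify_high_priority_order
AFTER INSERT ON orders                -- event: a new order is inserted
FOR EACH ROW
WHEN (NEW.priority = 'HIGH')          -- condition: only high-priority orders
BEGIN
  -- action: queue an alert for the sales manager
  INSERT INTO order_alerts (order_id, created_on)
  VALUES (:NEW.order_id, SYSDATE);
END;
/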

2.7 CREATING VIEWS IN SQL


A view is nothing more than a SQL statement with a name that is saved in the database. A view is
essentially a table in the form of a preset SQL query.
A view can contain all of the rows in a table or only a subset of the rows in a table. A view can be built
from one or more tables, depending on the SQL query used to build the view.
Views, a form of virtual table, enable users to achieve the following:
 Organize data in a way that consumers or groups of users find natural or intuitive.
 Restrict access to the data so that a user may only see and (sometimes) edit what they need.
 Compile data from various tables into summaries that may be used to make reports.

2.7.1 Creating Views


The CREATE VIEW statement is used to create database views. Views can be constructed using a single
table, several tables, or an existing view. A user must have the required system privileges for the specific
implementation in order to create a view.


The fundamental syntax for CREATE VIEW is as follows:


CREATE VIEW view_name AS
SELECT column1, column2.....
FROM table_name
WHERE [condition];
You can include numerous tables in your SELECT statement in the same manner that you would in a
standard SQL SELECT query.
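For example, a hedged sketch of a view over the STUDENT table introduced in Section 2.4, exposing only Delhi-based students:
CREATE VIEW delhi_students AS
SELECT S_ID, S_NAME, S_PHONE
FROM STUDENT
WHERE S_ADDRESS = 'DELHI';
-- The view is then queried like an ordinary table
SELECT * FROM delhi_students;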

2.8 DATABASE PROGRAMMING


Database programming is the process of creating and managing a database for an application.
Establishing links across multiple data sources and checking for mistakes and duplicate information
are examples of best practices. Another important task in database programming is retrieving data
instances from the database. The SQL programming language is used to retrieve, sometimes known
as query, specified data from a database. SQL (pronounced “sequel”) is an abbreviation for Structured
Query Language, which is used to create, update, retrieve, and remove data from a database.
Database programming is used to create and maintain the database of an app. This entails storing
and updating data to avoid duplicate record entries. Drawing linkages between tables is frequently
involved in larger applications with several tables of data. To some extent, all developers work with
databases. Learning database programming may help you organize your app’s data and keep it
functioning smoothly.
Database programming has a variety of applications, including:
 Creating a database from the ground up: This involves building all essential tables and table
connections. This guarantees that the full database is divided into smaller chunks.
 Making new records without duplicating them: By querying the database with SQL, we may look
for existing instances of data and update them rather than generating new ones (a sketch of this
pattern follows this list).
A database for your app’s users is one example. A user’s address might be a field. If the user relocates,
it is preferable to just update the entry with the new location rather than generating a new record
entirely. This keeps the database organised and representative of the app’s current state.
 Obtaining and erasing records: We’ve discussed how to properly create and update entries. Using
SQL, we can also obtain certain data points and show them to the user. We can also use SQL to
remove records when a user deletes an account or no longer offers a certain product.
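The sketch below illustrates the update-rather-than-duplicate pattern mentioned in the list above; the APP_USER table is an assumption made for this example, and the ON DUPLICATE KEY UPDATE form is MySQL-specific (other systems offer MERGE or similar statements):
-- Insert the user, or update the address if the user_id already exists
INSERT INTO app_user (user_id, full_name, address)
VALUES (42, 'Asha Rao', 'Bengaluru')
ON DUPLICATE KEY UPDATE address = VALUES(address);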

2.9 EMBEDDED SQL


SQL is the language we use to conduct database operations and transactions. When discussing industry-
level applications, we need adequately integrated systems that can extract data from a database and
show it to the user. In these circumstances, the embedded SQL comes to our aid.
We integrate SQL queries into high-level languages so that the logic element of our study may be readily
performed. The following are some notable instances of languages in which we incorporate SQL:
 C++
 Java
 Python etc.


Embedded SQL allows us to use databases as and when we need them. Several things must be taken care
of once the application we design is put into production.
We may simply use the database with the aid of query embedding without writing any hefty code. We
can use embedded SQL to construct APIs that can simply fetch and feed data as needed.
We need certain tools in each high-level language to use embedded SQL. In other circumstances, we
have built-in libraries that give us fundamental building blocks.
In certain circumstances, we must import or use other packages to complete the job at hand.

In Java, for example, we require a connection class. We start by creating a connection with the connection
class, and then we open the connection by providing the essential parameters to connect to the
database.

2.10 DYNAMIC SQL


Dynamic SQL is a programming approach that allows you to write SQL queries as they are running.
Dynamic SQL may be used to build general and versatile SQL queries.
The syntax for dynamic SQL is to write it out as a string, as follows:
'SELECT statement';
Run the stored procedure sp_executesql to execute a dynamic SQL query, as demonstrated below:
EXEC sp_executesql N'SELECT statement';
To utilise dynamic SQL as a Unicode string, use prefix N with sp_executesql.
The steps to use dynamic SQL are as follows (a consolidated sketch appears after the steps):
1. Declare two variables, @var1 for holding the name of the table and @var2 for holding the dynamic
SQL:
DECLARE
@var1 NVARCHAR(MAX),
@var2 NVARCHAR(MAX);
2. Set the value of the @var1 variable to table_name:
SET @var1 = N'table_name';
3. Create the dynamic SQL by adding the SELECT statement to the table name parameter:
SET @var2= N'SELECT *
FROM ' + @var1;
4. Run the sp_executesql stored procedure by using the @var2 parameter:
EXEC sp_executesql @var2;
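Putting the four steps together, a minimal consolidated sketch is shown below, assuming a table named customers exists; in production code the table name should also be sanitised (for example with QUOTENAME) to guard against SQL injection:
DECLARE
    @var1 NVARCHAR(MAX),
    @var2 NVARCHAR(MAX);
SET @var1 = N'customers';
SET @var2 = N'SELECT * FROM ' + @var1;   -- build the statement at run time
EXEC sp_executesql @var2;                -- execute the dynamic SQL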

2.11 SQLJ
SQLJ is a language standard for encoding and executing a static SQL statement as an embedded SQL
statement in Java.


Figure 2 depicts the path of UAP development when SQLJ is utilized:

Figure 2: Path of UAP Development (an SQLJ source file is processed by the SQLJ Translator into a Java source file and a profile; the Java compiler then compiles the Java source file into a class file)


SQLJ is made up of two parts: SQLJ Translator and SQLJ Runtime Library. The description of these
components is as follows:
 SQLJ Translator: SQLJ Translator examines an SQLJ source program and substitutes SQL statements
with normal Java instructions for SQLJ Runtime Library database access. SQLJ Translator creates
a Java source file as well as a profile that holds SQL data. The Java compiler is used by the user to
compile the Java source file into a class file (executable file).
 SQLJ Runtime Library: The SQLJ Runtime Library is used to run compiled class files. Depending on
the access interface, SQLJ Runtime Library may be utilized in one of two ways:
 Run the SQL statements after using the JDBC interface API (standard interface edition).
 Run the SQL statements by invoking an original interface (native interface edition) rather than
the JDBC interface.

2.12 STORED PROCEDURES IN SQL


Stored Procedures are programs that are written to conduct one or more DML operations on a database.
It is just a collection of SQL statements that receive input in the form of parameters, execute tasks, and
may or may not return a response.


The syntax for creating a procedure is as follows:


CREATE or REPLACE PROCEDURE name(parameters)
IS
variables;
BEGIN
//statements;
END;
The parameters are the most crucial component. Parameters are used to pass information to the
procedure. There are three sorts of parameters (a sketch of a complete procedure follows this list):
 IN: This is the procedure’s Default Parameter. It always gets the values sent to it by the caller
application.
 OUT: The values are always sent to the caller application when this option is set.
 IN-OUT: This parameter performs both roles. It receives values from and transmits them to
the caller application.
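The sketch referred to above is a minimal procedure in Oracle-style PL/SQL; it assumes the EMPLOYEE table created in the lab exercise below and returns an employee's salary through an OUT parameter:
CREATE OR REPLACE PROCEDURE get_salary (
    p_ssn    IN  EMPLOYEE.SSN%TYPE,      -- IN: value supplied by the caller
    p_salary OUT EMPLOYEE.SALARY%TYPE)   -- OUT: value returned to the caller
IS
BEGIN
    SELECT SALARY INTO p_salary
    FROM EMPLOYEE
    WHERE SSN = p_ssn;
END;
/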

2.13 LAB EXERCISE


1. For a given set of relation tables perform the following:
Creating Views (with and without check option), Dropping views, Selecting from a view
Ans. The SQL command for creating a table is as follows:
CREATE TABLE EMPLOYEE (
SSN VARCHAR2 (20) PRIMARY KEY,
FNAME VARCHAR2 (20),
LNAME VARCHAR2 (20),
ADDRESS VARCHAR2 (20),
SEX CHAR (1),
SALARY INTEGER,
SUPERSSN REFERENCES EMPLOYEE (SSN),
DNO REFERENCES DEPARTMENT (DNO));
The SQL command for inserting a value is as follows:
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('CE01','YASH','KUMAR','MEERUT','M', 570000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE01','DEEPAK','SHARMA','MEERUT','M', 800000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE02','PARAG','DEWAN','MEERUT','M', 200000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE03','SHYAM','GUPTA','DELHI','M', 800000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE04','PAWAN','SINGH','MANGALORE','M', 350000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE05','VARUN','MALYA','DELHI','M', 850000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE06','SEEMA','SHARMA','MEERUT','F', 600000);


INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('CC01','YOGITA','SINGH','MANGALORE','F', 750000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('CC02','SANTHOSH','VERMA','MANGALORE','M', 500000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE07','VEENA','DEVI','DELHI','M', 800000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('IT01','NAGESH','JAIN','MEERUT','M', 700000);
The SQL command for creating view (with and without check option) is as follows:
CREATE VIEW sales_staff AS
SELECT fname, ssn, dno
FROM employee
WHERE dno = 5
WITH CHECK OPTION CONSTRAINT sales_staff_cnst;
The output of the given SQL command is: View created.
The SQL command for selecting from a view is as follows:
select * from sales_staff;
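Before dropping the view, the effect of WITH CHECK OPTION can also be observed: an insert through the view that does not satisfy the view predicate (dno = 5) is rejected. A minimal sketch, where the inserted values are purely illustrative:
-- Rejected with a check-option violation, because dno = 7 does not satisfy dno = 5
INSERT INTO sales_staff (fname, ssn, dno) VALUES ('RAVI', 'SE10', 7);
-- A row with dno = 5 would be accepted, since it remains visible through the view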
The SQL command for dropping a view is as follows:
DROP VIEW sales_staff;
2. Create a row level trigger for the customer table that would fire for INSERT or UPDATE or DELETE
operations performed on the CUSTOMER table. This trigger should display the salary difference
between the old values and new values.
Ans. The SQL command for creating the trigger is as follows:
CREATE OR REPLACE TRIGGER display_salary_changes
BEFORE DELETE OR INSERT OR UPDATE ON customers
FOR EACH ROW
WHEN (NEW.ID > 0)
DECLARE
sal_diff number;
BEGIN
sal_diff := :NEW.salary - :OLD.salary;
dbms_output.put_line('Old salary: ' || :OLD.salary);
dbms_output.put_line('New salary: ' || :NEW.salary);
dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/
The following anonymous PL/SQL block updates the CUSTOMERS table so that the trigger fires and displays the salary difference:
DECLARE
total_rows number(2);
BEGIN
UPDATE customers
SET salary = salary + 5000;
IF sql%notfound THEN
dbms_output.put_line('no customers updated');
ELSIF sql%found THEN
total_rows := sql%rowcount;
dbms_output.put_line( total_rows || ' customers updated ');
END IF;
END;
/


2.14 CONCLUSION

 The database is becoming more interconnected, and this has produced a demand for standard
language that can be utilized in a variety of computer contexts.
 The GROUP BY clause is used in the SELECT statement to group rows with similar values.
 A recursive query is a method of querying data hierarchies such as an organizational structure, bill-
of-materials, or document hierarchy.
 A query is written within another query in nested queries.
 In SQL, users can use the DDL’s CREATE ASSERTION statement to specify more general constraints, beyond
attribute-level or table-level check constraints, that do not fall into any of the other categories.
 A trigger is a type of stored procedure that executes in reaction to particular table operations such as
insertion, deletion, or updating of data.
 The trigger is based on the ECA model, which stands for Event-Condition-Action.
 A view is nothing more than a SQL statement with a name that is saved in the database.
 The Create VIEW statement is used to create database views.
 Database programming is the process of creating and managing a database for an application.
 SQL stands for Structured Query Language. It is the language we use to conduct database operations
and transactions.
 Dynamic SQL is a programming approach that allows you to write SQL queries as they are running.
 SQLJ is a language standard for encoding and executing a static SQL statement as an embedded
SQL statement in Java.
 Stored Procedures are programs that are written to conduct one or more DML operations on a
database.

2.15 GLOSSARY

 SQL: A language that is used to create, access, and manipulate database management systems
 Recursive query: A method of querying data hierarchies such as an organizational structure, bill-
of-materials, or document hierarchy
 Nested query: A query is written within another query
 Trigger: A type of stored procedure that executes in reaction to particular table operations such as
insertion, deletion, or updating of data
 Database programming: The process of creating and managing a database for an application
 Dynamic SQL: A programming approach that allows you to write SQL queries as they are running
 Stored procedures: Programs that are written to conduct one or more DML operations on a database


2.16 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Explain the concept of the GROUP BY clause.
2. What do you understand by recursive query?
3. Define the term nested query with example.
4. Explain the concept of trigger in SQL. Also, explain the ECA model.
5. A view is nothing more than a SQL statement with a name that is saved in the database. Discuss.

2.17 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. The GROUP BY clause is used in the SELECT statement to group rows with similar values. It is optionally
used in combination with aggregate functions to generate database summary reports. Refer to Section
Complex Retrieval Queries using Group By
2. A recursive query is a method of querying data hierarchies such as an organizational structure, bill-
of-materials, or document hierarchy. Refer to Section Recursive Queries
3. A query is written within another query in nested queries. The inner query’s result is used in the
execution of the outer query. Refer to Section Nested Queries
4. A trigger is a type of stored procedure that executes in reaction to particular table operations such
as insertion, deletion, or updating of data. It is often helpful to indicate the sort of action to be done
when certain events occur and certain conditions are met.
5. A view is essentially a table in the form of a preset SQL query. A view can contain all of the rows in a
table or only a subset of the rows in a table. A view can be built from one or more tables, depending
on the SQL query used to build the view. Refer to Section Creating Views in SQL

2.18 POST-UNIT READING MATERIAL

 https://docs.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver15
 https://www.temok.com/blog/static-vs-dynamic-sql/

2.19 TOPICS FOR DISCUSSION FORUMS

 Discuss with your classmates about the concept of trigger, dynamic SQL, and embedded SQL.

UNIT

03 Advanced Transaction Processing & Recovery

Names of Sub-Units

Review of ACID Properties and Serializability, Multiversion Concurrency Control Techniques, Granularity of Data Items and Multiple Granularity Locking, Advanced Database Recovery Techniques

Overview

This unit begins by discussing the concept of advanced transaction processing & recovery and a review
of ACID properties and serializability. Next, the unit describes the multiversion concurrency control
techniques and granularity of data items. Further, the unit explains the multiple granularity locking.
Towards the end, the unit covers the advanced database recovery techniques.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of advanced transaction processing & recovery
 Explain the concept of review of ACID properties and serializability
 Describe the multi-version concurrency control techniques and the granularity of data items
 Explain the significance of multiple granularity locking
 Outline the advanced database recovery techniques

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of advanced transaction processing & recovery
 Assess the concept of review of ACID properties and serializability
 Analyse the importance of multi-version concurrency control techniques and the granularity of data items
 Determine the significance of multiple granularity locking
 Explore the advanced database recovery techniques

Pre-Unit Preparatory Material

 https://www.uio.no/studier/emner/matnat/ifi/INF5100/h04/undervisningsmateriale/
HANDOUTS/L05-transaction-models.pdf

3.1 INTRODUCTION
A transaction is a unit of database operation. It can be a read operation, a write operation, or a read/
write operation. It can be said that a transaction is a sequence of many actions grouped together and considered
as an atomic unit of work. A query written in Structured Query Language (SQL) may involve more than one
transaction that includes reading or writing data from various tables. Because database systems deal with
a large number of transactions every day, they are often referred to as transaction processing systems.
A transaction can involve four types of operations, which are described as follows:
 READ (A): Reads data from the database tables that are already stored in the memory, where A
signifies the record that is to be read
 WRITE (A): Writes the record A that may be the result of a transaction back to the main memory or
to the disk
 COMMIT: Makes sure that all the changes made by the transactions are made permanent
 ROLLBACK: Terminates the transaction in case of an unsuccessful transaction and undoes any
changes made in the database

Let’s take an example to clarify how a transaction takes place in a library management system when a
user is about to borrow a book B from the library online. Let’s assume that the account number of the
customer is A. In the following example, the code is written in pseudo-language:
BEGIN
READ A
IF A Exist Then
Look for the Availability of the Book B
IF B is available Then
Assign the book to user A
Set the borrow_period


WRITE A.borrow_period;
WRITE B.borrow_period;
COMMIT
ELSE
BEGIN
DISPLAY "Book is not available,"
ROLLBACK
END
ELSE
DISPLAY "Account Does Not Exist."

END
The preceding example uses the aforementioned four operations to show how they take place in a
transaction. The example shows how a user can borrow a book online after checking the availability of the
book. In this case, the user database as well as the database maintained for the books is updated. After the
data has been updated successfully, the database makes the changes permanent by executing the COMMIT
operation. Otherwise, in case of an unsuccessful transaction, it undoes the changes by executing the
ROLLBACK operation.

3.2 REVIEW OF ACID PROPERTIES AND SERIALIZABILITY


All transactions in a database follow certain properties, which help the database remain correct and
consistent. These properties are called the Atomicity-Consistency-Isolation-Durability properties or, in short,
the ACID properties. These properties are defined in detail as follows, and a short SQL sketch after the list
illustrates them:
 Atomicity: Ensures that the whole unit of the transaction should be performed at a time. It can
be said that a transaction should be done completely or not at all. If the example of pseudo code
is considered, both read and write operations have been performed on the data. Therefore, the
transaction needs to be performed completely; otherwise, the whole transaction should be rolled
back. This concept of atomicity imposed on every transaction helps maintain the log of every
transaction.
 Consistency: Means if the database is considered on which a transaction is about to be performed is
in State_1, then after the successful completion of the transaction, the state of the database should
be moved to State_2. This defines the consistency property, which ensures that the transaction
commit operation always transforms a database from one consistent state to another consistent
state. When a transaction fails, this property ensures that the database should return to its previous
consistent state.
 Isolation: Refers to a very important property since it helps to execute transactions concurrently without
errors. This property ensures that the changes made by a transaction are not visible to other transactions
until the transaction is committed. Isolation guarantees that the concurrent execution of transactions
on the same database and the same table does not affect the other transactions. For example, in
the library management system, if user A places a request to borrow book B and, at the same time, the
administrator places a request that changes the book name B to B1, then this change (whether it fails or
commits) should not affect the first transaction done by user A. Only the state that has been read by
transaction 1 should determine the result of that transaction.
 Durability: Confirms that the changes made by the successful execution should be made permanent.
Thus, the changes should never be lost due to any kind of failure. It can be said that transaction is
a basic unit of recovery. In a database, a log has been maintained that keeps the record of every
committed transaction.
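A minimal SQL sketch of these properties, assuming a hypothetical ACCOUNTS table with columns ACC_NO and BALANCE: the two updates of a funds transfer form one atomic unit, COMMIT makes the result durable, and ROLLBACK undoes the partial work if anything fails in between.
-- Transfer 250 from account 'A' to account 'B' as a single atomic unit
UPDATE ACCOUNTS SET BALANCE = BALANCE - 250 WHERE ACC_NO = 'A';
UPDATE ACCOUNTS SET BALANCE = BALANCE + 250 WHERE ACC_NO = 'B';
-- If both updates succeed, make the change permanent (durability)
COMMIT;
-- If an error occurs between the two updates, undo the partial transfer instead (atomicity)
-- ROLLBACK;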


3.2.1 Serializability
Serializability theory in DBMS deals with maintaining the correctness of concurrent execution. When two
transactions are executed concurrently, there is no straightforward way to determine whether any problem has
occurred or not. Serializability provides a means to schedule the transactions so that these two transactions
never conflict with each other. To know more about the rule of serializability, it is required to know about
a serial schedule, from which the name is derived, and also about the equivalent schedule.

Serial Schedule
A serial schedule is a type of schedule in which no transactions are ever interleaved with each other, i.e., only
after the complete execution of one transaction does another transaction begin its execution. Table 1 presents an
example of a serial schedule:

Table 1: Example of Serial Schedule

Schedule 1: T2 followed by T1
  T2: READ A
  T2: A := A – 250
  T2: WRITE A
  T2: READ B
  T2: B := B + 250
  T2: WRITE B
  T1: READ A
  T1: READ B
  T1: SHOW A + B

Schedule 2: T1 followed by T2
  T1: READ A
  T1: READ B
  T1: SHOW A + B
  T2: READ A
  T2: A := A – 250
  T2: WRITE A
  T2: READ B
  T2: B := B + 250
  T2: WRITE B

Equivalent Schedule
Two schedules are equivalent if the execution of one schedule leaves the database in a state that is the same
as the state obtained after the execution of the other schedule. Table 2 presents an example of equivalent
schedules:

Table 2: Example of Equivalent Schedule

Schedule 1 (serial: T1 followed by T2)
  T1: READ A
  T1: A := A – 250
  T1: WRITE A
  T1: READ B
  T1: B := B + 250
  T1: WRITE B
  T2: READ A
  T2: A := A * 10
  T2: WRITE A
  T2: READ B
  T2: B := B / 5
  T2: WRITE B

Schedule 2 (interleaved, equivalent to Schedule 1)
  T1: READ A
  T1: A := A – 250
  T1: WRITE A
  T2: READ A
  T2: A := A * 10
  T2: WRITE A
  T1: READ B
  T1: B := B + 250
  T1: WRITE B
  T2: READ B
  T2: B := B / 5
  T2: WRITE B

3.2.2 Conflict Serializability


There are two varieties of serializability in a serializable schedule. In this section, you will learn about
one of the two types of serializability known as conflict serializability. To understand the concept of
conflict serializability, let’s take the example presented in Table 3:

Table 3: Example of Conflict Serializability

Schedule S1
  T1: READ A
  T1: WRITE A
  T2: READ A
  T2: WRITE A
  T1: READ B
  T1: WRITE B
  T2: READ B
  T2: WRITE B

The preceding schedule shows an interleaved execution. Each transaction is divided into small units of
operations, denoted t(1) and t(2) for transactions T1 and T2, respectively. Two operations in a schedule are
said to conflict if all of the following conditions hold:
 The two operations (READ or WRITE) in a schedule belong to different transactions
 The two operations access the same data item
 One of the two operations is a WRITE operation

From the preceding schedule, the following cases can be considered to detect conflicting operations:
 Case 1: If t(1) is READ A and t(2) is also READ A, then they read the same data item A; however, none
of them interferes with other operations. Thus, they perform independently and this interleaved
operation does not create any conflict. Here, the order of execution does not matter.
 Case 2: If t(1) is WRITE A and t(2) is READ A, then the order of execution does matter since the value
of A written by t(1) can be read by t(2) if t(1) is executed before t(2). If t(2) is executed before t(1), then
t(2) reads the value that is previously in the database. Thus, this execution can raise a conflict.


 Case 3: If t(1) is WRITE A and t(2) is WRITE A, then the order of execution does matter since the write
operation, which is executed at last, should write its value to the database that retains for the next
operation. If t(1) executes at first, then the value written by t(2) should be the value of A after the
execution of t(2) and if t(2) executes first, then the value written by t(1) should be the final value of A.
Thus, it can be said that in the second and third cases, a conflict may arise between t(1) and t(2) since they
operate on the same data item and at least one of them is a WRITE operation. The conflicting operations in
schedule S1 are presented in Table 4:

Table 4: Example of Conflicting Operations

Schedule S1
  T1: READ A
  T1: WRITE A
  T2: READ A
  T2: WRITE A
  T1: READ B
  T1: WRITE B
  T2: READ B
  T2: WRITE B
(The conflicting pairs are T1's WRITE A with T2's READ A and WRITE A, T1's READ A with T2's WRITE A, and the corresponding pairs of operations on B.)
“A schedule is called conflict serializable if it is conflict equivalent to some serial schedule.”
Now the concern is what is conflict equivalent? Conflict equivalence between two schedules S1 and S2
can be defined as follows:
“Schedule S1 is said to be conflict equivalent with schedule S2 if they perform the same actions of the
transaction on the same database item and every pair of conflicting operations should be in the same
order.”
Table 5 presents an example of conflict equivalence, where the order of conflicting operations in schedule
S1 is in the same order as that in S2:

Table 5: Example of Conflict Equivalence

Schedule S1 (interleaved)
  T1: READ A
  T1: WRITE A
  T2: READ A
  T2: WRITE A
  T1: READ B
  T1: WRITE B
  T2: READ B
  T2: WRITE B

Schedule S2 (serial: T1 followed by T2)
  T1: READ A
  T1: WRITE A
  T1: READ B
  T1: WRITE B
  T2: READ A
  T2: WRITE A
  T2: READ B
  T2: WRITE B


3.2.3 View Serializability


Another definition of serializability is view serializability, which has less stringent rules than the rules
applied for conflict serializability. Two schedules S1 and S2 are said to be view equivalent if the following
conditions hold:
 If the transaction t(i) in schedule S1 reads the initial value of data item A, then the transaction t(i) in
schedule S2 should also read the initial value of A.
 If the transaction t(i) in schedule S1 reads a value of A that has been written by another transaction t(j),
then t(i) in schedule S2 should also read the value of A written by the same transaction t(j). This condition
ensures that each transaction reads the same value of A in both the schedules.
 If the transaction t(i) in schedule S1 writes the final value of A, then the transaction t(i) in schedule
S2 should also write the final value of A. This operation ensures that the final results in both the
schedules are the same.
Now, view serializability is defined on the basis of view equivalence as follows:
“A schedule S is called view serializable iff it is view equivalent to some serial schedule.”
Tables 6(a) and 6(b) present an example of two view equivalent schedules:

Table 6(a): View Equivalence of Schedule S1

Schedule S1
T1 T2 T3
READ A
WRITE A
WRITE A
WRITE A

Table 6(b): View Equivalence of Schedule S2

Schedule S2
T1 T2 T3
READ A
WRITE A
WRITE A
WRITE A

It can be noticed from the preceding two tables that, in schedule S1, some transactions write the data item
A without having read it first. This type of write is known as a blind write; every view serializable schedule
that is not conflict serializable contains such blind writes, because view serializability handles blind writes
more effectively than conflict serializability does.
The two schedules S1 and S2 are not conflict equivalent, but they are view equivalent. Thus, it is not
necessary that any view-serializable schedule is conflict serializable. However, every conflict-serializable
schedule is view serializable. It is easier to implement conflict serializability than view serializability.


3.3 MULTIVERSION CONCURRENCY CONTROL TECHNIQUES


In the two concurrency control protocols discussed earlier, 2PL and the timestamp-based protocol, a conflicting
READ operation needs to wait until the data value has been updated. However, the concurrency of the READ
operation can be maintained by keeping a current copy or version of the data item that is to be read. This
scheme is known as the multi-version concurrency control scheme, where no READ requests are delayed.
For a single data item A, many versions are maintained and, depending on the request, a suitable version
of A is selected for the READ operation. This is the task of the concurrency control manager.
Whenever a READ request for a data item, say A, is submitted, it is granted immediately and the operation
reads the current copy of A. To maintain serializability, the system must ensure that the READ operation
reads an appropriate version of the data item. According to this scheme, a WRITE operation on data item A
creates a new copy or version of A and the READ operation selects one of the versions of A to be read.

3.3.1 Multi-Version Timestamp-Based Protocol


A timestamp is a value that uniquely identifies a transaction; before the execution of a transaction, a
timestamp is assigned to it. In this protocol, multiple versions of a data item A are maintained and each
version of A consists of the following three parameters:
 Value of Ai: Stores the value of the ith version of A
 WRITE_TIMESTAMP (Ai): Refers to the timestamp value of a transaction Ti that writes the value of Ai
 READ_TIMESTAMP( Ai): Refers to the largest timestamp value of any transaction that reads the
data item Ai
According to this protocol, whenever a new version of A is created by a transaction Tk, the values of
WRITE_TIMESTAMP and READ_TIMESTAMP of that version are set to TIMESTAMP (Tk).
Also, whenever a transaction Tn is allowed to read a data item Aj, the value of READ_TIMESTAMP (Aj) is
set to the maximum of READ_TIMESTAMP (Aj) and TIMESTAMP (Tn).
Next, serializability is considered. To maintain serializability, this protocol follows two rules, which are
described as follows:
 IF transaction T issues WRITE (A), and Ai is the version of A with the maximum WRITE_TIMESTAMP (Ai)
that is <= TIMESTAMP (T), and READ_TIMESTAMP (Ai) > TIMESTAMP (T)
THEN ABORT and ROLLBACK T;
ELSE
create a new version Aj of A such that READ_TIMESTAMP (Aj) = WRITE_TIMESTAMP (Aj) = TIMESTAMP (T).
This rule states that if a transaction T issues a WRITE (A) operation and version Ai of A has the highest
value of WRITE_TIMESTAMP (Ai) that is less than or equal to the timestamp of T, and if READ_TIMESTAMP (Ai)
is younger (larger) than the timestamp of transaction T, then abort and roll back T.
Otherwise, create a new version of A whose read timestamp and write timestamp are both equal to the
timestamp of transaction T.
 The second rule states that if a transaction T issues a READ (A) operation, then find the version Ai of A
that holds the maximum value of WRITE_TIMESTAMP (Ai) such that
WRITE_TIMESTAMP (Ai) <= TIMESTAMP (T). Then return the value of version Ai and set READ_TIMESTAMP (Ai)
to the maximum of READ_TIMESTAMP (Ai) and TIMESTAMP (T).


According to the rule stated in rule 1, transaction T can be rolled back. This is because if transaction T
is attempting to write a version of A, which has been read by some other transaction at the same time,
then this causes conflict. Therefore, T needs to be rolled back.
In addition, from rule 2, it can be concluded that the READ operation following this protocol will always
be successful since it reads the latest version of the data item by comparing the WRITE_TIMESTAMP of
various versions of the same data item.
However, one major problem with this protocol is that it allows cascading rollback. To avoid this scenario,
each WRITE operation needs to be completed before its timestamp is associated with a data item, which
reduces the risk of cascading rollback. Another drawback is that reading a data item also requires updating
its READ_TIMESTAMP field, which may require two disk accesses rather than one.

3.3.2 Multi-Version 2PL Protocol


Multi-version 2PL protocol is based on the 2PL protocol and multi-version timestamp-based protocol.
It utilises the advantages of both these schemes, where many versions of a data item have also been
maintained. In this protocol, transactions are divided into two types: update transactions and read-only
transactions. An update transaction follows the 2PL protocol, acquiring all the locks it needs before releasing
any of them. If an update transaction issues a READ operation on a data item A, a shared lock on A is granted,
and if it issues a WRITE operation, then an exclusive lock is granted by the system. On the other hand, a
read-only transaction follows the multi-version timestamp-based protocol and, when it issues a READ
operation on the data item A, the latest committed version of A is read by the transaction.
As a timestamp, an integer counter is used that increases its value after each commit transaction
statement. Let T1 and T2 be update transaction and read-only transaction, respectively and the counter
variable holds a value i. Now, if T1 and T2 both try to access the same data item A and if TIMESTAMP (T1)
< TIMESTAMP (T2), then T2 will access the data item Ai with the timestamp i if T1 has not committed;
otherwise, it will access data item Ai+1 with timestamp value i+1. When a WRITE operation is performed,
a new version has been created and the counter variable value has been then increased by 1. As a result
of this whole operation, the read-only transaction always reads a value from the current version and it
never needs to wait.
Old versions of data are deleted depending on their timestamp values, while a data item always retains its
latest version along with its READ_TIMESTAMP and WRITE_TIMESTAMP values. Since the protocol follows 2PL,
cascading rollback is avoided and serializability is maintained. This is the multi-version protocol most
commonly applied in commercial systems.
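Many commercial systems expose this behaviour directly in SQL. As a rough, Oracle-style sketch (the ACCOUNTS table and its columns are illustrative names only), a read-only transaction reads one consistent snapshot of committed versions and never waits for concurrent writers:
-- Session 1: a read-only transaction reads a consistent snapshot,
-- even while other sessions are updating and committing concurrently
SET TRANSACTION READ ONLY;
SELECT SUM(BALANCE) FROM ACCOUNTS;
COMMIT;

-- Session 2: an update transaction; its committed change creates a new
-- version of the rows, visible only to transactions that start later
UPDATE ACCOUNTS SET BALANCE = BALANCE + 100 WHERE ACC_NO = 'A';
COMMIT;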

3.4 GRANULARITY OF DATA ITEMS AND MULTIPLE GRANULARITY LOCKING


All concurrency control strategies presume that the database contains a set of named data objects. A
database item might be one of the following options:
 A database record
 A database record’s field value
 A disc block
 A whole file
 The entire database


3.4.1 Granularity Level Considerations for Locking


Data item granularity refers to the size of data items. Small item sizes are referred to as fine granularity,
whereas large item sizes are referred to as coarse granularity. Several factors must be taken into account
while determining the data item size. Although comparable arguments may be made for other concurrency
control strategies, data item size will be considered here in the context of locking.
To begin, keep in mind that the higher the data item size, the lower the degree of concurrency allowed. If
the data item size is a disc block, for example, a transaction T that has to lock a record B must lock the
whole disc block X that contains B since a lock is associated with the entire data item (block). If another
transaction S tries to lock a different record C in the same block X in a conflicting lock mode, it is forced
to wait. Transaction S would be allowed to proceed if the data item size was a single record since it would
be locking a separate data item (a record). The smaller the data item size, on the other hand, the greater
the number of objects in the database. Because each item is connected with a lock, the system will have
a greater number of active locks for the lock manager to maintain. There will be more lock and unlock
procedures, resulting in a larger overhead. Furthermore, extra storage space will be needed for the lock
table. Storage for the read_TS and write_TS for each data item is required for timestamps and there will
be a corresponding overhead for handling a large number of items. Given the preceding considerations,
the obvious issue is the optimum item size. The answer is that it depends on the type of transaction. When
a normal transaction accesses a limited number of records, having the data item granularity set to one
record is useful. If, on the other hand, a transaction often accesses several records in the same file, it
may be preferable to have block or file granularity so that the transaction treats all of those records
as a single (or a few) data item.
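The trade-off is also visible in SQL itself. As a rough sketch in Oracle-style syntax (the EMPLOYEE table from the earlier lab exercise is used only as an illustrative name), a transaction can lock at record or table granularity:
-- Fine granularity: lock only the rows the transaction actually touches
SELECT * FROM EMPLOYEE WHERE SSN = 'SE01' FOR UPDATE;

-- Coarse granularity: a single lock covers the whole table, which is cheaper
-- to manage but blocks every other writer on any row of the table
LOCK TABLE EMPLOYEE IN EXCLUSIVE MODE;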

3.4.2 Multiple Granularity Level Locking


In concurrent transactions, locks have been acquired on a single data item. However, in case of a
transaction that needs to lock the entire database or some part of the database, it is required to put a
lock on the entire data items that a database or the part of the database holds, which is quite inefficient.
Another point is that if a long transaction requires many data items to lock, then the lock manager needs
to maintain a lock table whose size is quite high. Moreover, many locking and unlocking operations
need to be performed, which increase the overhead of the system. Therefore, the entire database can be
divided into level of granularity, which enhances the performance of locking of the database and also
improves the recovery process. For the sake of simplicity, the entire database can be divided in four levels
and represent the level of granularity in the form of a tree with four levels. The tree has its root node
that holds the entire database; the first layer consists of a number of disk blocks to which the database
can be divided; the second layer consists of a number of files that each disk block holds; and the leaf
node of the tree consists of the records that each file contains. Figure 1 represents the granularity tree:

Figure 1: Granularity Tree (root: the entire database DB; first level: disk blocks DB1 to DBN; second level: files such as F1A to FNA within each block; leaves: records such as R1A1 to RNAN within each file)


The data item size is often called data item granularity. The fine granularity is referred to as the leaf level
nodes that contain the records or data items. On the other hand, the coarse granularity is referred to as
the nodes that consist of the number of descendants. If a transaction considers fine granularity, then
the degree of concurrency is high compared to the degree of concurrency that results in a transaction,
which considers coarse granularity.
Now the concern is what should be the best size of a data item? The answer is that it depends on the size
of the transaction. If a transaction is small, then it is feasible to hold the locks on the single data items;
if the transaction is long, then it would be feasible to hold the lock on a particular block in the database
that contains the data items, which are required to be locked.
In the next subsection, the multiple granularity 2PL protocol will be devised that helps to maintain
serializability and concurrency in the schedule.
Next, let’s see how locks on data items are acquired when granularity is implemented. If a transaction T1
wants to lock the entire database, then it will be practical to lock the entire database instead of locking
each and every data item in the database. If another transaction T2 wants to lock the data item that
resides in the database, then it is required to traverse the granularity tree from root to leaf to find any
conflicting operations for the same data item that T2 is requested for. If there exists any conflicting
operation on the same data item, then T2 is forced to wait.
The concept of granularity can be implemented in the 2PL protocol, which enhances the performance of
the 2PL protocol. In this case, transactions can acquire locks at any level of the granularity tree instead of
locking each item under the node.

3.4.3 Multiple-Granularity Locking Protocol


Before describing the multiple-granularity locking protocol, it is necessary to devise a new type of locking
known as intention locking. Let’s first consider that only exclusive locks and shared locks are allowed, and
take two transactions T1 and T2 such that T1 acquires an exclusive lock on disk block DB1, whereas T2
requests a shared lock on the record R1A2.
There are two conditions, where in one condition, T1’s request is processed first and in the other condition,
T2’s request is processed first. The impact of both these conditions is described in DBMS as follows:
 Condition 1: If T1’s request is submitted first.
Since T1 requests to update the disk block DB1, so an exclusive lock on DB1 has been granted by the
system instead of granting an exclusive lock on each and every data item that falls under this disk
block, Thus, when T2’s request has been submitted for reading the data item R1A2, which resides
in the same disk block DB1, the system needs to traverse the sub-tree whose root is DB1 to find any
conflicting locks held by the same data item R1A2. This is required to verify the compatibility of
the requested lock. Now, if any conflicting lock has been acquired by the data item R1A2, then T2 is
forced to wait; otherwise, the shared lock that is requested by T1, has been granted by the system.
 Condition 2: If T2’s request is submitted first.
T2 is requested for a shared lock on the data item R1A2, which is one of the leaf nodes of the
granularity tree. Thus, when this request has been submitted to the lock manager, it has been granted
immediately. After this, when T1 requests for an exclusive lock on the disk block DB1 holdingthe data
item R1A2, the lock manager needs to traverse each node that is descendant of DB1 at everylevel to
find any conflicting locks. This procedure is quite inefficient regarding the implementation of
granularity tree.
Thus, it can be concluded that condition 2 defeats the purpose of granularity. For this reason, a new
type of lock has been introduced, which is known as the intention lock. The idea of an intention lock is to


indicate, at the ancestor nodes, the type of lock that will be acquired on a lower-level node that is locked
explicitly in the granularity tree. For example, if a transaction desires to lock node F1A to update some of
its descendants, then all the ancestor nodes on the path from the root to F1A should hold an intention
exclusive lock. There are three types of intention locks, which are described as follows:
 Intention shared lock or IS lock: Indicates that shared locks will be acquired explicitly on some
descendant(s) at a lower level.
 Intention exclusive lock or IX lock: Indicates that exclusive or shared locks will be acquired explicitly
on some descendant(s) at a lower level.
 Shared and intention exclusive lock or SIX lock: Indicates that a shared lock has been acquired on the
root of the sub-tree and exclusive locks will be acquired explicitly on some descendants at a lower level.
For example, if a transaction requests a SIX lock on F1A, then node F1A acquires a shared lock and the
descendant nodes that are updated acquire exclusive locks.
To map the compatibility between different types of locking, compatibility matrix is used, which is shown
in Table 7:

Table 7: Compatibility Matrix

IS IX S SIX X
IS TRUE TRUE TRUE TRUE FALSE
IX TRUE TRUE FALSE FALSE FALSE
S TRUE FALSE TRUE FALSE FALSE
SIX TRUE FALSE FALSE FALSE FALSE
X FALSE FALSE FALSE FALSE FALSE

This matrix is helpful for developing the multiple granularity 2PL protocol. This protocol consists of the
following rules:
 Compatibility matrix should be considered while making any decision
 Root of granularity tree must be locked first and it can be locked in any mode
 A node Ni in the granularity tree can be locked by any transaction Ti in S or IS mode only if the
parent node of Ni has already been locked by Ti in either IX or IS mode
 A node Ni in the granularity tree can be locked by any transaction Ti in X, SIX, or IX mode only if the
parent node of Ni has already been locked by Ti in either IX or SIX mode
 A transaction Ti can lock a node if it has not previously unlocked any node (This implements 2PL
protocol)
 A transaction Ti can unlock a node Ni if none of the child nodes of Ni is currently locked by Ti
From the above-stated rules, it can be emphasised that the first rule prevents any conflicting operations
from occurring, whereas rules 2, 3 and 4 state the mode of locking of the other nodes of the sub-tree when a
node acquires a lock. Rules 5 and 6 enforce the 2PL protocol. It can be seen that the locks in the granularity
tree are acquired in a top-down manner, whereas the locks are released in a bottom-up manner.
Suppose transaction T1 needs to read all the data items of the disk block DB1. Thus, T1 needs to lock the
database DB in IS mode and DB1 in S mode. Take another transaction T2 that wants to update the record
R1A5. It needs to acquire IX locks on DB, DB1 and F1A, and an exclusive lock on R1A5. On the other hand,
a third transaction T3 wants to update the entire file F3A. For this purpose, it needs to lock DB and DB1 in IX mode and


F3A in X mode. If these three transactions are considered, then it can be concluded that transactions
T2 and T3 can take place concurrently since they are operating on different data items in the same disk
block; however, transaction T1 can never be executed concurrently with T2 and T3. Thus, it can be said
that this protocol enhances the type of concurrency that is applied on the transactions and reduces the
lock overhead. The lock manager in this case does not need to maintain so many locking and unlocking
operations. This protocol is especially suited for the following purposes (an SQL sketch of the corresponding
table lock modes follows the list):
 Short transactions that access only a few data items
 Long transactions that access an entire file, for example when generating reports
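As a rough illustration, Oracle-style table lock modes correspond approximately to the lock types described above (ROW SHARE to IS, ROW EXCLUSIVE to IX, SHARE to S, SHARE ROW EXCLUSIVE to SIX and EXCLUSIVE to X). The following sketch uses the EMPLOYEE table only as an illustrative stand-in for a node of the granularity tree:
-- IS-style intention lock: the transaction intends to read some rows of the table
LOCK TABLE EMPLOYEE IN ROW SHARE MODE;

-- IX-style intention lock: the transaction intends to update some rows of the table
LOCK TABLE EMPLOYEE IN ROW EXCLUSIVE MODE;

-- SIX-style lock: read the whole table while updating a few of its rows
LOCK TABLE EMPLOYEE IN SHARE ROW EXCLUSIVE MODE;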

3.5 ADVANCED DATABASE RECOVERY TECHNIQUES


The database is vulnerable to failures caused by inconsistency, network failure, errors, or any other type
of unintended harm. As a result, database recovery procedures are critical for restoring a database to
operational status following a disaster. Several distinct recovery procedures are available in a database
system and are discussed in this section.

3.5.1 Recovery Concept


Recovery from transaction failures often entails restoring the database to the most recent consistent
state just prior to the moment of failure. To do this, the system must keep track of the changes made to
data objects by the different transactions. As previously stated, this information is normally saved in
the system log. Informally, a typical recovery plan may be characterised as follows:
 If a large portion of the database suffers extensive damage as a result of a catastrophic failure, such
as a disc crash, the recovery method restores a previous copy of the database that was backed up to
archival storage (typically tape or another large capacity offline storage media) and reconstructs
a more current state by reapplying or redoing the operations of committed transactions from the
backed-up log, up to the time of failure.
 When the database on the disc is not physically destroyed and a non-catastrophic failure of type
1 through 4 occurs, the recovery technique is to reverse any changes that may create a database
inconsistency. A transaction that has altered certain database entries on disc but has not been
committed, for example, must have its modifications reversed by undoing its write operations. Some
operations may also need to be redone to restore the database to a consistent state; for example, if a
transaction has been committed but some of its write operations have not yet been written to the
disc. The recovery methodology does not need a complete archival copy of the database for non-
catastrophic failure. Instead, the entries in the online system log on disc are evaluated to identify
the proper recovery procedures.
Deferred updates and instantaneous updates are the two basic strategies for recovering from
non-catastrophic transaction failures. Deferred update approaches do not physically update the
database on disc until a transaction reaches the commit point, at which point the modifications are
saved in the database. All transaction modifications are stored in the local transaction workspace or
in the main memory buffers that the DBMS maintains prior to committing (the DBMS main memory
cache). The updates are logged persistently in the log before committing and then the modifications
are written to the database on disc following commit. If a transaction fails before it reaches its
commit point, it has not modified the database in any manner and hence UNDO is not required. It
may be essential to REDO the impact of a committed transaction’s actions from the log because their
effect has not yet been captured in the database on disc. As a result, the deferred update technique
is also known as the NO-UNDO/REDO algorithm.
In the immediate update approach, the database may be updated by some activities of a transaction before
the transaction reaches its commit point. These activities, however, must also be


documented in the log on disc by force-writing before they are applied to the database on disc,
allowing for recovery. If a transaction fails after recording certain changes in the database on disc but
before reaching its commit point, the effect of the transaction’s activities on the database must be
reversed. In other words, the transaction must be reversed. During recovery, both undo and redomay
be necessary in the case of an instantaneous update. This method, known as the UNDO/REDO
algorithm, necessitates both procedures during recovery and is most commonly utilised in practice.
The UNDO/NO-REDO method is a variant of the algorithm in which all modifications must be saved
in the database on disc before a transaction commits, requiring just undo.
The UNDO and REDO operations must be idempotent, which means that performing an operation
numerous times is identical to performing it just once. In fact, the entire recovery process should be
idempotent because if the system fails during the recovery process, the subsequent recovery attempt
may UNDO and REDO some write_item operations that were previously performed during the initial
recovery process. The outcome of recovering from a system crash during recovering should be the
same as recovering when there is no crash while recovering!

3.5.2 Caching (Buffering) of Disk Blocks


The recovery process is frequently inextricably linked to operating system processes, most notably the
buffering of database disc pages in the DBMS main memory cache. Multiple disc pages containing the data
to be updated are often cached into main memory buffers and then updated in memory before being
pushed back to disc. Caching disc pages is normally an operating system function, but becauseit is
critical to the performance of recovery operations, it is handled by the DBMS by using low-level operating
system functions.
In general, it is more convenient to think of recovery in terms of database disc pages (blocks). For the
purpose of keeping these buffers, a collection of in-memory buffers known as the DBMS cache is often
kept under the control of the DBMS. To keep track of which database objects are in the buffers, a cache
directory is employed. This might be a list of entries. When the DBMS demands action on an item, it first
examines the cache directory to see if the disc page containing the item is already in the DBMS cache.
If it is not, the object must be found on the disc and the relevant disc pages are put into the cache.
To create room for the new item, it may be required to replace (or flush) some of the cache buffers.
To choose the buffers for replacement, several page replacement algorithms are comparable to those
used in operating systems, such as least recently used (LRU) or first-in-first-out (FIFO), or a new DBMS-
specific approach, such as DBMIN or Least-Likely-to-Use, can be employed (see bibliographic notes).
The entries in the DBMS cache directory include extra buffer management information. A dirty bit is
associated with each buffer in the cache and can be included in the directory entry to indicate whether or
not the buffer has been updated. When a page is initially read from the database disc into a cache buffer,
a new entry with the new disc page address is added to the cache directory and the dirty bit isset to 0.
(zero). The dirty bit for the associated directory item is set to 1 as soon as the buffer is updated (one).
The directory can additionally store other information, such as the transaction id(s) of the transaction(s)
that updated the buffer. When the contents of a buffer are replaced (flushed) from the cache, the contents
must first be written back to the associated disc page if its dirty bit is 1. Another bit, known as the pin-unpin
bit, is also required: if a page in the cache cannot yet be pushed back to disc, it is pinned (bit value 1).
The recovery protocol, for example, may prevent particular buffer pages from being written back
to disc until the transactions that altered this buffer have been committed.
When flushing a changed buffer back to the disc, there are two options. The first technique, known as in-
place updating, writes the buffer to the same original disc location, overwriting the old values of any altered
data items on the disc with the new values. As a result, only one duplicate of each database disc block


is kept. The second option, known as shadowing, writes an updated buffer to a new disc location to
preserve several copies of data items, however, this approach is rarely utilised in reality. In general, the
previous value of the data item prior to updating is referred to as the before image (BFIM) and the new
value after updating is referred to as the after image (AFIM). If shadowing is employed, both the BFIM
and the AFIM can be retained on disc; hence, keeping a log for recovery is not necessarily essential.

3.5.3 Write-Ahead Logging (WAL)


WAL is a log-based file system recovery mechanism that writes changes to permanent data to a log file
called the WAL log. The change records (Redo records) in the log file indicate the modifications. The WAL
log is forced to disc at critical points, such as transaction commit. Redo records can be used in the event
of a reset or crash to change an old copy of a permanent data block on the disc into the version that was
in memory at the time of the reset.
By keeping the WAL log, the updated permanent data blocks are no longer required to be written to disc
when each block is modified. The only entries in the WAL log that must be written to disc are the Redo
records. This enables the maintenance of a write cache of permanent data blocks. WAL protects all
permanent and system tables, but not the Transient Journal (TJ), because TJ data are placed in the WAL
log. WAL is also not utilised to safeguard spool or volatile tables.
The WAL log is kept apart from the usual table area as a logical file system. The WAL log is assigned whole
cylinders and has its index structure. WAL log data is a collection of WAL log entries that comprises the
following:
 Redo records, which are used to update disc blocks and maintain file system consistency during
restarts.
 TJ records are utilised to roll back transactions.
The WAL log suffers from many of the same issues as the standard file system, such as data corruption
caused by hardware, software and operational issues. The Filer tool includes specific diagnostic starting
options and instructions that may show and change the WAL log and its index, allowing support
professionals to troubleshoot such issues. Figure 2 depicts WAL:

Figure 2: Write-Ahead Logging (WAL) — a transaction's change is recorded in the WAL log before the corresponding data pages are modified; the required pages are copied from disk into the buffer cache, and a checkpoint writes the dirty pages back to disk


3.5.4 Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)


The most advanced, and also the most complex, recovery algorithm is the ARIES recovery algorithm. The
advantage of this recovery technique is that it reduces the recovery time as well as the overhead of the
log. However, all of this comes at the expense of complexity, and the algorithm uses techniques that are
difficult to understand. In this sub-section, a simpler version is presented to ease the understanding of the
algorithm. Before describing the algorithm, it is required to elaborate a few concepts, which are as follows:
 Log Sequence Number (LSN): Refers to a pointer used to identify the log records.
 Dirty page table: Dirty pages are pages whose updated version is placed in the main memory while the
disk version has not yet been updated. The table that keeps a record of all the dirty pages is known as the
dirty page table. It is helpful in reducing unnecessary redo operations.
 Fuzzy checkpoint: Means instead of using normal checkpoints, this algorithm uses fuzzy checkpoints
that maximise concurrency in transactions.
In ARIES, a log consists of multiple log files. Each file has a specific capacity. If the first log file grows to
some limit, then further log records are appended to a new log file. Each log file maintains a file number,
which increases monotonically. Each log record is associated with an LSN, which is the combination of the
file number and the offset within that particular file. Each data page, in turn, stores the LSN of the latest
log record describing a change to that page. Thus, during the recovery procedure, the corresponding LSN
indicates which operation on the stored data page was the last one applied.
If a log record belongs to transaction T, then that log record also needs to maintain the LSN of the previous
log record of the same transaction T. Thus, the current log record stores the previous LSN as Previous_LSN,
which helps to move backward through the transaction's records without reading the whole log.
A log record is written for five types of actions: WRITE, COMMIT, ABORT, UNDO and END.
When an update operation on a data item has been undone, a compensation log record is written in the log,
whereas when a transaction commits or ends, an end log record is written to the disk.
There is also a different kind of log record, known as the checkpoint log record. This contains the dirty page
table along with the list of all the active transactions, i.e., those transactions that were executing at the
time of the failure. For each active transaction, it also maintains the LSN of the last record written by that
transaction.
ARIES recovery algorithm consists of the following three phases:
 Analysis phase
 Redo phase and
 Undo phase

Analysis Phase
In analysis phase, the dirty page table has been formed by analysing the pages from the buffer and also
a set of active transactions has been identified. When the system encounters a crash, ARIES recovery
manager starts the analysis phase by finding the last checkpoint log record. After that, it prepares the
dirty page table. This phase mainly prepares a set of active transactions that are needed to be undone.
In the analysis phase, after getting the last checkpoint log record, the log record is scanned in forward


direction and update of the set of active transactions, transaction table and dirty page table are done
in the following manner:
 If the recovery manager finds a log record for a transaction that is not yet in the active transaction set,
then that transaction is added to the set.
 If it finds an end log record for a transaction, then that transaction is deleted from the transaction table.
 If it finds a log record that describes an update operation, then the page referred to by that log record is
added to the dirty page table.
The transactions that are added in the active transaction set are all required to be undone.

Redo Phase
The redo phase is the second phase, where all the operations that need to be executed again take place. It
re-executes those operations whose results are not reflected in the disk. This is done by finding the smallest
LSN of all the dirty pages in the dirty page table; this LSN defines the log position from which the redo
operation starts. Changes recorded before this position are either in the main memory or have already been
flushed to the disk. Thus, for each change recorded in the log from this position onwards, the redo phase
determines whether or not the operation needs to be re-executed.

Undo Phase
In the undo phase, all the transactions that are listed in the active transaction set have to be undone. Thus,
the log should be scanned backward from the end and the recovery manager should undo the necessary
operations. Each time an operation is undone, a compensation log record has been written to the log. This
process continues until there is no transaction left in the active transaction set.
After the successful completion of this phase, a database can resume its normal operations.

3.5.5 Checkpoints
The techniques discussed so far are sufficient to recover from a failure. However, every recovery requires
scanning the log completely to find which transactions are committed and which are not. In addition,
depending on the state of the transaction, different recovery techniques have to be applied. Consider Figure 3,
where T1, T2, T3, T4 and T5 are executing concurrently. If a failure occurs at time t, then redo recovery needs
to be applied for T1, T3 and T4, and undo recovery can be applied to T2 and T5. Now, it can be seen that T1,
T3 and T4 are already committed and their values are already stored in the database. However, after the
failure, all the committed transactions need to be redone, which is not efficient in terms of utilisation of
resources. Hence, the idea of a checkpoint comes in this situation. Figure 3 shows the set of interleaved
transactions:

Figure 3: Set of Interleaved Transactions


A checkpoint saves all the records of the committed transactions in the database, along with the log, in
stable storage, which is assumed never to be lost in a failure. Thus, the log needs to contain only those


data items that are required by the currently executing transactions. This, at the same time, minimises
the size of the log, and committed transactions do not need to be redone in case of a failure. In
a simple checkpoint, the following conditions should hold:
 No new transactions are accepted. The system needs to focus on the currently executing transaction.
 Wait is required until all the transactions can commit or abort on that time boundary and then
transfer the log to the stable storage.
 Write a log record <CHECKPOINT> in the log so that during the recovery, the system identifies all the
uncommitted transactions, which occur after the checkpoint.
 Update the database with all the data items that are updated by these transactions.
 After one checkpoint has been finished, the system resumes accepting new transactions.
Therefore, redo and undo operations need to be applied only for the transactions that are active at the
checkpoint or begin their execution after it. This simplifies and speeds up the recovery process.
 Let’s modify the previous example by applying checkpoint at time t1. Now, if the system fails at time
t, then in the course of recovery, the system should do the following operations:
 T1 and T4 will not participate in the recovery process as they are already committed before the
checkpoint and their changes have already been written to the database.
 T2 will participate in the recovery process as it is committed after the checkpoint and before the
failure. Thus, it needs to undergo redo recovery process.
 For transaction T3, as it starts after the checkpoint and finishes before the failure, it also needs to
undergo the redo recovery process. Figure 4 shows an example of a checkpoint:

Figure 4: Example of Checkpoint (a checkpoint is taken at time t1 on the time axis, before the failure at time t)


 As transaction T5 starts after the checkpoint and in the middle of its transaction failure occurs, it
needs to undergo undo recovery process as it has not been committed in its lifetime.

3.6 LAB EXERCISE


Write a query to demonstrate the concepts for rollback, commit and checkpoints.
The following SQL program is used to display the commit command:
Commit
BEGIN TRANSACTION;
CREATE TABLE Employee(Id INTEGER, Name VARCHAR, Designation VARCHAR, Age INTEGER);


INSERT INTO Employee VALUES(1,'Yogita', 'Content Writer', 26);


INSERT INTO Employee VALUES(2,'Lucky', 'Sales Manager', 45);
INSERT INTO Employee VALUES(3,'Rahul', 'Account Manager', 36);
INSERT INTO Employee VALUES(4,'Mayank', 'Content Editor', 32);
INSERT INTO Employee VALUES(5,'Rihan', 'Content Writer', 29);
SELECT * FROM Employee;
The output of the given SQL command is:
1|Yogita|Content Writer|26
2|Lucky|Sales Manager|45
3|Rahul|Account Manager|36
4|Mayank|Content Editor|32
5|Rihan|Content Writer|29
DELETE FROM Employee WHERE Age = 36;
COMMIT;
SELECT * FROM Employee;
The output of the given SQL command is:
1|Yogita|Content Writer|26
2|Lucky|Sales Manager|45
4|Mayank|Content Editor|32
5|Rihan|Content Writer|29
Rollback
The following SQL program is used to display the rollback command:
BEGIN TRANSACTION;
CREATE TABLE Employee(Id INTEGER, Name VARCHAR, Designation VARCHAR, Age INTEGER);
INSERT INTO Employee VALUES(1,'Yogita', 'Content Writer', 26);
INSERT INTO Employee VALUES(2,'Lucky', 'Sales Manager', 45);
INSERT INTO Employee VALUES(3,'Rahul', 'Account Manager', 36);
INSERT INTO Employee VALUES(4,'Mayank', 'Content Editor', 32);
INSERT INTO Employee VALUES(5,'Rihan', 'Content Writer', 29);
SELECT * FROM Employee;
The output of the given SQL command is as follows:
1|Yogita|Content Writer|26
2|Lucky|Sales Manager|45
3|Rahul|Account Manager|36
4|Mayank|Content Editor|32
5|Rihan|Content Writer|29
DELETE FROM Employee WHERE Age = 36;
ROLLBACK;
SELECT * FROM Employee;
The output of the given SQL command is as follows:
1|Yogita|Content Writer|26
2|Lucky|Sales Manager|45
3|Rahul|Account Manager|36

4|Mayank|Content Editor|32
5|Rihan|Content Writer|29
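Savepoint
The exercise also refers to checkpoints. Standard SQL does not expose recovery checkpoints directly, but DBMSs that support savepoints (SQLite, PostgreSQL, MySQL/InnoDB and Oracle, among others) provide a comparable intermediate marker inside a transaction. The following sketch assumes the Employee table created above and rolls back only to a named savepoint rather than to the start of the transaction:
BEGIN TRANSACTION;
INSERT INTO Employee VALUES(6,'Asha', 'Analyst', 31);
SAVEPOINT after_insert;
DELETE FROM Employee WHERE Age = 45;
ROLLBACK TO SAVEPOINT after_insert;  -- undoes only the DELETE issued after the savepoint
COMMIT;                              -- the INSERT of row 6 is made permanent
SELECT * FROM Employee;
The final SELECT should show the new row for Asha while Lucky (Age 45) is still present, because only the work done after the savepoint was undone.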

3.7 CONCLUSION

 Transaction is a unit of database operation. It can be a read operation, a write operation, or a read/
write operation.
 Serializability theory in DBMS deals with maintaining the correctness of concurrent execution.
 A serial schedule is a schedule in which transactions never interleave with each other.
 Two schedules are equivalent if the execution of one schedule leaves the database in the same state as the execution of the other.
 View serializability has less stringent rules than the rules applied for conflict serializability.
 Multi-version 2PL protocol is based on the 2PL protocol and multi-version timestamp-based protocol.
 Recovery from transaction failures often entails restoring the database to the most recent consistent
state just prior to the moment of failure.
 WAL is a log-based file system recovery mechanism that writes changes to permanent data to a log
file called the WAL log.
 The advantage of ARIES recovery technique is that it reduces the recovery time as well as the
overhead of a log.
 In the analysis phase, the dirty page table is formed by analysing the pages in the buffer and the set of active transactions is identified.
 The redo phase is the second phase, in which all the transactions that need to be executed again are redone.
 In the undo phase, all the transactions that are listed in the active transaction set have to be undone.

3.8 GLOSSARY

 Transaction: A unit of database operation. It can be a read operation, a write operation, or a read/
write operation
 Serializability: The theory in DBMS deals with maintaining the correctness of concurrent execution
 Serial schedule: A schedule in which transactions never interleave with each other
 Equivalent schedule: Two schedules are equivalent if the execution of one schedule leaves the database in the same state as the execution of the other
 View serializability: It has less stringent rules than the rules applied for conflict serializability
 Multi-version 2PL protocol: It is based on the 2PL protocol and multi-version timestamp-based
protocol
 Recovery: The transaction failures often entail restoring the database to the most recent consistent
state just prior to the moment of failure
 WAL: A log-based file system recovery mechanism that writes changes to permanent data to a log
file called the WAL log
 ARIES: An advanced recovery algorithm whose advantage is that it reduces the recovery time as well as the overhead of the log


 Analysis phase: The phase in which the dirty page table is formed by analysing the pages in the buffer and the set of active transactions is identified
 Redo phase: The second phase, in which all the transactions that need to be executed again are redone
 Undo phase: The transactions that are listed in the active transaction set have to be undone

3.9 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Describe the concept of a transaction.
2. Serializability provides a means to schedule transactions so that they never conflict with each other. Discuss.
3. In the two concurrency control protocols, 2PL and the timestamp-based protocol, a conflicting READ operation needs to wait until the data value has been updated. Describe multi-version concurrency control protocols in brief.
4. By keeping the WAL log, the updated permanent data blocks are no longer required to be written to
disc when each block is modified. Explain the significance of WAL.
5. In ARIES, a log consists of multiple log files. Evaluate the concept of ARIES.

3.10 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. A transaction is a unit of database operation. It can be a read operation, a write operation or a read/write operation. Refer to Section Introduction
2. Serializability theory in DBMS deals with maintaining the correctness of concurrent execution. Refer to Section Review of ACID Properties and Serializability
3. The concurrency of the READ operation can be maintained by keeping a current copy or version of
the data item that is to be read. This scheme is known as multi-version concurrency control scheme,
where no READ requests are delayed. Refer to Section Multiversion Concurrency Control Techniques
4. WAL is a log-based file system recovery mechanism that writes changes to permanent data to a log
file called the WAL log. Refer to Section Advanced Database Recovery Techniques
5. The most advanced and most difficult recovery algorithm is the ARIES recovery algorithm. The advantage of this recovery technique is that it reduces the recovery time as well as the overhead of a log. Refer to Section Advanced Database Recovery Techniques

3.11 POST-UNIT READING MATERIAL

 https://www.google.co.in/books/edition/Transaction_Processing/VFKbCgAAQBAJ?hl=
en&gbpv=1&dq=acid+properties+in+dbms&printsec=frontcover
 https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf


3.12 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates the concept of transaction processing. Also, discuss the
importance of serializability in database management.

UNIT

04 Data Security

Names of Sub-Units

Introduction, Database Security Issues, Discretionary Access Control Based on Granting and Revoking
Privileges, Mandatory Access Control, Role-based Access Control for Multilevel Security, SQL Injection,
Introduction to Statistical Database Security, Introduction to Flow Control

Overview
This unit begins by discussing the concept of data security and database security issues. Next, the unit discusses discretionary access control based on granting and revoking privileges, and mandatory access control. Further, the unit explains role-based access control for multilevel security and SQL injection. Towards the end, the unit introduces statistical database security and flow control.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of data security and database security issues
 Explain the concept of discretionary access control based on granting and revoking privileges
 Describe the mandatory access control and role based access control for multilevel security
 Summarise the significance of SQL injection and statistical database security
 Outline the importance of flow control

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of data security and database security issues
 Assess the concept of discretionary access control based on granting and revoking privileges
 Evaluate the importance of mandatory access control and role based access control for multilevel
security
 Determine the significance of SQL injection and statistical database security
 Explore the importance of flow control

Pre-Unit Preparatory Material

 http://www.sis.pitt.edu/jjoshi/IS2935/Fall04/p227-denning.pdf

4.1 INTRODUCTION
Data security refers to the techniques and standards we employ to safeguard digital information against
unauthorised or unintentional access, alterations and disclosure throughout its lifespan. Organisational
strategies, policies, procedures and technology that can secure data against cyberattacks, criminal
intrusions or unintended data spills are also examples of data security.
Physical security of the hardware and network assets housing protected data, administrative controls and policies, and logical defensive mechanisms for the software accessing the data are all part of data security practice.

4.2 DATABASE SECURITY ISSUES


Database security refers to a set of techniques intended to protect database management systems against hostile cyber-attacks and unauthorised use. Database security measures are intended to safeguard not just the data contained inside the database, but also the data management system and any applications that access it, against misuse, damage and infiltration. In short, database security encompasses the tools, methods and methodologies used to establish security inside a database system.

4.2.1 Types of Database Security


The basic goal of database security is to protect a database’s sensitive data while maintaining the database’s confidentiality, integrity and availability. Database security is necessary to protect against cyber-attacks that can result in financial loss, damage to brand reputation and loss of business continuity and customer trust. The following are the main types of database security:
 Authentication: Database authentication is a category of database security that verifies a user’s login credentials against those stored in the database. The database can be accessed only if the supplied credentials match those in the database, which indicates that the user is authorised to access the database.
A legitimate user can access only the data for which he or she holds access privileges, and the database is not accessible to any unauthorised or malicious users. As a result, database authentication is critical for ensuring database security.
 Database encryption: One of the most effective methods of database security is encryption, which
protects your database from unauthorised access while information is being stored and transmitted
over the Internet.
Encryption algorithms such as AES, together with hashing algorithms such as MD5 and SHA-1, are used to protect various sorts of sensitive data.
Within a database, an encryption method often converts plain text data into ciphertext of
unreadable types. Therefore, if hackers gain access to your database, they will not be able to utilise
your information until it is decrypted.
Encrypting sensitive data before putting it in a database is strongly recommended because it
provides security and protects against cyber-attacks.
 Backup database: Backup is a type of database security that is used to recover data in the event of
data loss, corruption, hacking or natural disasters. In real time, it copies or archives the database
on a secondary storage.
If the primary and secondary servers are configured in the same location and the primary server is
destroyed, the secondary server may be destroyed as well. As a result, you will be unable to run your
application and your system will shut down until you are able to recover.
As a result, it is recommended that the backup server be configured in a physically distinct location
to maintain database security. If the primary server is unavailable, the database can be recovered
via the secondary server.
 Physical security: Physical database security refers to the safeguarding of the database server
room from unwanted access. The database server should be housed in a secure, climate-controlled
environment.
 Application security: To protect against web assaults like SQL injection, you must secure your
application and database. SQL injection is one of the most prevalent web attacks, in which a hacker
takes control of an application’s database in order to steal sensitive information or destroy it.
In this strategy, the attacker uses web page input to introduce malicious code into a SQL query. This
error happens when a programme fails to properly sanitise SQL statements. As a result, an attacker
can use malicious SQL queries to get access to your database and use it for destructive purposes.
You may secure your application against SQL injection attacks by using the following preventative
methods:
 Using prepared (parameterised) statements
 Using a web application firewall
 Keeping systems up to date
 Validating user input
 Restricting privileges
 Using predefined (stored) procedures


 Access Control: You must restrict unauthorised users’ access to the database to guarantee database
security. Only authorised users have access to the database and no unauthorised users have access.
Create user accounts for DBAs who will access the database, assign them a role and restrict what
they may do with it.
As a result, access control is a sort of database security that protects your database by preventing
unauthorised people from accessing it.
 Web Application Firewall (WAF): A WAF is an application-based cyber security technology used as a database security measure. It protects applications by screening, monitoring and blocking dangerous HTTP traffic.
This database security measure restricts who has access to the application and prevents outsiders
from using the internet to access it. A web application firewall, which will safeguard your application
and database from malicious users, should be used to secure your programme.
 Use strong password: This is an easy but crucial list of suggestions for ensuring database security.
You should use a strong password for database login as a DBA or IT officer and never disclose your
password with others.
If you use a simple password such as your phone number, employee ID or date of birth, hackers will
try to get in using these passwords. As a result, you’ll lose control of your database.
As a result, develop a strong password for database login that includes a combination of letters,
numbers and special characters (a total of at least 10 characters) and change it on a frequent basis.
 Database auditing: Auditing is a form of database security check that can help you detect and identify unauthorised access to the database management system.
Database auditing examines log files on a regular basis to determine who accessed the database,
when they did so, how long they stayed and what they did while there. You can immediately discover
if the database server has been accessed without authorisation.

4.2.2 Control Measures of Database Security


The following are the key control measures used to protect database data security:
 Access control: Unauthorised users’ access to the database must be limited by the security mechanism of the database management system. The DBMS manages access control by creating
user accounts and controlling the login procedure. As a result, database access to sensitive data
is limited to those people (database users) who are permitted to do so and unauthorised access is
prohibited.
During the whole login time, the database system must also maintain track of all operations
conducted by a certain user.
 Inference control: This strategy serves as a countermeasure for statistical database security. Its goal is to keep users from exploiting inference channels, so that sensitive information is not disclosed inadvertently.
Inference control is concerned with two types of disclosure:
 Identity disclosure
 Attribute disclosure
 Flow control: This prevents information from flowing in such a way that unauthorised users can access it. Covert channels are the pathways through which information flows implicitly in ways that violate a company’s privacy policy.


 Data encryption: This method is mostly used to protect sensitive data such as credit card numbers and OTPs. Encoding algorithms are used to encode the data.
Unauthorised users will have a tough time decoding this encoded data, whereas authorised users
are given decoding keys to decode data.

4.3 DISCRETIONARY ACCESS CONTROL BASED ON GRANTING AND REVOKING PRIVILEGES


The granting and revoking of privileges is the most common technique for imposing discretionary access control in a database system. Consider privileges in the context of a relational database management system. We will focus on a privilege system comparable to the one designed for the SQL language; many modern relational DBMSs employ some form of this method. The fundamental concept is to include statements in the query language that allow the DBA and a limited group of users to grant and revoke privileges.

4.3.1 Types of Discretionary Privileges


The idea of an authorisation identifier is used in SQL2 and subsequent versions to refer to, generally speaking, a user account (or group of user accounts). To keep things simple, we will use the terms user and account interchangeably in place of authorisation identifier. The database management system (DBMS) must provide selective access to each relation in the database based on specific accounts. Because operations can also be regulated, having an account does not necessarily entitle the account holder to all of the DBMS’s capabilities. Informally, there are two levels for assigning privileges to use the database system:
 The account level: At this level, the DBA specifies the particular privileges that each account holds, independently of the relations in the database.
 The relation (or table) level: At this level, the DBA may control which accounts have access to each individual relation or view in the database.
The account-level privileges apply to the capabilities of the account itself and can include the CREATE SCHEMA or CREATE TABLE privilege, which allows the account to create a schema or base relation; the CREATE VIEW privilege; the ALTER privilege, which allows schema changes such as adding or removing attributes from relations; the DROP privilege, which allows relations or views to be deleted; the MODIFY privilege, which allows tuples to be inserted, deleted or updated; and the SELECT privilege, which allows the account to retrieve information by querying the database. It is worth noting that these account-level privileges apply to the account as a whole. No relations may be created from an account that does not have the CREATE TABLE privilege. Account-level privileges are not defined in SQL2; instead, they are determined by the DBMS implementers. Earlier versions of SQL included a CREATETAB privilege that granted an account the ability to create tables (relations).
The second level of privileges applies to the relation level, regardless of whether the relations are base or
virtual (view). These privileges are specified in SQL2. Unless otherwise specified, the term relation in the following discussion can refer to either a base relation or a view. Privileges at the relation level define
which particular relations each sort of command can be applied to for each user. Some privileges also
apply to specific columns (attributes) of relations.
SQL2 commands only grant rights at the relation and attribute levels. Although this is pretty generic,
it makes creating accounts with limited rights difficult. The granting and revocation of privileges often
follows an authorisation paradigm known as the access matrix model, in which the rows of a matrix M
represent subjects (people, accounts, programs) and the columns represent objects (relations, records,
columns, views, operations). Each matrix position M(i, j) indicates the types of privileges (read, write, update) that subject i holds on object j. To govern the granting and revoking of privileges on it, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was first created. All privileges on a relation are granted to the owner. Using the CREATE SCHEMA command in SQL2, the DBA may assign an owner to an entire schema by creating the schema and attaching the relevant authorisation identifier to it.
By providing rights to other users’ accounts, the owner account holder can convey privileges on any
of the owned relations to them. Each unique relation R in SQL can be granted the following sorts of
privileges:
 SELECT (retrieval or read) privilege on R: This gives the account retrieval capability. In SQL, it grants the account the ability to retrieve tuples from R using the SELECT command.
 Modification privileges on R: These give the account the ability to modify R’s tuples. In SQL, this covers three privileges: UPDATE, DELETE and INSERT, corresponding to the three SQL commands for changing a table R. Furthermore, both the INSERT and UPDATE privileges can be restricted to particular attributes of R.
 REFERENCES privilege on R: This gives the account the ability to reference (or refer to) the relation R when specifying integrity constraints. This privilege can also be restricted to particular attributes of R. Note that, to create a view, the account must have the SELECT privilege on all relations included in the view definition. (A hedged SQL sketch of these privileges follows this list.)
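As a hedged illustration of these relation-level privileges (the EMPLOYEE relation and the account name user_b are hypothetical, and column-list syntax varies slightly across DBMSs):
GRANT SELECT ON EMPLOYEE TO user_b;
GRANT INSERT, DELETE ON EMPLOYEE TO user_b;
GRANT UPDATE (Designation) ON EMPLOYEE TO user_b;  -- UPDATE restricted to one attribute
GRANT REFERENCES (Id) ON EMPLOYEE TO user_b;       -- allow constraints that refer to Id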

4.3.2 Specifying Privileges through the Use of Views


The view mechanism is an essential discretionary permission method in and of itself. If the owner A of a
relation R wants another account B to be allowed to obtain only particular fields of R, A can construct a
view V of R that includes only those characteristics and then give SELECT on V to B. The same is true for
restricting B to getting just particular tuples of R; a view V may be formed by specifying the view using
a query that picks only those tuples from R that A wishes B to access.
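A minimal sketch of this idea, using hypothetical relation, view and account names:
CREATE VIEW EMP_NON_MANAGERS AS
    SELECT Name, Designation
    FROM EMPLOYEE
    WHERE Designation <> 'Manager';
GRANT SELECT ON EMP_NON_MANAGERS TO account_b;  -- B sees only the projected columns and selected rows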

4.3.3 Revoking of Privileges


In some instances, it is preferable to temporarily grant a user permission. For example, the owner of
a relationship may wish to provide a user the SELECT access for a certain job and then withdraw that
privilege after the activity has been finished. As a result, a method for canceling rights is required. A
REVOKE command is included in SQL for revoking privileges.
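For example, assuming the hypothetical EMPLOYEE relation and account used earlier, a temporary privilege can be granted and later withdrawn as follows:
GRANT SELECT ON EMPLOYEE TO account_b;     -- granted for a specific task
REVOKE SELECT ON EMPLOYEE FROM account_b;  -- withdrawn once the task is finished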

4.3.4 Propagation of Privileges Using the GRANT OPTION


When owner A of a relation R grants a privilege to another account B, the privilege can be granted with or without the GRANT OPTION. If the GRANT OPTION is given, B can also grant that privilege on R to other accounts. Suppose that A grants the privilege to B with the GRANT OPTION and that B subsequently grants the privilege on R to a third account C, also with the GRANT OPTION. Privileges on R can, thus, be propagated to other accounts without the knowledge of R’s owner. If the owner account A now revokes the privilege granted to B, the system should automatically revoke all the privileges that B propagated based on that privilege.
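The propagation and cascading revocation just described can be sketched as follows; the account names are hypothetical and the CASCADE keyword follows the SQL standard, although the default revoke behaviour differs across DBMSs:
-- A (the owner) grants to B with the ability to pass the privilege on
GRANT SELECT ON EMPLOYEE TO account_b WITH GRANT OPTION;
-- B, in turn, propagates the privilege to C
GRANT SELECT ON EMPLOYEE TO account_c;
-- When A revokes the privilege from B, the privilege B passed on to C is also revoked
REVOKE SELECT ON EMPLOYEE FROM account_b CASCADE;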
A user may receive a certain privilege from two or more sources. For example, A4 may receive a specific UPDATE R privilege from both A2 and A3. In such a case, even if A2 revokes this privilege from A4, A4 will still hold the privilege because it was also granted by A3. If A3 later revokes the privilege from A4, A4 loses the privilege entirely. As a result, a DBMS that allows privilege propagation must keep track of how all privileges were granted so that revoking privileges can be done correctly.

4.3.5 Specifying Limits on Propagation of Privileges


Techniques for limiting privilege propagation have been devised, but they have not yet been implemented in most DBMSs and are not part of SQL. Limiting horizontal propagation to an integer number i means that an account B given the GRANT OPTION can grant the privilege to at most i other accounts. Vertical propagation is more complicated; it limits the depth of the granting of privileges. A privilege with a vertical propagation of zero is equivalent to a privilege granted with no GRANT OPTION. If account A grants the privilege to account B with vertical propagation set to an integer value j > 0, account B has the GRANT OPTION on that privilege, but B can grant the privilege to other accounts only with a vertical propagation less than j. In effect, vertical propagation limits the sequence of GRANT OPTIONs that may be passed from one account to the next based on a single original grant of the privilege.
With an example, we quickly demonstrate horizontal and vertical propagation restrictions, which
are currently unavailable in SQL or other relational systems. Assume that A1 gives SELECT to A2 on
the EMPLOYEE relation, with horizontal propagation of 1 and vertical propagation of 2. Because the
horizontal propagation limitation is set to one, A2 may only provide SELECT to one account. Furthermore,
A2 cannot grant the privilege to another account unless vertical propagation is set to 0 (no GRANT
OPTION) or 1; this is due to A2’s requirement to lower vertical propagation by at least one when giving
the permission to others. Furthermore, the horizontal propagation must be less than or equal to the
horizontal propagation that was initially authorised. For instance, if account A provides a privilege to
account B with the horizontal propagation set to an integer value j > 0, B can only grant permission to
other accounts with a horizontal propagation less than or equal to j. As seen by this example, horizontal
and vertical propagation mechanisms are intended to restrict the depth and breadth of privilege
dissemination.

4.4 MANDATORY ACCESS CONTROL (MAC)


Historically, the dominant security strategy for relational database systems has been the discretionary access control technique of granting and revoking privileges on relations. This is an all-or-nothing approach: a user either has or does not have a specific privilege. Many applications require an additional security policy that classifies data and users based on security classes; this approach, known as MAC, is typically combined with the discretionary access control strategies described above. It is important to note that most commercial DBMSs currently provide only discretionary access control mechanisms. However, multilevel security is required in government, military and intelligence applications, as well as in many industrial and corporate applications. Oracle, for example, has provided customised versions of its RDBMSs that incorporate mandatory access control for government use.
Top secret (TS), secret (S), confidential (C) and unclassified (U) are the most common security classes, with TS being the highest and U being the lowest. Other, more complicated security classification schemes exist, in which the security classes are organised in a lattice. To illustrate the point, we will use a system with these four security classification levels, where TS ≥ S ≥ C ≥ U. The Bell-LaPadula model, which is widely used for multilevel security, classifies each subject (user account, program) and object (relation, tuple, column, view, operation) into one of the four security classifications TS, S, C or U. The clearance (classification) of a subject S is referred to as class(S) and the classification of an object O as class(O).


Data access is restricted in two ways based on subject/object categorisation:


 A subject S is not permitted to read an object O unless class(S) ≥ class(O). This is known as the simple security property.
 A subject S is not permitted to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-property).
The first limitation is self-explanatory, enforcing the simple rule that no subject can read an item
with a security classification greater than the subject’s security clearance. The second constraint
is less obvious. It forbids a subject from writing an item with a security classification lower than
the subject’s security clearance. Infringing on this criterion would allow information to move from
higher to lower classes, which would be a violation of a fundamental premise of multilevel security.
A user (subject) with TS clearance, for example, may make a duplicate of an item with classification
TS and then write it back as a new object with classification U, making it visible across the system.
To incorporate multilevel security concepts into the relational database model, attribute values and tuples are commonly considered as data objects. As a result, each attribute A in the schema is associated with a classification attribute C and each attribute value in a tuple is associated with a corresponding security classification. Furthermore, in certain models, a tuple classification attribute TC is added to the relation attributes to provide a classification for the entire tuple. Because it supports classifications at several security levels, the model described here is known as the multilevel model. A multilevel relation schema R with n attributes is expressed as follows:
R (A1, C1, A2, C2, ..., An, Cn, TC)
where Ci denotes the classification attribute associated with Ai. The tuple classification attribute TC provides a general classification for the tuple itself, while each attribute classification Ci gives a finer-grained security classification for the corresponding attribute value. In each tuple t, the value of TC is the highest of all attribute classification values Ci within t.
A multilevel relation’s apparent key is the set of attributes that would have constituted the primary key in a conventional (single-level) relation. A multilevel relation will appear to contain different data for subjects (users) with different clearance levels.
In certain circumstances, a single tuple can be stored in the relation at a higher classification level and
the matching tuples can be produced at a lower classification level via a process called filtering. In other
circumstances, two or more tuples at various categorisation levels with the same value for the apparent
key must be stored.
This gives rise to the concept of polyinstantiation, in which several tuples can have the same apparent key value but different attribute values for users with different clearance levels.
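Multilevel relations are not part of standard SQL, but purely as an illustration of the schema R(A1, C1, A2, C2, ..., An, Cn, TC), an EMPLOYEE relation with per-attribute classifications could be sketched as follows (all names and types are hypothetical):
CREATE TABLE EMPLOYEE_ML (
    Name      VARCHAR(50),
    C_Name    CHAR(2),   -- classification of Name (U, C, S or TS)
    Salary    INTEGER,
    C_Salary  CHAR(2),   -- classification of Salary
    TC        CHAR(2)    -- tuple classification: the highest of the Ci values in the tuple
);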

4.4.1 Difference between Discretionary Access Control and Mandatory Access Control
The difference between the DAC and MAC is shown in Table 1:

Table 1: Difference between DAC & MAC

Discretionary Access Control (DAC) | Mandatory Access Control (MAC)
DAC stands for Discretionary Access Control. | MAC stands for Mandatory Access Control.
The implementation of DAC is simple. | The implementation of MAC is difficult.
The use of DAC is less secure. | The use of MAC is more secure.
The owner can control access and privileges, as well as restrict resources, based on the identification of the users. | The system assesses access and resources are restricted based on the individuals' clearance.
The properties of DAC are labor-intensive. | MAC does not have any labor-intensive properties.
Users' access is granted based on their identities rather than levels. | Users are limited based on their power and hierarchy level.
With no rules and regulations, DAC has a lot of flexibility. | MAC is not flexible because it comprises several tight rules and regulations.
DAC places full faith in its users. | Only administrators are trusted by MAC.
Only the user ID and ownership are used to make decisions. | Objects and tasks, which have their own identifiers, are used to make decisions.
It is impossible to control the flow of information in DAC. | The flow of information can be easily managed in MAC.
Commercial database management systems support DAC. | Commercial DBMSs do not support MAC.
DAC can be used in any field. | MAC is used by the military, government and intelligence services.

4.4.2 Row-Level Access Control and Label-Based Security


Many commercial DBMSs now employ the idea of row-level access control, which allows for the
implementation of sophisticated access control rules by examining the data row by row. Row-level access
control assigns a label to each data row, which is used to record information regarding data sensitivity.
Row-level access control improves data security by allowing permissions to be configured for each row
rather than simply the table or column. The database administrator assigns the user a default session
label at first. Levels represent a hierarchy of data-sensitivity levels to disclosure or corruption, with
the objective of preserving privacy or security. Labels are used to restrict unauthorised people from
seeing or changing data. A user with a low authorisation level, commonly indicated by a low number,
is disallowed access to data with a higher authorisation level. If no such label is assigned to a row, the row is given a label based on the user's session label. A Label Security policy is automatically enforced whenever data affected by the policy is accessed or queried by an application. When a policy
is applied, it adds a new column to each entry in the schema. The new column has a label for each row
that shows the sensitivity of the row according to the guideline. In label-based security, each user has
an identification, similar to how each user has a security clearance in MAC. The identity of this user
is compared to the label provided to each row to determine whether or not the user has permission to
read the contents of that row. However, given certain constraints and rules for that individual row, the
user can write the label value themself. This label can be adjusted to a value between the current session
label and the user’s lowest level. The DBA has the authority to define an initial default row label.
For each user, the label security requirements are implemented on top of the DAC requirements. As a
result, in order to access a row, the user must first meet the DAC criteria and then the label security
requirements. The DAC requirements ensure that the user has the legal authority to perform that
action on the schema. Only a subset of the tables in most applications requires label-based security. The
protection given by DAC is enough for the vast majority of application tables.
Managers and human resources staff are often in charge of developing security policies. The policies are
high-level, technology-agnostic and risk-related. Policies are created as a consequence of management


directives to establish organisational processes, guiding principles and courses of action that are deemed
expedient, wise or advantageous. Policies are usually accompanied by a description of the consequences
and actions that will be implemented if the policy is broken. The label security administrator interprets
and converts these rules into a collection of label-oriented policies, which specify the security labels for data and the authorisations for users; these labels and authorisations regulate access to the specified protected objects. Assume a user has SELECT permission on a table. Label security will automatically analyse
each row returned by the query to verify whether the user has permission to read the data when the
user conducts a SELECT statement on that table. For example, if the user’s sensitivity is set to 20, he
or she can access any rows with a security level of 20 or below. The level specifies how sensitive the
information in a row is; the more sensitive the row, the higher its security label value. Label Security
may also be used to run security checks on UPDATE, DELETE and INSERT statements.
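Concrete mechanisms are vendor-specific (Oracle Label Security, for instance, has its own policy administration interface), but the general shape of row-level access control can be sketched with PostgreSQL-style row-level security; the table, column, policy and helper table below are all hypothetical:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- A user may read a row only if the row's label does not exceed the clearance
-- recorded for that user in a hypothetical user_clearances table.
CREATE POLICY read_by_label ON orders
    FOR SELECT
    USING (row_label <= (SELECT clearance
                         FROM user_clearances
                         WHERE user_name = current_user));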

4.4.3 E-Commerce Access Control Policies


Web electronic commerce (e-commerce) environments are characterised by transactions that are carried out electronically. They require complex access control policies that go beyond the capabilities of typical DBMSs. Access
control in traditional database setups is often accomplished using a set of authorisations specified
by security officers or users in accordance with specific security regulations. A simplistic paradigm is
unsuitable for a dynamic context such as e-commerce. Furthermore, in an e-commerce context, not just
traditional data but also knowledge and expertise must be secured. These requirements necessitate greater latitude in defining access control strategies. The access control system must be adaptable enough
to accommodate a diverse range of protective objects. Support for content-based access control is a
second criterion. Content-based access control allows you to describe access control policies that take
into consideration the content of the protected item. Access control policies must allow the introduction
of criteria based on the object content in order to implement content-based access control.
A third criterion relates to topic heterogeneity, which necessitates access control policies that are
based on user traits and credentials rather than unique and individual characteristics (for example,
user IDs). Supporting the concept of credentials is one possible approach for better taking into account
user profiles in the creation of access control rules. A credential is a set of attributes about a person
that is important for security reasons (for example, age or position or role within an organisation). For
example, by employing credentials, one can easily design regulations such: Documents pertaining to the
system’s internals are only accessible to permanent workers with five or more years of service.
Because XML is becoming the ubiquitous representation language for document exchange over the
Web, as well as the language for e-commerce, it is projected to play a crucial role in access control for e-commerce applications. On the one hand, there is a need to secure XML representations by providing
access control techniques that are specially designed for the protection of XML documents. Access control
information, on the other hand (access control policies and user credentials), may be described using
XML. The Directory Services Markup Language (DSML) is an XML-based representation of directory
service information. It lays the groundwork for a standard for connecting with directory services, which
will be in charge of giving and validating user credentials. Policies and credentials can benefit from a
consistent display of both protection objects and access control policies. For example, some credential
characteristics (such as the user’s name) may be available to all users, whilst others may be shown only
to a select group of users. Furthermore, using an XML-based language to express credentials and access
control rules simplifies safe credential submission and access control policy export.

4.5 ROLE-BASED ACCESS CONTROL (RBAC) FOR MULTILEVEL SECURITY


In the 1990s, RBAC evolved fast as a validated technique for managing and enforcing security in large-
scale enterprise-wide systems. Its fundamental premise is that privileges and other permits are linked


to organisational roles rather than to individual users. Individual users are then assigned roles that are appropriate for them. The CREATE ROLE and DESTROY ROLE commands can be used to create and destroy roles. When necessary, the GRANT and REVOKE commands can be used to assign and revoke privileges from roles, as well as from individual users. A firm, for example, may include positions such as sales
account manager, buying agent, mailroom clerk, department manager and so on. Each position can be
allocated to several people. Security privileges that are common to a position are provided to the role
name and any individual assigned to this role will have those privileges automatically granted to them.
RBAC can be used in conjunction with standard discretionary and obligatory access controls to
guarantee that only authorised individuals in their designated roles have access to certain data or
resources. Users establish sessions in which they can activate a subset of the roles to which they are
assigned. Each session can be assigned to several roles, but it only maps to one user or one subject. Many
DBMSs support the idea of roles, which allow rights to be granted to roles.
Another significant criterion in many commercial DBMSs is the separation of roles. It is required
to prohibit one user from performing work that needs the participation of two or more persons,
hence preventing collusion. The mutual exclusion of responsibilities is one approach for successfully
implementing separation of duties. If a user cannot employ both roles at the same time, they are said to
be mutually exclusive. Role mutual exclusion may be divided into two types: authorisation time exclusion
(static) and runtime exclusion (dynamic). Two roles that have been declared as mutually exclusive
cannot be part of a user’s authorisation at the same time in authorisation time exclusion. Both of these
roles can be permitted to a single user under runtime exclusion, but they cannot be engaged by the user
at the same time. Another type of mutual exclusion of roles is total and partial exclusion.
The role hierarchy in RBAC is a logical approach to structure roles to represent the lines of authority
and responsibility in the company. Junior positions at the bottom are typically linked to more senior
roles as one goes up the hierarchy. As partial orders, the hierarchic diagrams are reflexive, transitive
and anti-symmetric. In other words, if a person holds one position, he or she is automatically assigned
to roles below in the hierarchy. Defining a role hierarchy entails deciding on the sort of hierarchy and the
roles and then putting the hierarchy into action by assigning roles to other roles. The following method
may be used to construct a role hierarchy:
GRANT ROLE full_time TO employee_type1
GRANT ROLE intern TO employee_type2
The examples above show how to assign the roles of full-time and intern to two different sorts of workers.
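In DBMSs that support SQL roles, the same idea is typically expressed with CREATE ROLE and ordinary GRANT statements; the role, table and user names below are hypothetical:
CREATE ROLE sales_account_manager;
GRANT SELECT, UPDATE ON customer_accounts TO sales_account_manager;
GRANT sales_account_manager TO user_anita;     -- the user inherits the role's privileges
REVOKE sales_account_manager FROM user_anita;  -- the privileges are withdrawn with the role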

Identity management is another security concern. The term “identity” refers to a person’s distinct name.
Because legal names are not always unique, a person’s identification must include enough additional
information to make the whole name unique. Identity Management refers to the process of authorizing
this identity and controlling the schema of these identities. Identity Management is concerned with how
businesses can efficiently verify individuals and regulate their access to sensitive information. It has
grown more obvious as a business need in many industries, affecting businesses of all sizes. Identity
Management administrators must continuously satisfy application owners while controlling costs
and enhancing IT efficiency. Another essential factor to consider in RBAC systems is the possibility of
temporal limits on roles, such as the time and length of role activations and timed triggering of a role
by activation of another function. Using an RBAC architecture to fulfill the major security needs of Web-
based applications is a highly desirable aim. Roles can be given to workflow tasks such that a user has
any of the roles associated with the job is permitted to execute it and can only play a specific role for a
limited time.
RBAC models have a number of appealing characteristics, including flexibility, policy neutrality,
improved support for security management and administration and other characteristics that make


them appealing candidates for designing secure Web-based applications. These characteristics are
absent in DAC and MAC versions. RBAC models also incorporate the features seen in standard DAC
and MAC policies. Furthermore, an RBAC model includes tools for dealing with security risks that
arise during the execution of activities and processes, as well as for establishing user-defined and
organisation-specific regulations. Another reason for RBAC models’ success is their ease of deployment
through the Internet.

4.6 SQL INJECTION


SQL injection, often known as SQLI, is a typical attack vector in which malicious SQL code is used to
manipulate backend databases and get access to data that was not intended to be revealed. Everything
from sensitive corporation data to user lists to private consumer information could be included in this
data. SQL injection can have a huge impact on a company’s bottom line. A successful attack could result
in the attacker accessing illegal user lists, deleting complete tables and, in some situations, obtaining
administrative rights to a database, all of which are extremely damaging to a corporation.
When estimating the cost of a SQLi, keep in mind the loss of customer trust if private details such as
phone numbers, addresses and credit card data are stolen. While this vector can be used to attack any
SQL database, the most common target is websites.
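The root cause is the construction of a statement by pasting untrusted input into the SQL text. A hedged sketch with a hypothetical Users table follows; the parameterised form uses MySQL-style PREPARE syntax, and most client libraries expose equivalent placeholder mechanisms:
-- If an application builds  SELECT * FROM Users WHERE Name = '<input>'
-- and the attacker supplies  ' OR '1'='1  as the input, the executed statement becomes:
SELECT * FROM Users WHERE Name = '' OR '1'='1';  -- returns every row
-- Safer: bind the value to a placeholder so it is never parsed as SQL
PREPARE get_user FROM 'SELECT * FROM Users WHERE Name = ?';
SET @n = 'Yogita';
EXECUTE get_user USING @n;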

4.6.1 Types of SQL Injection


SQL injections are usually divided into three types:
 In-band SQLi

 Inferential (Blind) SQLi


 Out-of-band SQLi

In-band SQLi
The attacker uses the same communication channel to launch assaults and acquire information.
Because of its simplicity and efficiency, in-band SQLi is one of the most prevalent SQLi attacks. This
approach is divided into two sub-variations:
 Error-based SQLi: The attacker takes steps that cause the database to generate error messages. The
data provided by these error messages could be used by the attacker to obtain knowledge about the
database’s structure.
 Union-based SQLi: This technique makes use of the UNION SQL operator, which combines the results of multiple SELECT statements into a single result that is returned as part of the HTTP response. This result may contain information that the attacker can use (see the sketch after this list).
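As an illustration of the union-based technique (hypothetical table and column names; the injected SELECT must return the same number of columns as the original query):
-- Intended statement built around a numeric id parameter:
SELECT name, price FROM products WHERE id = 10;
-- After injecting  10 UNION SELECT username, password FROM users  into the id field:
SELECT name, price FROM products WHERE id = 10
UNION SELECT username, password FROM users;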

Inferential (Blind) SQLi


To understand more about the server’s structure, the attacker delivers data payloads to it and monitors
its reaction and behaviour. Because data is not communicated from the website database to the attacker,
this method is known as blind SQLi. As a result, the attacker is unable to view information about the
attack in-band.
Blind SQL injections rely on the server’s response and behaviour patterns, thus they’re slower to execute
but just as dangerous. The following are the several types of blind SQL injections:
 Boolean: The attacker sends a SQL query to the database, requesting a response from the application.
Depending on whether the query is true or false, the response will differ. The information in the


HTTP response will alter or remain unchanged depending on the outcome. The attacker can then
determine if the message produced a true or false response.
 Time-based: The attacker sends a SQL query to the database, which causes the database to wait (for a specified number of seconds) before responding. The attacker can tell whether the query evaluated to true or false based on how long the database takes to answer: the HTTP response is generated either immediately or only after the waiting period, depending on the result. A sketch of such a probe follows this list.
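A typical time-based probe (MySQL-style SLEEP shown; other DBMSs provide different delay functions) injected into a vulnerable numeric parameter looks like the following; a delayed response tells the attacker that the probed condition is true:
SELECT name FROM products WHERE id = 10 AND IF((1=1), SLEEP(5), 0);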

Out-of-band SQLi
This type of attack is only possible if certain functionalities on the database server used by the web
application are enabled. This type of attack is typically employed as a complement to in-band and
inferential SQLi attacks.
Out-of-band SQLi is used when an attacker can’t launch an attack and gather information over the
same channel or when a server is too slow or unstable to complete these activities. These methods rely
on the server’s ability to send DNS or HTTP requests to send data to an attacker.

4.6.2 Risks Associated with SQL Injection


SQL injection is dangerous and the dangers it entails give motivation for attackers. Some of the dangers
connected with SQL injection attacks are as follows:
 Fingerprinting of databases: The attacker can discover the type of database being used in the
backend so that he can perform database-specific attacks that correlate to DBMS flaws.
 Denial of Service (DoS): The attacker can either bombard the server with requests, preventing
service from legitimate users or erase certain data.
 Authentication bypassing: One of the most typical threats is that the attacker can acquire access to
the database as an authorised user and do all of the needed actions.
 Injectable parameter identification: In this sort of assault, the attacker gets vital information
about the type and structure of a Web application’s back-end database. This attack is made feasible
by the fact that application servers’ default error pages are frequently unnecessarily informative.
 Remote command execution: This gives attackers the ability to run arbitrary commands on the
database. A remote user, for example, can utilise a remote SQL interactive interface to perform
stored database processes and functions.
 Privilege escalation: This sort of attack exploits logical flaws in the database to increase the attacker's access level.

4.7 INTRODUCTION TO STATISTICAL DATABASE SECURITY


Statistical databases are databases that hold information about a large number of people and are mostly
used to generate statistics on diverse groups. Users are, however, permitted to get some demographic
statistics such as population averages for a given state/district, as well as sums, counts, maximums, minimums and standard deviations.
Because statistical database security techniques prohibit the retrieval of individual data, ethical hackers are responsible for monitoring statistical database security. Statistical users are not permitted to access individual data in such databases, such as a specific person's income, phone number or debit card number. The DBMS is also responsible for maintaining the confidentiality of personal data.


Statistical queries are those that involve only aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE and STANDARD DEVIATION. Statistical queries are mostly used to obtain demographic statistics and to maintain employee databases in businesses and industries.
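For example, with a hypothetical PERSON table, a statistical user would be permitted the first query below but not the second:
-- Permitted: returns only an aggregate value for a group
SELECT AVG(Income) FROM PERSON WHERE District = 'North';
-- Not permitted: retrieves an individual's confidential value
SELECT Income FROM PERSON WHERE Name = 'Rahul';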

4.8 INTRODUCTION TO FLOW CONTROL


Flow control governs the distribution or flow of information among items that are accessible. When a
program receives data from X and writes values into Y, a flow between objects X and Y happens. Flow
controls ensure that information contained in some objects does not leak, either expressly or indirectly,
into less protected items. As a result, a user cannot obtain indirectly in Y what he or she cannot have
directly in X. Active flow control first appeared in the early 1970s. Most flow controls use the idea of
security class; information can be sent from a sender to a receiver only if the recipient’s security class is
at least as privileged as the sender’s. Flow control examples include keeping a service application from
leaking a customer’s sensitive data and preventing secret military material from being transmitted to
an unknown classified user.
A flow policy determines the channels via which data may be moved. The most basic flow policy defines
only two types of information: confidential (C) and non-confidential (N) and permits all flows except
those from class C to class N. This policy can help to tackle the problem of confinement that emerges
when a service program handles data such as client information, some of which may be sensitive.
For example, an income-tax calculating service may be permitted to keep a customer’s address and
a bill for services given, but not the customer’s income or deductions. The following processes are in
charge of verifying users’ authorisations for resource access: Only surgeries that have been granted
are carried out. An enhanced access control mechanism can impose flow controls by issuing a security
class (commonly referred to as the clearance) to each running program. The program is only permitted
to access a certain memory segment if its security class is equal to or higher than that of the segment.
It is only permitted to write in a segment if its class is the same as the segments. This ensures that no
information supplied by the individual moves from a higher to a lower class. For example, military
software with a secret clearance can only read from unclassified and confidential objects and write into
secret or top-secret objects. There are two types of flow that may be distinguished:
 Explicit flows: These arise as a result of assignment instructions, such as Y := f(X1, ..., Xn).
 Implicit flows: These are generated by conditional instructions, such as if f(Xm+1, ..., Xn) then Y := f(X1, ..., Xm).

Flow control methods must ensure that only allowed flows are implemented, both explicit and implicit.
To maintain safe information transfers, a set of rules must be followed. Rules can be stated using flow
relations between classes and applied to data, describing the permissible flows within a system. (An
information flow from A to B happens when the value of information associated with B is affected by information associated with A. The flow is caused by operations that transfer information from one object
to another.) These relations can describe, given a class, the set of classes through which information
(classified in that class) can flow or they can specify the exact relations that must be checked between
two classes in order for information to flow from one to the other. In general, flow control mechanisms
establish controls by giving a label to each item and identifying the object’s security class. The flow
relations specified in the model are then validated using labels.

4.8.1 Covert Channels


A covert channel enables the transmission of information that breaches security or policy. A covert
channel, in particular, permits information to travel from a higher classification level to a lower
classification level using unethical means. Covert channels are divided into two categories: timing


channels and storage channels. The difference between the two is that information in a timing channel is
transmitted by the timing of events or processes, whereas information in a storage channel is conveyed
by accessing system information or what would otherwise be inaccessible to the user.
As a simple example of a covert channel, consider a distributed database system with two nodes whose user security levels are secret (S) and unclassified (U). To commit a transaction, both nodes
must agree to commit. They can only do activities that are consistent with the *-property, which indicates
that the S site cannot write or send information to the U site during any transaction. However, if these
two sites conspire to establish a covert channel between them, the U site may commit a transaction
involving secret data unconditionally, but the S site may do so in some predefined agreed-upon way,
allowing certain information to be passed from the S site to the U site, violating the *-property. This is
possible if the transaction is repeated, but the S site’s activities implicitly send information to the U site.
Locking, as explored in earlier units, prevents concurrent writing of information into the same objects
by users with different security levels, hence prohibiting storage-type hidden channels. Operating
systems and distributed databases give control over the multiprogramming of processes, allowing
resource sharing without the chance of one program or process encroaching on another’s memory or
other system resources, eliminating timing-oriented hidden channels. Covert channels, in general, are
not a serious issue in well-implemented strong database solutions. Certain strategies, however, may be
devised by astute users to communicate information implicitly.
Some security experts feel that one approach to eliminate hidden channels is to prevent programmers
from gaining access to sensitive data that a program would handle after it has been launched. A bank
programmer, for example, does not need to know the identities or balances in depositors’ accounts.
Brokerage company programmers do not need to know what buy and sell orders their clients have.
Access to a form of actual data or some sample test data during program testing may be justified, but
not once the software has been adopted for regular usage.

4.9 CONCLUSION

 Database security refers to a set of techniques intended to protect database management systems
against hostile cyber-attacks and unauthorised use.
 Database authentication is a category of database security that verifies a user’s login identifications
before storing them in the database.
 Row-level access control allows for the implementation of sophisticated access control rules by
examining the data row by row.
 A Label Security policy is a policy that is defined by an administrator.
 The Directory Services Markup Language (DSML) is an XML-based representation of directory
service information.
 RBAC evolved fast as a validated technique for managing and enforcing security in large-scale
enterprise-wide systems.
 SQL injection, often known as SQLI, is a typical attack vector in which malicious SQL code is used to
manipulate backend databases.
 Statistical databases are databases that hold information about a large number of people and are
mostly used to generate statistics on diverse groups.
 Flow control governs the distribution or flow of information among items that are accessible.
 A covert channel enables the transmission of information that breaches security or policy.


4.10 GLOSSARY

 Database security: A set of techniques intended to protect database management systems against
hostile cyber-attacks and unauthorised use
 Database authentication: A category of database security that verifies a user’s login identifications
before storing them in the database
 Row-level access control: The implementation of sophisticated access control rules by examining
the data row by row
 Label security policy: A policy that is defined by an administrator
 Directory Services Markup Language (DSML): An XML-based representation of directory service
information
 RBAC: A validated technique for managing and enforcing security in large-scale enterprise-wide
systems
 SQL injection: A typical attack vector in which malicious SQL code is used to manipulate backend
databases
 Statistical database: A database that holds information about a large number of people and is
mostly used to generate statistics on diverse groups
 Flow control: The distribution or flow of information among items that are accessible
 Covert channel: The transmission of information that breaches security or policy

4.11 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. What do you understand by the concept of data security?
2. The dominant security strategy for relational database systems has been the discretionary access
control technique of giving and revoking privileges on relations. Explain the significance of MAC.
3. Row-level access control improves data security by allowing permissions to be configured for each
row rather than simply the table or column. Describe the concept of row-level access control.
4. SQL injection can have a huge impact on a company's bottom line. Discuss.
5. A flow policy determines the channels via which data may be moved. Evaluate the concept of flow
control.

4.12 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. Data security refers to the techniques and standards we employ to safeguard digital information
against unauthorised or unintentional access, alterations and disclosure throughout its lifespan.
Refer to Section Introduction
2. Many applications require an additional security policy that categorises data and users based on security
classes; this method is known as MAC. Refer to Section Mandatory Access Control (MAC)


3. Many commercial DBMSs now employ the idea of row-level access control, which allows for the
implementation of sophisticated access control rules by examining the data row by row. Refer to
Section Mandatory Access Control (MAC)
4. SQL injection, often known as SQLI, is a typical attack vector in which malicious SQL code is used to
manipulate backend databases and get access to data that was not intended to be revealed. Refer to
Section SQL Injection
5. Flow control governs the distribution or flow of information among items that are accessible. When
a program receives data from X and writes values into Y, a flow between objects X and Y happens.
Refer to Introduction to Flow Control

4.13 POST-UNIT READING MATERIAL

 https://www.itopstimes.com/itsec/creating-a-multi-layered-security-architecture-for-your-
databases/
 https://dzone.com/articles/10-common-database-security-issues

4.14 TOPICS FOR DISCUSSION FORUMS

 Discuss the concept of data security with your friends and classmates. Also, try to find some real-world examples of data security.

UNIT

05 Storage and Indexing

Names of Sub-Units

Operation on Files, Hashing Techniques, Types of Single-Level Ordered Indexes, Multi-Level Indexes,
Dynamic Multilevel Indexes Using B Trees and B+ Trees, Indexes on Multiple Keys

Overview
This unit begins by discussing the concept of storage and indexing. Next, the unit discusses
operations on files, hashing techniques, types of single-level ordered indexes and multi-level indexes.
Further, the unit explains dynamic multilevel indexes using B trees and B+ trees. Towards the end,
the unit discusses indexes on multiple keys.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of storage and indexing
 Explain the concept of operation on files, hashing techniques and types of single-level ordered
indexes
 Describe the multi-level indexes
 Explain the significance of dynamic multi-level indexes using B trees and B+ trees
 Discuss the importance of indexes on multiple keys

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of storage and indexing
 Assess the concept of operation on files, hashing techniques and types of single-level ordered
indexes
 Evaluate the importance of multi-level indexes
 Determine the significance of dynamic multi-level indexes using B trees and B+ trees
 Explore the importance of indexes on multiple keys

Pre-Unit Preparatory Material

 http://www.csbio.unc.edu/mcmillan/Comp521F16/Lecture16.pdf

5.1 INTRODUCTION
Data storage entails recording files and documents digitally and keeping them in a storage system for
future use. If necessary, storage systems may rely on electromagnetic, optical, or other media
to maintain and recover data. Data storage makes it simple to backup information for safekeeping and
speedy recovery in the case of a computer catastrophe or cyberattack.

Data can be stored physically on hard drives, disc drives, USB drives, or digitally on the cloud. The crucial
thing is ensuring your files are backed up and easily accessible in the event that your systems fail beyond
repair. In terms of data storage, some of the most significant variables to consider are dependability, the
robustness of security features, and the cost to construct and maintain the infrastructure. Examining
various data storage options and applications will assist you in making the right decision for your
company’s needs.

An index (IDX) is a data structure established on columns in a database table to significantly speed up
data retrieval operations in database systems. An index is a subset of a database table that is sorted
by key values. In the absence of an index, query languages such as SQL may have to traverse the whole
table from top to bottom to choose relevant rows. Indexes on database tables must be established with
caution since they take space and are kept in the database. Because the index must be updated when the
underlying table is modified, an index might cause write operations to take longer.

Indexes are created using various data structures such as B+ trees, balanced trees, and hashes. These
data structures allow specific rows to be located without scanning the full database table. Indexes on functions or
expressions can also be built. A unique index avoids duplicate entries in the index and the associated
table, functioning as a constraint. The arrangement of the columns is also a crucial part of indexing.
The first indexed column is always prioritised; row IDs are obtained using the first indexed columns. In
most cases, indexes do not deliver a value but rather find records.


In general, the two types of index architecture are as follows:


 Clustered indexes: Keep the row data in the sorted order of the index key. Only one clustered
index may be built on a database table. This works well when the data is retrieved in ascending or descending key order or
when a range restriction is placed on the indexed columns. A sequential ordering of data on the drive minimises the
number of block reads.
 Non-clustered indexes: Store the data in an arbitrary physical order, while the index imposes a logical ordering on it.
As a result, the index order differs from the physical ordering of the data. By default, a "create index"
statement builds a non-clustered index, whereas a clustered index is typically created on the primary key. The index keys
are stored in sorted order, with each leaf entry providing a page pointer. Non-clustered indexes perform
well on tables whose data changes frequently and whose indexed columns are used in
WHERE and JOIN clauses; a brief sketch of index creation follows this list.
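To make the idea of a non-clustered (secondary) index concrete, the short sketch below uses Python's built-in sqlite3 module to create a table, add an index with a CREATE INDEX statement and inspect the query plan before and after. The table, column and index names are illustrative, and SQLite's own clustering behaviour (rowid tables) is only loosely analogous to the clustered indexes described above.

# A small sqlite3 sketch: create a table, query it without an index (full table
# scan), then create a non-clustered index on the salary column and observe
# that the planner switches to an index search.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (ssn TEXT PRIMARY KEY, name TEXT, salary REAL)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(f"{i:09d}", f"emp{i}", 30000 + i) for i in range(1000)])

# Without an index on salary, this predicate forces a scan of the whole table.
print(cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM employee WHERE salary > 30950").fetchall())

# Create a non-clustered index on the salary column.
cur.execute("CREATE INDEX idx_emp_salary ON employee (salary)")

# The same query can now be answered by searching the index instead.
print(cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM employee WHERE salary > 30950").fetchall())
conn.close()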

5.2 OPERATION ON FILES


File operations are often divided into two categories: retrieval operations and update operations. The
former makes no changes to the file’s contents but merely locates certain entries so that their field values
may be reviewed and processed. The latter alters the file by adding or deleting records or changing field
values. In either case, we may need to select one or more records for retrieval, deletion, or modification based
on a selection condition (or filtering condition) that describes the criteria that the desired record or
records must meet.

Consider the following EMPLOYEE file: Name, SSN, Salary, Job_Code, and Department. A basic selection
criterion might be an equality comparison on a field value, such as (Ssn = ‘123456789’) or (Department
= ‘Research’). Other sorts of comparison operators, such as > or ≥ can be used in more complicated
situations, as in (Salary ≥ 30000). In general, the selection condition is an arbitrary Boolean expression
on the file’s fields.

File search processes are often based on basic selection criteria. The DBMS (or the programmer) must
break down a complicated condition to obtain a basic condition that may be utilised to find records on a
disc. Each found record is then verified to see if it meets the whole selection requirement. For example, we
may separate the simple condition (Department = ‘Research’) from the complicated condition ((Salary
≥ 30000) and (Department = ‘Research’)); each record satisfying (Department = ‘Research’) is found and
evaluated to determine if it also meets (Salary ≥ 30000).

When numerous file records meet a search requirement, the first record in the physical sequence of file
records is discovered and marked as the current record. Subsequent search operations begin with this
record and find the next record in the file that meets the requirement.

The actual processes for identifying and accessing file records differ from one system to the next. We
give a selection of sample operations below. These instructions are often used by high-level programs,
such as DBMS software applications, to access records, hence we refer to program variables in the
following descriptions:
 Open: Prepares the file for reading or writing. Allocates sufficient buffers (usually
at least two) to contain disc file blocks and obtains the file header. Sets the file pointer to the file’s
beginning.


 Reset: Sets the file pointer of an open file back to the beginning of the file.
 Find (or Locate): Searches for the first record that satisfies a search condition. Transfers the record's
block into the main memory buffer (if it is not already there). The file pointer points to the buffer
record, which becomes the current record. Different verbs are sometimes used to specify whether
the found record is to be retrieved or modified.
 Read (or Get): Copies the current record from the buffer to a user-defined program variable. This
operation may also advance the current record pointer to the next record in the file, necessitating
the reading of the next file block from the disc.
 Find Next: Finds the next record in the file that satisfies the search condition. Transfers the record's
block into the main memory buffer (if it is not already there). The record is placed in the buffer and
is now the current record. In legacy DBMSs based on the hierarchical and network models, many
variations of Find Next are available (for example, Find Next record within a current parent record,
Find Next record of a specified kind, or Find Next record when a complicated condition is satisfied).
 Delete: Deletes the current record and (at some point) updates the file on disc to reflect the deletion.
 Modify: Changes certain field values of the current record and (at some point) updates the file
on the disc to reflect the change.
 Insert: Insert a new record into the file by finding the block to be inserted, moving that block into
the main memory buffer (if it is not already there), writing the record into the buffer, and (finally)
copying the buffer to disc to reflect the insertion.
 Close: Completes the file access by releasing the buffers and doing any other clean-up procedures
that are required. Except for Open and Close, the preceding activities are known as record-at-a-time
operations since each one applies to a single record. It is possible to combine the procedures Find,
Find Next, and Read into a single operation called Scan, which is described below.
 Scan: Returns the first record if the file was just opened or reset; otherwise, it returns
the next record. If a condition is supplied, the returned record is the first or next record that meets
the criteria. In database systems, a file can be subjected to extra set-at-a-time higher-level actions.
The following are some examples:
 Find all: Finds all of the records in the file that match a search criterion.
 Find n (or Locate n): Searches for the first record that satisfies a search condition and then continues
to locate the next n – 1 records that satisfy the same condition. The blocks holding the n records are
transferred to the main memory buffer (if not already there).
 Find ordered: Returns all of the records in the file in the order requested.
 Reorganize: Starts the reorganisation process. As we will see, some file organisations require periodic
reorganisation; for example, the file records may be reordered by sorting them on a specified field.
 Unordered record files (heap files): In this simplest and most basic type of organisation, records are
stored in the file in the order in which they are inserted, so new records are added at the end of the
file. Such a file is called a heap or pile file, because it organises data in an unordered manner. This
arrangement is frequently used in conjunction with additional access paths. It is also used to collect
and store data records for future use.


Inserting a new record is a fairly efficient operation. The file’s last disc block is copied into a buffer,
the new record is appended, and the block is then overwritten to disc. The file header contains the
address of the final file block. However, looking for a record using any search criterion necessitates
a linear search over the file block, which is a time-consuming technique. If just one record matches
the search requirement, the software will typically read into memory and search half the file blocks
before finding the record. On average, for a file of b blocks, this involves searching (b/2) blocks. If no
records or multiple records fulfill the search criterion, the software must read and search the whole
file’s b blocks.
To remove a record, software must first locate its block, copy it into a buffer, delete the record from
the buffer, and then rewrite the block to the disc. This frees up space on the disc block. This method of
deleting a large number of records wastes storage space. Another method for record deletion is to store
an extra byte or bit, known as a deletion marker, with each record. A record is deleted by setting the
deletion marker to a certain value; a different marker value indicates that the record is still valid
(it has not been deleted). When executing a search, search programs consider only valid records
in a block. Both of these deletion methods need periodic file restructuring in order to recoup the
wasted space of deleted entries. The file blocks are read sequentially throughout the reorganisation,
and records are packed by eliminating deleted entries. Following such a restructuring, the blocks
are once again filled to capacity. Another option is to insert new records into the space left by deleted
records, albeit this involves more bookkeeping to keep track of vacant spots (a minimal sketch of the heap-file operations appears after this list).
For an unordered file, we may use either spanned or unspanned organisation, and it can be used
with either fixed-length or variable-length entries. Because the new record may not fit in its original
place on the disc, modifying a variable-length record may necessitate removing the old record and
inserting a modified record.
 Ordered records file (sorted files): We can physically arrange the records in a file on the disc by
the values of one of its fields, known as the ordering field. This results in a file that is ordered or
consecutive. If the ordering field is also a key field of the file, that is, a field that is guaranteed to have
a unique value in each record, the field is referred to as the ordering key for the file.
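The sketch below is a rough in-memory model (not a DBMS implementation) of the heap-file behaviour described above: new records are appended to the last block, a linear search reads the blocks in physical order, and deletion only sets a deletion marker so that the space is reclaimed during a later reorganisation. The block size and record fields are arbitrary choices for the illustration.

# A rough model of a heap (unordered) file: records are appended to the last
# block, a linear search scans block by block, and deletion sets a marker
# instead of physically removing the record.
BLOCK_SIZE = 3
blocks = [[]]                       # list of blocks; each block is a list of records

def insert(record: dict) -> None:
    if len(blocks[-1]) == BLOCK_SIZE:
        blocks.append([])           # last block is full: start a new block
    blocks[-1].append({**record, "deleted": False})

def linear_search(ssn: str):
    """Scan the blocks in physical order; on average about b/2 blocks are read."""
    for block_no, block in enumerate(blocks):
        for record in block:
            if not record["deleted"] and record["Ssn"] == ssn:
                return block_no, record
    return None

def delete(ssn: str) -> None:
    found = linear_search(ssn)
    if found:
        found[1]["deleted"] = True  # set the deletion marker; space is reclaimed later

for i in range(7):
    insert({"Ssn": str(i), "Name": f"emp{i}"})
delete("3")
print(linear_search("5"))           # -> (1, {...}) found in the second block
print(linear_search("3"))           # -> None, because the record is marked deleted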

5.3 HASHING TECHNIQUES


Another sort of main file structure is based on hashing, which allows for extremely rapid access to
entries under specific search conditions. This type of arrangement is sometimes referred to as a hash
file. The equality criterion on a single field, known as the hash field, must be used as the search condition.
In most circumstances, the hash field is also a file key field, in which case it is referred to as the hash key.
The objective behind hashing is to offer a function h, known as a hash function or randomising function,
that is applied to a record’s hash field value and returns the address of the disc block in which the record
is stored. The main memory buffer can be searched for the record within the block. Most records require
only single-block access to be retrieved.

Internal hashing describes hashing for internal files; it is typically implemented as a hash
table using an array of records. Assume the array index range extends from 0 to M - 1; we then have
M slots whose addresses correspond to the array indexes. We choose a hash function that maps the
hash field value to an integer between 0 and M - 1. The function h(K) = K mod M, for example, returns the


remainder of an integer hash field value K after division by M; this value is then utilised for the record
address. Figure 1 depicts the internal hashing data structures:

Figure 1: Internal Hashing Data Structures: (a) Array of M Positions for Use in Internal Hashing; (b) Collision Resolution by Chaining Records

Noninteger hash field values can be transformed into integers before the mod function is applied. For
character strings, the numeric (ASCII) codes associated with the characters can be used in the transformation,
for example, by summing or multiplying those code values. For a hash field whose data type is a 20-character string,
we may assume that a code function returns the numeric code of a character and that we are given a
hash field value K of type K: array [1..20] of char (in Pascal) or char K[20] (in C).
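As a simple illustration of the idea (the table size, the choice of hash function and the chaining layout are assumptions, not the text's algorithm), the following Python sketch hashes a character-string key by summing its character codes modulo M and resolves collisions by chaining, mirroring the array-of-slots structure of Figure 1.

# A minimal internal-hashing sketch: string keys are hashed by summing their
# character codes mod M, and collisions are resolved by chaining (each slot
# holds a list that stands in for the linked list of overflow records).
M = 11                                   # number of slots; a small prime chosen for the example

def h(key: str) -> int:
    """Map a character-string hash field value to a slot address in 0..M-1."""
    return sum(ord(ch) for ch in key) % M

table = [[] for _ in range(M)]           # M slots, each a chain of (key, record) pairs

def insert(key: str, record) -> None:
    table[h(key)].append((key, record))

def search(key: str):
    for k, record in table[h(key)]:      # examine only the chain of one slot
        if k == key:
            return record
    return None

insert("123456789", {"Name": "Smith", "Salary": 30000})
insert("987654321", {"Name": "Wong", "Salary": 40000})
print(search("123456789"))               # -> Smith's record
print(search("000000000"))               # -> None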


5.3.1 External Hashing for Disk Files


External hashing refers to the process of hashing disc files. The target address space is divided into
buckets, each of which carries numerous records, to accommodate the features of disc storage. A bucket
can be either a single disc block or a group of contiguous disc blocks. Rather than allocating an absolute
block address to the bucket, the hashing function converts a key to a relative bucket number. A table
in the file header transforms the bucket number into the matching disc block address. With buckets,
the collision problem is less severe since as many records as can fit in a bucket can hash to the same
bucket without producing difficulties. However, we must plan for the situation in which a new record
being inserted hashes to a bucket that is already full. We may use a variation of chaining in which a pointer
is retained in each bucket to a linked list of overflow entries for the bucket. The linked list pointers
should be record pointers, which provide a block address as well as a relative record position inside the
block. Given the value of a record’s hash field, hashing gives the quickest feasible access to that record.
Although most excellent hash functions do not save records in the order of hash field values, some
do. These functions are known as order-preserving hash functions. An example of an order-preserving
hash function is to take the leftmost three digits of an invoice number field as the bucket address
and to maintain the entries sorted by invoice number inside each bucket.
Figure 2 depicts external hashing:

Figure 2: External Hashing (Matching Bucket Numbers to Disk Block Addresses)
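The following sketch models external hashing in miniature: the address space consists of M buckets, each standing in for a disc block that holds a fixed number of records, and a record that hashes to a full bucket is placed on that bucket's overflow chain. The value of M, the bucket capacity and the sample keys are arbitrary choices for the illustration.

# A simplified external-hashing sketch: M buckets, each holding at most
# BUCKET_CAPACITY records, with overflow records kept on a per-bucket chain
# (standing in for a linked list of overflow blocks).
M = 4
BUCKET_CAPACITY = 2

buckets = [{"records": [], "overflow": []} for _ in range(M)]

def h(key: int) -> int:
    return key % M                        # maps a key to a relative bucket number

def insert(key: int, record) -> None:
    bucket = buckets[h(key)]
    if len(bucket["records"]) < BUCKET_CAPACITY:
        bucket["records"].append((key, record))
    else:
        bucket["overflow"].append((key, record))   # bucket full: chain the record

def search(key: int):
    bucket = buckets[h(key)]
    for k, record in bucket["records"] + bucket["overflow"]:
        if k == key:
            return record
    return None

for k in [340, 360, 321, 881, 761]:
    insert(k, f"record-{k}")
print(search(761))                        # found on the overflow chain of its bucket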


For example, if employee numbers at a firm are assigned as 1, 2, 3,... up to the entire number of workers,
we may use the identity hash function to keep order. Unfortunately, this only works if keys are created
in the correct sequence by some program.

Because a set number of buckets M are assigned, the hashing algorithm described thus far is known as
static hashing. This can be a significant disadvantage for dynamic files. Assume we have M buckets for
the address space and m is the maximum number of records that can fit in one bucket. Then only (m *
M) records will fit in the available area. If the number of records is significantly less than (m * M), we are
left with a lot of wasted space. On the other hand, if the number of records exceeds (m * M), there will be
multiple collisions and retrieval will be hindered due to the large lists of overflow data.

In either instance, we may need to adjust the number of blocks M allocated and then redistribute
the records using a new hashing function (depending on the new value of M). For huge files, these
reorganisations might take a long period. Newer hashing-based dynamic file organisations allow the
number of buckets to alter dynamically with just localised rearrangement, as shown in Figure 3:

Figure 3: Handling Overflow for Buckets by Chaining


Modifying the Field Value of a Given Record is Dependent on Two Factors


These two factors are the search condition used to locate the specific record and the field to be updated. If the search
condition is an equality comparison on the hash field, we can locate the record efficiently by using the
hashing function; otherwise, we must perform a linear search. To modify a non-hash field, we change the record
and rewrite it in the same bucket. When the hash field is changed, the record might be moved
to a different bucket, which necessitates the deletion of the old record followed by the insertion of the
amended record.

Hashing Techniques that Allow Dynamic File Expansion

One significant disadvantage of the previously described static hashing approach is that the hash
address space is fixed. As a result, dynamically expanding or shrinking the file is challenging. The first
approach, extendible hashing, stores an access structure in addition to the file and is therefore comparable
to indexing in certain ways. The primary distinction is that the access structure is dependent on the
values returned by the hash function when applied to the search field. The access structure in indexing
is determined by the values of the search field itself. The second method, known as linear hashing, does
not require any additional access structures. Another approach, known as dynamic hashing, employs
an access mechanism based on binary tree data structures.
Extendible hashing. In extendible hashing, a type of directory, that is, an array of 2^d bucket addresses, is
maintained, where d is the global depth of the directory. The integer value corresponding to the first (high-order) d
bits of a hash value is used to determine a directory entry, and the address in that entry identifies the
bucket in which the associated records are stored. However, each of the 2^d directory locations does not
have to have its own bucket. If all the records that hash to these locations fit in a single bucket, many
directory locations with identical initial d bits for their hash values may have the same bucket address.
The number of bits on which the bucket contents are based is specified by a local depth d′ kept with each
bucket (Figure 4 shows a directory with a global depth of d = 3). The value of d can be increased or decreased by one at a time,
thus doubling or halving the number of entries in the directory array. Doubling is required if a bucket whose local depth d′
equals the global depth d overflows. Halving occurs if d > d′ for all buckets after some deletions. The majority of record retrievals need
two-block visits, one to the directory and one to the bucket.

To illustrate bucket splitting, assume that a newly inserted record causes an overflow in the bucket whose
hash values begin with 01 (the third bucket in Figure 4). The records will be divided into two buckets: the first
will contain all records whose hash values begin with 010, and the second will contain all records whose
hash values begin with 011. The two new separate buckets are now pointed to the two directory locations
010 and 011. They used to point to the same bucket before the divide.

The local depth of the two new buckets is 3, one more than the local depth of the old bucket. If
a bucket that overflows and is split used to have a local depth d′ equal to the global depth d of the directory,
the size of the directory must now be doubled so that we can use an extra bit to distinguish the two new
buckets. For example, if the bucket for records with hash values beginning with 111 overflows, the two
new buckets require a directory with global depth d = 4, because the two buckets are now designated
1110 and 1111, and hence their local depths are also 4. As a result, the directory size is doubled, and all of
the other original locations that are included are divided into two places, each of which has the same
pointer value as the original location.


Figure 4 depicts structure of the extensible hashing scheme:

Figure 4: Structure of the Extendible Hashing Scheme
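A bare-bones sketch of the extendible-hashing directory lookup is given below. Bucket splitting and directory doubling are deliberately omitted; the global depth of 3 follows the d = 3 example above, while the particular bucket prefixes and the key-to-bits mapping are assumptions made for the sketch.

# A bare-bones extendible-hashing sketch: the directory has 2**d entries
# (d = global depth), and several directory entries share a bucket whenever the
# bucket's local depth (the length of its bit prefix) is smaller than d.
GLOBAL_DEPTH = 3

def hash_bits(key: int) -> str:
    return format(key % 256, "08b")            # pretend hash values are 8-bit strings

# Buckets keyed by their bit prefix; the local depth is the prefix length.
buckets = {
    "000": [], "001": [],                      # local depth 3
    "01": [], "10": [],                        # local depth 2 (each shared by two entries)
    "110": [], "111": [],                      # local depth 3
}

# Directory: one entry per d-bit prefix, pointing at the bucket whose prefix
# matches its leading bits.
directory = {}
for i in range(2 ** GLOBAL_DEPTH):
    entry = format(i, f"0{GLOBAL_DEPTH}b")
    for prefix in buckets:
        if entry.startswith(prefix):
            directory[entry] = prefix

def insert(key: int, record) -> None:
    d_bits = hash_bits(key)[:GLOBAL_DEPTH]     # high-order d bits select the entry
    buckets[directory[d_bits]].append((key, record))

def search(key: int):
    d_bits = hash_bits(key)[:GLOBAL_DEPTH]
    for k, record in buckets[directory[d_bits]]:
        if k == key:
            return record
    return None

insert(42, "record-42")                        # 42 -> 00101010 -> directory entry 001
print(search(42))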

5.3.2 Dynamic Hashing


Dynamic hashing was a forerunner of extendible hashing, in which the addresses of the buckets were
either the n high-order bits or the n - 1 high-order bits, depending on the total number of keys belonging
to the particular bucket. The ultimate storing of records in buckets for dynamic hashing is analogous
to extendible hashing in several ways. The main distinction is in the way the directory is organised.
Whereas extendible hashing employs global depth (high-order d bits) for the flat directory and then
merges nearby collapsible buckets into a bucket of local depth d -1, dynamic hashing keeps a tree-
structured directory with two sorts of nodes:
 Internal nodes: Nodes with two pointers, the left corresponding to the 0 bit (in the hashed address)
and the right corresponding to the 1 bit
 Leaf nodes: These include a reference to the actual bucket containing entries. Four buckets
(“000”, “001”, “110”, and “111”) with high-order 3-bit addresses (corresponding to the global depth
of 3) are displayed, and two buckets (“01” and “10”) with high-order 2-bit addresses are presented
(corresponding to the local depth of 2). The last two are the result of combining “010” and “011” to
get “01” and “100” and “101” to become “10”. In dynamic hashing, the directory nodes are implicitly
utilised to compute the “global” and “local” depths of buckets. The search for a record based on the
hashed address requires traversing the directory tree, which leads to the bucket containing that
record.


5.3.3 Linear Hashing


The goal behind linear hashing is to allow a hash file to automatically increase and reduce its number of
buckets without the requirement for a directory. Assume the file begins with M buckets numbered 0, 1...
M -1 and employs the mod hash function h(K) = K mod M; this hash function is referred to as the initial
hash function hi. Overflow due to collisions is still required and may be addressed by keeping separate
overflow chains for each bucket. When a collision causes an overflow record in any file bucket, the initial
bucket 0 is divided into two buckets: the original bucket 0 and a new bucket M at the end of the file. The
entries in bucket 0 are divided into two buckets using a separate hashing technique hi+1(K) = K mod 2M.
A distinguishing feature of the two hashes. Figure 5 depicts structure of the dynamic hashing scheme:

Figure 5: Structure of the Dynamic Hashing Scheme


A distinguishing feature of the two hash functions hi and hi+1 is that all records hashed to bucket 0 by hi will hash
to either bucket 0 or bucket M by hi+1; this is required for linear hashing to work.
Additional buckets are split in the linear order 1, 2, 3, ... as more collisions result in overflow
records. If enough overflows occur, all of the original file buckets 0, 1,..., M - 1 will have been split,
resulting in the file now having 2M buckets instead of M, with all buckets using the hash function hi+1.
As a result, the overflow records are finally redistributed into regular buckets through the function


hi+1 via a delayed split of respective buckets. There is no directory; all that is required to detect which
buckets have been divided is a variable n, which is initially set to 0 and is increased by 1 every time a
split happens. To retrieve a record with hash key value K, first apply the function hi to K; if hi(K) < n, then
apply the function hi+1 on K since the bucket has already been divided. Initially, n = 0, indicating that
the function hi applies to all buckets; when buckets are separated, n rises linearly. When n equals M after
being incremented, it means that all of the original buckets have been divided, and the hash function
hi+1 applies to all data in the file. At this point, n is reset to 0 (zero), and any subsequent collisions that
result in the overflow will result in the application of a new hashing function hi+2(K) = K mod 4M. In
general, a succession of hashing functions hi+j(K) = K mod (2^j M) is employed, where j = 0, 1, 2, ...; a new
hashing function hi+j+1 is required every time all the buckets 0, 1, ..., (2^j M) - 1 have been split and n is reset to 0.
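The following compact sketch illustrates the linear-hashing bookkeeping just described under several simplifying assumptions: overflow chains are folded into the Python lists that stand for buckets, each overflow triggers exactly one split, and cascading splits during redistribution are ignored. The variable n plays the role of n in the text and level plays the role of j; the initial number of buckets and the bucket capacity are arbitrary.

# A compact linear-hashing sketch: whenever an insertion overflows some bucket,
# the next bucket in linear order (bucket n) is split using the next hash
# function, and n wraps to 0 once every bucket of the current round is split.
M0 = 4                      # initial number of buckets
CAPACITY = 2                # records per bucket before an overflow is signalled
buckets = [[] for _ in range(M0)]
n, level = 0, 0             # n = next bucket to split, level = j in h(K) = K mod (2**j * M0)

def address(key: int) -> int:
    b = key % (2 ** level * M0)                  # apply h_(i+level)
    if b < n:                                    # bucket already split in this round
        b = key % (2 ** (level + 1) * M0)        # apply h_(i+level+1) instead
    return b

def insert(key: int) -> None:
    global n, level
    buckets[address(key)].append(key)
    if len(buckets[address(key)]) > CAPACITY:    # an overflow triggers one split
        buckets.append([])                       # new bucket at the end of the file
        split_keys, buckets[n] = buckets[n], []
        n += 1
        if n == 2 ** level * M0:                 # all buckets of this round are split
            n, level = 0, level + 1
        for k in split_keys:                     # redistribute the split bucket
            buckets[address(k)].append(k)

for k in [3, 7, 11, 15, 19, 23, 2, 6]:
    insert(k)
print(buckets)              # keys spread over 8 buckets after the first round of splits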

5.4 TYPES OF SINGLE-LEVEL ORDERED INDEXES


An ordered index is comparable to a textbook index, which displays essential terms at the conclusion
of the book in alphabetical order, coupled with a list of page numbers where the phrase appears in the
book. We may utilise the book index to retrieve a list of addresses (page numbers, in this case) and
then use these addresses to locate the given pages first and then search for the phrase on each defined
page. If no additional assistance is provided, the alternative would be to sift methodically through
the whole textbook word by word to identify the phrase of interest; this is equivalent to doing a linear
search, which examines the whole file. Of course, most books have extra information, such as chapter
and section names, which allow us to discover a phrase without having to search the entire book. The
index, on the other hand, is the sole precise indication of the pages in the book where each phrase
appears.
For a file with a specific record structure consisting of several fields (or attributes), an index access
structure is usually defined on a single field of the file, called an indexing field (or indexing attribute).
Each index field value is normally stored in the index, along with a list of pointers to all disc blocks that
contain records with that field value. The index values are arranged so that we may do a binary search
on the index. If both the data file and the index file are ordered, and because the index file is often
considerably smaller than the data file, a binary search is a superior alternative for searching the index.
Tree-structured multilevel indexes extend the binary search idea, which reduces the search space by
two-way partitioning at each step, to an n-ary approach that divides the search
space n-ways at each stage.
Ordered indexes are classified into several categories. A primary index is specified on the ordering key field of
an ordered file of records. The ordering key field is used to physically order the file records on
disc, and it has a unique value for each record. If the ordering field is not a key field, that is, if several
entries in the file might have the same ordering field value, a clustering index can be utilised. In this
situation, the data file is referred to as a clustered file. Because a file can only have one physical ordering
field, it can only have one main index or one clustering index, but not both. A secondary index, the
third form of index, can be set on any non-ordering field of a file. In addition to its principal access
mechanism, a data file might have many secondary indexes.

5.4.1 Primary Indexes


The primary index is an ordered file with fixed-length records and two fields that serves as an access
structure to quickly search for and retrieve data records in a data file. The first field is of the same data
type as the data file's ordering key field, known as the primary key, while the second field is a pointer
to a disc block (a block address). Each block in the data file has one index entry (or index record) in the
index file. Each index entry has two field values: the value of the primary key field for the first
record in a block and a pointer to that block; we refer to the two field values of index entry i as <K(i), P(i)>.


Figure 6 depicts some blocks of an ordered (sequential) file of EMPLOYEE records:

Figure 6: Some Blocks of an Ordered (Sequential) File of EMPLOYEE Records with Name as the Ordering Key Field


We use the Name field as the primary key to establish a primary index on the ordered file shown in Figure 6,
since Name is the ordering key field of the file (assuming that each value of Name is unique). Each index
entry has a Name value and a pointer. The first three index entries are as follows:
<K(1) = (Aaron, Ed), P(1) = address of block 1>
<K(2) = (Acosta, Marc), P(2) = address of block 2>
<K(3) = (Alexander, Ed), P(3) = address of block 3>
This primary index is depicted in Figure 7. The total number of index entries equals the number
of disc blocks in the ordered data file. The first record in each data file block is known as the anchor
record of the block, or simply the block anchor.
Indexes are also classified as dense or sparse. Every search key value (and hence every record) in the
data file has an index entry in a dense index. In contrast, a sparse (or non-dense) index has index entries
for just part of the search values. The number of items in a sparse index is less than the number of
records in the file. Thus, the main index is a non-dense (sparse) index because it contains entries for
each disc block of the data file and the keys of its anchor record rather than for each search value (or
every record).


The index file for a primary index occupies far less space than the data file, for two reasons. First, there
are fewer index entries than records in the data file. Second, each index entry is typically smaller than a
data record because it has only two fields; consequently, more index entries than data records can fit in
a single block. A binary search on the index file therefore requires fewer block accesses than a binary
search on the data file, as shown in Table 1:
Table 1: Average Access Times for a File of b Blocks under Basic File Organisations
Type of Organisation    Access/Search Method               Average Blocks to Access a Specific Record
Heap (unordered)        Sequential scan (linear search)    b/2
Ordered                 Sequential scan                    b/2
Ordered                 Binary search                      log2 b

Note that a binary search on the ordered data file requires about log2 b block accesses. However,
if the primary index file occupies only bi blocks, then locating a record with a given search key value requires
a binary search of the index plus one access to the data block, as explained after Figure 7.
The primary index on the ordering key field of the file is shown in Figure 7:

Figure 7: Primary Index on the Ordering Key Field of the File Shown in Figure 6


A binary search of that index plus an access to the block containing the record are required: a total
of (log2 bi) + 1 block accesses.

A record with the primary key value K is located in the block with the address P(i), where K(i) ≤ K < K (i +
1). Because of the physical ordering of the file entries based on the primary key field, the ith block in the
data file includes all such records. To obtain a record, we do a binary search on the index file to discover
the relevant index entry i and then retrieve the data file block whose address is P(i). This shows the
reduction in block accesses that can be obtained when the primary index is used to search for a record.

The addition and deletion of records is a key issue with the main index, as it is with any ordered file.
With the main index, the problem is exacerbated since inserting a record in its right place in the data
file necessitates not only relocating records to make room for the new record but also changing certain
index entries, because shifting records changes the anchor records of some blocks. This problem can
be mitigated by using an unordered overflow file. Another option is to utilise a linked list of overflow
entries for each data file block. This is comparable to how hashing is used to cope with overflow records.
To optimise retrieval performance, records within each block and its overflow linked list can be sorted.
Deleted records are handled using deletion marks.
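As a small illustration of the sparse primary index just described, the sketch below keeps one <K(i), P(i)> entry per data block (the block anchor and the block number) and locates a record with a binary search over the index followed by a single block access. The block size and the sample names are illustrative, not taken from the figures.

# A sparse primary index in miniature: one (anchor_key, block_number) entry per
# block of the ordered data file; a binary search over the anchors finds the
# only block that can contain a given key.
from bisect import bisect_right

data_blocks = [
    ["Aaron, Ed", "Abbott, Diane", "Acosta, Marc"],
    ["Adams, John", "Adams, Robin", "Akers, Jan"],
    ["Alexander, Ed", "Alfred, Bob", "Allen, Sam"],
]
index = [(block[0], block_no) for block_no, block in enumerate(data_blocks)]
anchors = [k for k, _ in index]

def find(key: str):
    """Return the number of the block containing key, or None if key is absent."""
    pos = bisect_right(anchors, key) - 1     # binary search over the block anchors
    if pos < 0:
        return None
    block_no = index[pos][1]
    return block_no if key in data_blocks[block_no] else None   # one data-block access

print(find("Adams, Robin"))   # -> 1 (the second block)
print(find("Zara, Nobody"))   # -> None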

5.4.2 Clustering Indexes


If file entries are physically sorted on a non-key field with no distinct value for each record, that field is
referred to as the clustering field, and the data file is referred to as a clustered file. To speed up retrieval
of all records that have the same value for the clustering field, we may develop a different sort of index
called a clustering index. The main index, on the other hand, demands that the ordering field of the data
file have a unique value for each entry.

A clustering index is similarly an ordered file with two fields in which, the first is of the same type
as the data file’s clustering field, and the second is a disc block pointer. Each different value of the
clustering field has one item in the clustering index, which contains the value as well as a link to the first
block in the data file that has a record with that value for its clustering field. Because the data records
are physically ordered, record addition and deletion continue to cause issues. To relieve the insertion
difficulty, it is standard practice to reserve a whole block (or a cluster of contiguous blocks) for each
value of the clustering field; all entries with that value are inserted in the block (or block cluster). This
makes insertion and deletion quite simple.

5.4.3 Secondary Indexes


A secondary index gives a secondary method of accessing a data file that already has a primary method
of access. The records in the data file might be sorted, unordered, or hashed. The secondary index can
be built on a candidate key field with a unique value in every record, or on a non-key field with duplicate
values. The index, like the previous one, is an ordered file with two fields. The first field is of the same
data type as an indexing field in a non-ordering field of the data file. The second field might be either
a block or a record reference. Many secondary indexes (and hence indexing fields) can be built for the
same file each one offers an extra way to access that file depending on a certain field.


We first consider a secondary index access structure on a key (unique) field that has a distinct value for every record.
A field of this type is commonly referred to as a secondary key; in the relational model, this would
correspond to any UNIQUE key attribute or to the primary key attribute of a table. In this case, each
record in the data file has one index entry, which contains the value of the field for the record as well as
a reference to the block in which the record is placed or to the record itself. Hence, such an index is dense.
Figures 8 and 9 show the clustering index structures discussed in the previous subsection, and Figure 10 shows a dense secondary index:

Figure 8: A Clustering Index on the Dept_number Ordering Non-Key Field of an EMPLOYEE File

Again, we refer to the two field values of index entry i as <K(i), P(i)>. The entries are ordered by the value
of K(i), allowing us to perform a binary search on the index. We cannot use block anchors because the
records in the data file are not physically ordered by the values of the secondary key field. As a result,
an index entry is created for each record in the data file.


Figure 9 depicts a clustering index with a separate block cluster:

Clostering order with a seperate (Clustening Data File


block cluster for each group of field)
records that share the same value Deps number Name Ssn Job Birth_date Salary
for the clustering field.
1
1
1

Block pointer

NULL pointer
2
2

Block pointer
NULL pointer
3
3
3
3
Index File Block pointer
(<K(i),P(i)> entries)
3
Clustening Block Block pointer
field value pointer

1 NULL pointer
4
2 4
3
4 Block pointer
5
5 NULL pointer
6
8 5
5
5
Block pointer

NULL pointer
6
6
6
6
Block pointer

Block pointer

NULL pointer
8
8
8

Block pointer

NULL pointer

Figure 9: Clustering Index with Separate Block Cluster

In contrast to the primary index, which has one entry per data block, such a secondary index has one entry
for every record in the file. Figure 10 shows a secondary index in which the pointers P(i) in the index entries
are block pointers rather than record pointers. Once the appropriate disc block is transferred to a main memory
buffer, a search for the desired record within the block can be carried out.


Figure 10 depicts a dense secondary index:

Figure 10: A Dense Secondary Index (with Block Pointers) on a Non-Ordering Key Field of a File

Because of the increased number of items, a secondary index often requires more storage space and a
longer search time than a primary index. However, the improvement in search time for an arbitrary
record is significantly higher for a secondary index than for the main index, since if the secondary index


did not exist, we would have to execute a linear search on the data file. For the primary index, by contrast,
we could still perform a binary search on the main (ordered) file even if the index did not exist.
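The sketch below illustrates a dense secondary index on a non-ordering key field under the same simplifications: one index entry per record, kept sorted by the secondary key, with each P(i) stored as a block number rather than a record pointer. The field names and sample data are made up for the example.

# A dense secondary index in miniature: the data file is in arbitrary physical
# order, and the index has one (secondary_key, block_number) entry per record.
from bisect import bisect_left

blocks = [
    [{"Ssn": "13", "Name": "C"}, {"Ssn": "8", "Name": "A"}],
    [{"Ssn": "5", "Name": "B"}, {"Ssn": "21", "Name": "D"}],
]

index = sorted(
    (rec["Ssn"], block_no) for block_no, block in enumerate(blocks) for rec in block
)
keys = [k for k, _ in index]

def lookup(ssn: str):
    pos = bisect_left(keys, ssn)              # binary search on the dense index
    if pos == len(keys) or keys[pos] != ssn:
        return None
    block = blocks[index[pos][1]]             # one block access
    return next(rec for rec in block if rec["Ssn"] == ssn)

print(lookup("21"))   # -> {'Ssn': '21', 'Name': 'D'}
print(lookup("99"))   # -> None

Without this index, finding Ssn "21" would require a linear scan of the unordered data file.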

5.5 MULTI-LEVEL INDEXES


So far, we’ve discussed indexing techniques that use an ordered index file. A binary search is performed
on the index to find pointers to a disc block or a record (or records) in the file with a given index field
value. For an index of bi blocks, a binary search consumes about (log2 bi) block visits since each step
of the algorithm lowers the portion of the index file that we continue to search by a factor of two. This
is why we use the log function with a base of 2. The idea behind a multilevel index is to reduce the
portion of the index that we continue to search by bfri, the blocking factor for the index, which is larger than 2.
As a result, the search space is decreased significantly more quickly.

The value bfri is known as the fan-out of the multilevel index, and it is denoted by the symbol fo. Unlike a
binary search, which divides the record search space into two halves at each step, the multilevel index
divides it n-ways (where n = the fan-out) at each search step. If the fan-out is more than 2, searching
a multilevel index consumes roughly (logfobi) block accesses, which is a much lower amount than
searching a binary index. Most of the time, the fan-out is substantially greater than 2. A multilayer
index regards the index file, which we will now refer to as the first (or base) level of a multilevel index, as
an ordered file with a unique value for each K. (i).

As a result, by treating the first-level index file as a sorted data file, we may generate a main index for
the first level; this index to the first level is referred to as the multilevel index’s second level. Because the
second level is the main index, we may utilise block anchors to have one item in the second level for each
block in the first level. Because all index entries are the same size (one field value and one block address),
the blocking factor bfri for the second level (and all future levels) is the same as for the first. If the first
level of the index contains r1 entries and the blocking factor (also known as the fan-out) for the index is
bfri = fo, then the first level requires ⌈(r1/fo)⌉ blocks, which is the number of entries r2 required at the
second level of the index.

This procedure can be repeated for the second level. The third level, which is a main index for the second
level, contains one entry for each second-level block, so r3 = ⌈(r2/fo)⌉. We need a second level only if
the first level requires more than one block of disc storage, and we need a third level only if the second
level requires more than one block. We can repeat this method until all of the entries of some index level
t fit in a single block.

This single block at level t is called the top index level. Because each level reduces the number of entries at the
preceding level by a factor of fo (the index fan-out), we can determine t from the condition 1 ≤ (r1/((fo)^t)).
Hence, a multilevel index with r1 first-level entries has approximately t levels, where t = ⌈logfo(r1)⌉. A
single disc block is retrieved at each level of the index, so an index search accesses t disc
blocks, where t is the number of index levels.

The multilevel scheme described here can be used on any type of index, whether primary, clustering,
or secondary, as long as the first-level index has distinct values for K(i) and fixed-length entries.
Figure 11 shows a multilevel index built over a primary index.


Figure 11 illustrates how using a multilevel index to search for a record reduces the number of blocks
accessed:

Figure 11: A Two-Level Index over a Data File
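The short numeric sketch below works through the level count t = ⌈logfo(r1)⌉ for assumed values of r1 and fo, printing how many blocks each level occupies until a single top-level block remains; the numbers 1500 and 68 are arbitrary example values.

# A quick numeric sketch of the multilevel-index level count: each level holds
# one entry per block of the level below, so the number of blocks shrinks by
# the fan-out fo at every level until one block remains.
import math

def index_levels(r1: int, fo: int) -> int:
    """t = ceil(log_fo(r1)), the number of index levels for r1 first-level entries."""
    return max(1, math.ceil(math.log(r1, fo)))

r1, fo = 1500, 68               # e.g. 1500 first-level entries and a fan-out of 68
entries = r1
while entries > 1:
    blocks = math.ceil(entries / fo)
    print(f"a level with {entries} entries occupies {blocks} block(s)")
    entries = blocks

t = index_levels(r1, fo)
print(f"t = {t} index levels, so a search reads t + 1 = {t + 1} blocks "
      f"(t index blocks plus one data block)")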

5.6 DYNAMIC MULTI-LEVEL INDEXES USING B TREES AND B+ TREES


B-trees and B+ trees are variants of the well-known tree search data structure. We provide a basic
overview of the terminology used while discussing tree data structures. A tree is made up of nodes.
Except for a unique node called the root, each node in the tree has one parent node and zero or more
child nodes. There is no parent node for the root node. A leaf node is one that has no child nodes; a non-
leaf node is known as an internal node. A node’s level is always one higher than that of its parent, with
the root node’s level being zero. A node’s subtree is made up of that node and all of its descendent nodes
its child nodes, the child nodes of its child nodes, and so on. A subtree is precisely defined as consisting


of a node n and the subtrees of all of n's child nodes. In Figure 12, the root node is A and its child nodes
are B, C, and D. Nodes E, J, C, G, H, and K are leaf nodes. Because the leaf nodes are located at different levels of
the tree, the tree is said to be unbalanced. Figure 12 shows such an unbalanced tree:

Figure 12: A Tree Data Structure Showing an Unbalanced Tree (Nodes E, J, C, G, H, and K Are Leaf Nodes)

We start with search trees and then go on to B-trees, which may be used as dynamic multilevel indexes
to help direct the search for entries in a data file. B-tree nodes are kept between 50 and 100 percent filled,
and data block references are saved in both internal and leaf nodes of the B-tree structure. B+-trees are a
variant of B-trees in which pointers to a file’s data blocks are only maintained in leaf nodes, resulting in
fewer levels and higher-capacity indexes. The B+-tree structure is the most often used indexing structure
in today’s DBMSs.

5.6.1 B-Trees and Search Trees


A search tree is a form of a tree that guides the search for a record based on the value of one of the
record’s fields. The multi-level indexes are similar to search trees in that each node in the multilevel
index can hold up to fo pointers and fo key values, where fo is the index fan-out. The index field values in
each node direct us to the next node until we reach the data file block containing the necessary entries.
By following a pointer, we limit our search to a subtree of the search tree at each level and ignore all
nodes that are not in this subtree. Search trees: a search tree differs somewhat from a multilevel
index. A search tree of order p is a tree in which each node holds at most p - 1 search values and p
pointers in the order

<P1, K1, P2, K2, ..., Pq-1, Kq-1, Pq>, where q ≤ p. Each Pi is a pointer to a child node (or a NULL pointer), and
each Ki is a search value from some ordered set of values. All search values are assumed to be unique.
Two constraints must hold at all times on the search tree:
 Within each node, K1 < K2 < ... < Kq−1
 For all values X in the subtree pointed at by Pi, we have Ki−1 < X < Ki for 1 < i < q; X < Ki for i = 1; and
Ki−1 < X for i = q


Whenever we look for a value X, we use the calculations in condition 2 above to follow the relevant
pointer Pi. Figure 13 shows a search tree of order p = 3 with integer search values. It should be noted that some
of the pointers Pi in a node may be NULL pointers.

To look for entries in a disc file, we can use a search tree as a technique. The values in the tree can be
the contents of one of the file’s fields, known as the search field (which is the same as the index field if a
multilevel index guides the search). Each key value in the tree corresponds to a reference to the record in
the data file that contains that value. The pointer might also be to the disc block containing that record.
By designating each tree node to a disc block, the search tree may be saved on a disc. When a new record
is added to the file, we must update the search tree by adding an entry to the tree that contains the new
record’s search field value as well as a reference to the new record. Figure 13 depicts a node on a search
tree with pointers:

[Figure content: (a) a node in a search tree, with tree pointers P1 ... Pq to the subtrees below it and search values K1 ... Kq−1; values X in the subtree pointed at by Pi satisfy Ki−1 < X < Ki. (b) A search tree of order p = 3 with integer search values; some tree pointers are NULL. Note: these restrictions can be relaxed if the index is on a nonkey field; duplicate search values may then exist, and the node structure and the navigation rules in the tree may be modified.]

Figure 13: A Node in a Search Tree with Pointers

Algorithms are required for inserting search values into and deleting them from the search tree while maintaining the preceding two constraints. In general, these algorithms do not guarantee that the search tree is balanced, meaning that all of its leaf nodes are at the same level. The tree in Figure 12, for example, is unbalanced because it has leaf nodes at levels 1, 2, and 3. The goals for balancing a search tree are as follows:
 To ensure that nodes are distributed equally, such that the depth of the tree is reduced for the given
set of keys and the tree is not skewed with some nodes at extremely deep levels.
 To uniformly speed up the search, such that the average time to discover any random key is nearly
the same.

While limiting the number of levels in the tree is one goal, another implicit goal is to ensure that the index tree does not require excessive restructuring when records are inserted into and deleted from the main file. As a result, we want the nodes to be as full as possible, and we do not want nodes to become nearly empty because of too many deletions. When records are deleted, certain nodes in the tree may become practically empty, wasting storage space and increasing the number of levels. The B-tree addresses both of these issues by imposing additional constraints on the search tree.


5.6.2 B-Trees
Additional constraints in the B-tree ensure that the tree is always balanced and that the space wasted by deletion, if any, never becomes excessive. However, in order to maintain these constraints, the insertion and
deletion algorithms grow increasingly complicated. Nonetheless, most insertions and deletions are
straightforward procedures; they get hard only in rare cases, such as when we try an insertion into an
already full node or a deletion from a node that leaves it less than half full. In more technical terms, a
B-tree of order p can be described as follows when used as an access structure on a key field to search
for entries in a data file:
 Each internal node in the B-tree is of the form:
<P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq−1, Prq−1>, Pq>
where q ≤ p. Each Pi is a tree pointer, that is, a pointer to another node in the B-tree. Each Pri is a data pointer, that is, a pointer to the record whose search key field value is equal to Ki (or to the data file block containing that record).
 Within each node, K1 < K2 < ... < Kq−1.
 For all search key field values X in the subtree pointed at by Pi (the ith subtree), we have:
Ki−1 < X < Ki for 1 < i < q; X < Ki for i = 1; and Ki−1 < X for i = q.
 Each node has at most p tree pointers. Figure 14 depicts (a) a node in a B-tree with q−1 search values and (b) a B-tree of order p=3:

[Figure content: (a) a B-tree node with q − 1 search values K1 ... Kq−1, q tree pointers P1 ... Pq, and q − 1 data pointers Pr1 ... Prq−1; values X in the subtree pointed at by Pi satisfy Ki−1 < X < Ki. (b) A B-tree of order p = 3; the values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6.]

Figure 14: (a) A Node in a B-Tree with q−1 Search Values and (b) A B-Tree of Order p=3
 Except for the root and leaf nodes, each node has at least ⌈p/2⌉ tree pointers. Unless it is the sole node in the tree, the root node has at least two tree pointers.
 A node that holds q tree pointers, q ≤ p, has q – 1 search key field values (and hence q – 1 data
pointers).
 All of the leaf nodes are on the same level. The structure of leaf nodes is the same as that of internal
nodes, except that all of their tree pointers Pi are NULL.


Figure 14 above depicts a B-tree of order p = 3. Because we assumed that the tree will be used as an access structure on a key field, all search values K in the B-tree are unique. If we use a B-tree on a non-key field, we must change the definition of the file pointers Pri to point to a block, or a cluster of blocks, that contains pointers to the file records. This extra level of indirection is analogous to the one used for secondary indexes.

A B-tree starts with a single root node (which is also a leaf node) at level 0 (zero). When the root node is filled
with p – 1 search key values and we try to put another item into the tree, it divides into two nodes at level
1. Only the middle value is retained in the root node, while the remaining values are distributed evenly
across the other two nodes. When a non-root node becomes full and a new entry is entered into it, the
node is split into two nodes at the same level, and the middle entry is relocated to the parent node along
with two pointers to the newly split nodes. If the parent node is full, it is divided as well. Splitting can
spread all the way to the root node, resulting in the formation of a new level if the root is split.

If a node becomes less than half-filled due to the deletion of a value, it is merged with its nearby nodes,
and this can even spread all the way to the root. As a result, deletion has the potential to lower the
number of tree levels. Analyses and simulations have revealed that after several random insertions
and deletions on a B-tree, the nodes are around 69 percent filled when the number of values in the
tree stabilises. This also applies to B+ trees. If this occurs, node splitting and combining will occur only
seldom, making insertion and deletion relatively efficient. If the number of values increases, the tree
will expand without issue, though node splitting may occur, making some insertions take longer. Each B-tree node can hold at most p tree pointers, p − 1 data pointers, and p − 1 search key field values.

In general, a B-tree node may carry extra information required by the tree-manipulating algorithms,
such as the number of entries q in the node and a reference to the parent node.

5.6.3 B+-Trees
The majority of dynamic multilevel index implementations employ a variant of the B-tree data structure
known as a B +-tree. Every value of the search field appears once in the tree, along with a data pointer,
in a B-tree. In a B+-tree, data pointers are kept only at the tree's leaf nodes; hence, the structure of leaf nodes differs from the structure of internal nodes. If the search field is a key field, the leaf nodes include
an entry for each value of the search field, as well as a data reference to the record (or to the block that
contains this record). The pointer for a non-key search field leads to a block containing pointers to data
file records, adding an extra degree of indirection. The leaf nodes of the B+-tree are often connected to
allow ordered access to the entries via the search field. These leaf nodes are analogous to an index’s
initial (base) level. The B+-tree’s internal nodes correspond to the other levels of a multilevel index. To aid
the search, some search field values from the leaf nodes are repeated in the interior nodes of the B+-tree.
The internal nodes of a B+-tree of order p have the following structure:
 Each internal node is of the form
<P1, K1, P2, K2, ..., Pq−1, Kq−1, Pq>
where q ≤ p and each Pi is a tree pointer
 Within each internal node, K1 < K2 < ... < Kq−1


 For all search field values X in the subtree pointed at by Pi , we have Ki−1 < X ≤ Ki for 1 < i < q; X ≤ Ki
for i = 1; and Ki−1 < X for i = q
 Each internal node has at most p tree pointers.
 Each internal node, except the root, has at least [(p/2)] tree pointers. The root node has at least two
tree pointers if it is an internal node.
 An internal node with q pointers, q ≤ p, has q − 1 search field values.
 The structure of the leaf nodes of a B+-tree of order p (shown in Figure 15) is as follows: each leaf node is of the form
<<K1, Pr1>, <K2, Pr2>, ..., <Kq−1, Prq−1>, Pnext>
where q ≤ p, each Pri is a data pointer, and Pnext points to the next leaf node of the B+-tree. Figure 15 depicts (a) an internal node of a B+-tree with q−1 search values and (b) a leaf node of a B+-tree with q−1 search values and q−1 data pointers:

[Figure content: the nodes of a B+-tree. (a) An internal node of a B+-tree with q − 1 search values K1 ... Kq−1 and q tree pointers P1 ... Pq; values X in the subtree pointed at by Pi satisfy Ki−1 < X ≤ Ki. (b) A leaf node of a B+-tree with q − 1 <Ki, Pri> pairs of search values and data pointers, plus a Pnext pointer to the next leaf node in the tree.]
Figure 15: (a) Internal Node of a B+-Tree with q−1 Search Values and (b) Leaf Node of a B+-Tree with q−1 Search Values and q−1 Data Pointers
 Within each leaf node, K1 ≤ K2 ≤ ... ≤ Kq−1, q ≤ p.
 Each Pri is a data pointer that points to the record whose search field value is Ki, or to a file block that contains the record (or to a block of record pointers that point to records whose search field value is Ki, if the search field is not a key).
 Each leaf node has at least ⌈p/2⌉ values.
 All of the leaf nodes are on the same level.

Pointers in internal nodes are tree pointers to blocks that are tree nodes, whereas pointers in leaf nodes are data pointers to data file records or blocks; the exception is the Pnext pointer, which is a tree pointer to the next leaf node. By starting at the leftmost leaf node and following the Pnext pointers, the leaf nodes can be traversed as a linked list. This provides ordered access to the data records on


the indexing field. A Pprevious pointer can also be included. For a B+-tree on a non-key field, an extra level of indirection is needed, so the Pr pointers are block pointers to blocks that contain a set of record pointers to the actual records in the data file.

Because entries in a B+-tree's internal nodes contain search values and tree pointers but no data pointers, an internal node of a B+-tree can hold more entries than a similar B-tree node. Thus, for the same block (node) size, the order p of a B+-tree is larger than that of a B-tree, which can lead to fewer B+-tree levels and reduced search time. Because the structures of internal and leaf nodes of a B+-tree differ, their orders can also differ. We use p to denote the order of the internal nodes and pleaf to denote the order of the leaf nodes, which we define as the maximum number of data pointers in a leaf node.
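Because most relational DBMSs implement their ordinary indexes as B+-trees, the linked leaf level is what makes range and ordered access efficient in practice. The following sketch uses standard SQL on the EMPLOYEE file introduced in the next section; the index name is illustrative, and whether the optimizer actually uses the index depends on the particular system and its statistics:

CREATE INDEX emp_salary_idx ON EMPLOYEE (Salary);

SELECT Ssn, Salary
FROM EMPLOYEE
WHERE Salary BETWEEN 40000 AND 50000
ORDER BY Salary;

To answer this query, the system can descend the B+-tree once to the first leaf entry with Salary >= 40000 and then follow the Pnext pointers along the leaf level, so the matching records are retrieved in sorted order without a separate sorting step.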

5.7 INDEXES ON MULTIPLE KEYS

So far, we have assumed that the primary or secondary keys used to access files were single attributes (fields). However, many retrieval and update requests involve multiple attributes. If a certain combination of attributes is used frequently, it is advantageous to set up an access structure that provides efficient access by a key value that is a combination of those attributes.

Consider an EMPLOYEE file with the attributes Dno (the department number), Age, Street, City, Zip code, Salary, and Skill code, as well as the key SSN (Social Security number). Consider the following query: list the employees in department number 4 whose age is 59. Note that both Dno and Age are non-key attributes, which means that a search on either of these will retrieve many records. Alternative search strategies to consider include:
 Assuming Dno has an index but Age does not, use the index to retrieve the records with Dno = 4 and
then choose the records that fulfill Age = 59 from among them.
 Alternatively, if Age is indexed but Dno is not, use the index to find the records with Age = 59 and then
choose the records that fulfil Dno = 4 from among them.
 If indexes on Dno and Age have been constructed, both indexes may be utilised; each delivers a set
of data or a set of pointers (to blocks or records). When these two sets of records or pointers are
intersected, the records or pointers that fulfil both requirements are returned.

All of these alternatives eventually give the correct result. However, if the number of records that satisfy either condition individually (Dno = 4 or Age = 59) is large, yet only a few records satisfy the combined condition, then none of the above techniques is efficient for this search request. A number of possibilities exist for using the combination <Dno, Age> or <Age, Dno> as a search key made up of multiple attributes. We discuss these techniques in the following sections. Keys containing multiple attributes are referred to as composite keys.
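As a sketch in standard SQL (the index name is illustrative, and the physical structure chosen for the index is up to the DBMS), a composite access structure for the example above can be declared so that the combined condition is evaluated against a single index:

CREATE INDEX emp_dno_age_idx ON EMPLOYEE (Dno, Age);

SELECT Ssn
FROM EMPLOYEE
WHERE Dno = 4 AND Age = 59;

With such an index, only the entries whose composite key value is <4, 59> need to be located, rather than all employees in department 4 or all employees aged 59.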

5.7.1 Multiple Attribute Ordered Index

All of the preceding discussion still applies if we create an index on a search key field that is a combination <Dno, Age>. The search key is then a pair of values, such as <4, 59>. In general, if an index is created on n attributes <A1, A2, ..., An>, the search key values are tuples of n values: <v1, v2, ..., vn>. A lexicographic ordering of these tuple values establishes an order on this composite search key. For our example, all of the composite keys for department number 3 precede those for department number 4. As a


result, <3, n> precedes <4, m> for any values of m and n. For keys with Dno = 4, the ascending key sequence orders the entries by increasing Age, and so on. Lexicographic ordering works in the same way that the ordering of character strings does.
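One practical consequence of this lexicographic ordering, sketched below in standard SQL, is that an ordered index on <Dno, Age> also groups together all entries with the same Dno, so it can support a condition on the leading attribute alone; it is of little help for a condition on Age alone, because equal Age values are scattered across the different Dno groups (whether a given optimizer exploits this depends on the system):

SELECT Ssn FROM EMPLOYEE WHERE Dno = 4;   -- entries for Dno = 4 are contiguous in the <Dno, Age> index
SELECT Ssn FROM EMPLOYEE WHERE Age = 59;  -- entries for Age = 59 are spread throughout the index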

5.8 CONCLUSION

 An index is a subset of a database table that is sorted by key values.

 Another sort of main file structure is based on hashing, which allows for extremely rapid access to
entries under specific search conditions.
 Internal hashing is the term used for hashing applied to internal (in-memory) files.
 External hashing refers to hashing applied to disc files.
 A bucket can be either a single disc block or a group of contiguous disc blocks.
 In extendible hashing, a directory (an array of 2^d bucket addresses) is kept, where d is the directory's global depth.
 Dynamic hashing was a forerunner of extendible hashing, in which the addresses of the buckets were either the n high-order bits or the n − 1 high-order bits.
 The goal behind linear hashing is to allow a hash file to automatically increase and reduce its
number of buckets without the requirement for a directory.
 The primary index is an ordered file with fixed-length records and two fields that serves as an access
structure to quickly search for and retrieve data records in a data file.
 A clustering index is similarly an ordered file with two fields in which, the first is of the same type as
the data file’s clustering field, and the second is a disc block pointer.
 If file entries are physically sorted on a non-key field with no distinct value for each record, that field
is referred to as the clustering field.
 A secondary index gives a secondary method of accessing a data file that already has a primary
method of access.
 A search tree is a form of a tree that guides the search for a record based on the value of one of the
record’s fields.
 The majority of dynamic multilevel index implementations employ a variant of the B-tree data
structure known as a B +-tree.

5.9 GLOSSARY

 Index: A subset of a database table that is sorted by key values

 Hashing: It allows for extremely rapid access to entries under specific search conditions
 Internal hashing: Hashing applied to internal (in-memory) files
 External hashing: The process of hashing disc files
 Bucket: A single disc block or a group of contiguous disc blocks
 Extendible hashing: A scheme in which a directory (an array of 2^d bucket addresses) is kept, where d is the directory's global depth


 Dynamic hashing: A forerunner of extendible hashing, in which the addresses of the buckets were either the n high-order bits or the n − 1 high-order bits
 Linear hashing: It allows a hash file to automatically increase and reduce its number of buckets
without the requirement for a directory
 Primary index: An ordered file with fixed-length records and two fields that serve as an access
structure to quickly search for and retrieve data records in a data file
 Clustering index: An ordered file with two fields in which, the first is of the same type as the data
file’s clustering field, and the second is a disc block pointer
 Secondary index: A secondary method of accessing a data file that already has a primary method
of access
 Search tree: A form of a tree that guides the search for a record based on the value of one of the
record’s fields
 B+-tree: A variant of the B-tree data structure employed by the majority of dynamic multilevel index implementations

5.10 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions

1. Indexes are created using various data structures such as B+ trees, balanced trees, and hashes.
Describe the concept of indexes.
2. For internal files, hashing is typically implemented as a hash table using an array of records. Explain the importance of the hashing technique.
3. Evaluate the concept of primary indexes.
4. A clustering index is similarly an ordered file with two fields in which, the first is of the same type as
the data file’s clustering field, and the second is a disc block pointer. Discuss.
5. The B+-tree’s internal nodes correspond to the other levels of a multilevel index. Describe the concept
of B+-tree.

5.11 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

Hints for Essay Type Questions

1. An index (IDX) is a data structure established on columns in a database table to significantly speed
up data retrieval operations in database systems. Refer to Section Introduction
2. Another sort of main file structure is based on hashing, which allows for extremely rapid access
to entries under specific search conditions. This type of arrangement is sometimes referred to as a
hash file. Refer to Section Hashing Techniques
3. The primary index is an ordered file with fixed-length records and two fields that serves as an access
structure to quickly search for and retrieve data records in a data file. Refer to Section Types of
Single-Level Ordered Indexes

4. If file entries are physically sorted on a non-key field with no distinct value for each record, that field
is referred to as the clustering field, and the data file is referred to as a clustered file. Refer to Section
Types of Single-Level Ordered Indexes
5. The majority of dynamic multilevel index implementations employ a variant of the B-tree data
structure known as a B +-tree. Refer to Section Dynamic Multi-Level Indexes Using B Trees and B+
Tree

5.12 POST-MODULE READING MATERIAL

 http://rajaghoshtech2.blogspot.com/2010/03/advantages-and-disadvantages-of-hashing.html
 http://www.differencebetween.info/difference-between-b-tree-and-b-plus-tree

5.13 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates about the concept of indexing. Also, discuss the importance
of hashing techniques.

UNIT

06 Object-Relational Databases

Names of Sub-Units

Introduction, Overview of Object Relational Database Concepts, Object Relational Features, Object
Database Extensions to SQL, Object Oriented Models

Overview
This unit begins by discussing the concept of an object-relational database. Next, the unit provides an overview of object database concepts and object-relational features. Further, the unit explains the object database extensions to SQL. Towards the end, the unit discusses object-oriented models.

Learning Objectives

In this unit, you will learn to:

 Discuss the concept of object relational database
 Explain the overview of object database concepts
 Describe the concept of object relational features
 Explain the significance of object relational database extensions to SQL
 Discuss the importance of object oriented models

Learning Outcomes

At the end of this unit, you would:

 Evaluate the concept of object relational database
 Assess the overview of object database concepts
 Evaluate the importance of object relational features
 Determine the significance of object relational database extensions to SQL
 Explore the importance of object oriented models

Pre-Unit Preparatory Material

https://www.cs.uct.ac.za/mit_notes/database/pdfs/chp16.pdf

6.1 INTRODUCTION
A Database Management System (DBMS) that combines a relational database (RDBMS) and an object-
oriented database (OODBMS) is known as an object-relational database (ORD). ORD’s schemas and
query language support the core components of any object-oriented database model, such as objects,
classes, and inheritance.

ORD is referred to as the "middleman" between relational and object-oriented databases because it combines traits and characteristics of both models. The basic approach in an ORD is relational, since the data is kept in a conventional database and is processed and accessed via queries written in a query language such as SQL. At the same time, an ORD has an object-oriented aspect in that it can be regarded as an object store, typically used by software built in an object-oriented programming language; in this case, APIs are used to store and access data as objects.

Traditional RDBMS products focus on the effective organisation of data produced from a small number
of data sources. An ORDBMS, on the other hand, offers a capability that allows developers to create and
innovate their own data types and techniques for use in the DBMS. ORDBMS hopes to achieve this by
allowing developers to view the problem area at a higher level of abstraction.

6.2 OVERVIEW OF OBJECT RELATIONAL DATABASE CONCEPTS

The word object-oriented, abbreviated OO or O-O, derives from Object Oriented programming languages,
often known as OOPLs. OO ideas are currently used in databases, software engineering, knowledge
bases, artificial intelligence, and computer systems in general. The SIMULA language, developed in the
late 1960s, is the foundation of OOPLs. Smalltalk, created at Xerox PARC in the 1970s, was one of the first programming languages to explicitly embrace other OO ideas, such as message passing and inheritance. It is classified as a pure OO programming language, which means that it was purposefully


developed to be object-oriented. In contrast, hybrid OO programming languages embed OO features into an existing language. C++, which mixes OO principles into the popular C programming language, is an example of the latter.

An object normally consists of two parts: its state (value) and its behaviour (operations). It can have a
complicated data structure as well as operations that the programmer defines. Objects in an OOPL exist
just during programme execution; hence they are referred to as transitory objects. An OO database can
prolong the existence of objects by storing them permanently in a database, transforming them into
persistent objects that survive after programme termination and can be accessed and shared by other
applications. In other words, OO databases permanently retain persistent items in secondary storage
and allow them to be shared by various programmes and applications. This necessitates the inclusion of
other well-known database management system capabilities, such as indexing techniques to efficiently
locate the objects, concurrency control to allow object sharing across concurrent processes, and failure
recovery. An OO database system will frequently interact with one or more OO programming languages
to offer persistent and distributed object capabilities.

In OOPLs, an object’s internal structure is defined by instance variables, which hold the values that
make up the object’s internal state. In the relational model, an instance variable is analogous to the
idea of an attribute, except that instance variables might be wrapped within the object and so are not
always available to external users. Instance variables can also have data types that are arbitrarily
complex. Object-oriented systems allow you to specify operations or functions (behaviour) that can be
applied to specified types of objects. In fact, some OO models require that all activities that a user may
do on an object be predefined. This amounts to complete encapsulation of objects. For two reasons, most
OO data models have eased this rigorous approach. To begin, database users frequently need to know
the attribute names in order to define selection criteria on the attributes in order to get certain objects.
Second, because perfect encapsulation implies that every basic retrieval requires a specified process, ad
hoc searches are difficult to express on the fly.

An operation is divided into two stages to facilitate encapsulation. The first portion, known as the
operation’s signature or interface, defines the operation name and arguments (or parameters). The
second part, known as the method or body, specifies the operation’s implementation, which is typically
written in a general-purpose programming language. Operations are triggered by sending a message
to an object that includes the operation name and arguments. The method for that action is then
executed by the object. This encapsulation allows for the alteration of an object’s internal structure as
well as the implementation of its operations without disrupting the external programs that execute
these activities. As a result, encapsulation enables data and operational independence.

Type and class hierarchies, as well as inheritance, are important concepts in OO systems. This enables
the definition of new types or classes that inherit a large portion of their structure and/or operations
from previously specified types or classes. This makes it easy to iteratively construct a system’s data
types and reuse existing type definitions when generating new sorts of objects.

One issue in early OO database systems was describing connections between objects. The early OO data
model’s stress on total encapsulation led to the idea that relationships should not be overtly represented


but rather expressed by creating suitable functions that find related objects. However, this strategy
may not work well for complicated datasets with many relationships since it is necessary to identify and
display these links to users.

Another OO notion is operator overloading, which refers to the capacity of an operation to be applied
to several sorts of objects; in such a case, an operation name may refer to numerous separate
implementations, depending on the type of object to which it is applied. This characteristic is also known
as operator polymorphism. For example, the technique (implementation) of calculating the area of a
geometric object may change depending on whether the object is a triangle, circle, or rectangle. This
may need the usage of runtime late binding of the operation name to the proper method after the type
of object to which the operation is applied is known.

6.2.1 Object Identity and Objects versus Literals

One purpose of an ODMS (Object Data Management System) is to maintain a direct connection between
real-world and database objects such that objects retain their integrity and identity and can be readily
identified and worked on. As a result, an ODMS assigns a distinct identity to each independent item
recorded in the database. This distinct identification is often realized through the use of one-of-a-kind,
system-generated object identifier (OID). The value of an OID is hidden from the outside world, but it is
used by the system to identify a specific each object and to develop and maintain inter-object interactions.
When necessary, the OID can be allocated to program variables of the proper kind. The most important
attribute of an OID is that it be immutable; that is, the OID value of a certain object should not change.
This ensures that the identification of the real-world thing being represented is preserved. As a result, an
ODMS must provide a mechanism for creating OIDs while maintaining the immutability requirement.
Each OID should also be used only once; that is, even if an item is deleted from the database, its OID should
not be allocated to another object. These two qualities suggest that the OID should not be dependent on
any of the object’s attribute values, because the value of an attribute can be modified or rectified. This is
analogous to the relational paradigm, in which each relation must have a primary key property whose
value uniquely identifies each tuple. If the value of the primary key is modified in the relational model, the tuple will have a new identity, even though it may still represent the same real-world object. In addition, the key attributes of a real-world object may have different names in different relations, making it difficult to ascertain that the keys refer to the same real-world object (for example, the object identifier may be represented as Emp_id in one relation and as Ssn in another). The physical address of the object
in storage should not be used as the OID since the physical address might change following a physical
reorganization of the database. However, some early ODMSs used the physical address as the OID to
improve item retrieval performance. If the object’s physical address changes, an indirect pointer can be
inserted at the previous address, indicating the object’s new physical position. Long integers are more
commonly used as OIDs, with some type of hash table used to map the OID value to the current physical
address of the object in storage.

Some early OO data models required that everything be represented as an object, from a simple value to
a sophisticated object; hence, every basic value, such as an integer, string, or Boolean value, had an OID.
This allows two identical basic values to have distinct OIDs, which is advantageous in some situations.


For example, the integer number 50 can sometimes be used to represent weight in kilos, and other times
it might represent a person’s age. Then, two basic objects with unique OIDs may be generated, but both
would represent the integer value 50. Although valuable as a theoretical model, this is not particularly
practical because it generates an excessive number of OIDs. As a result, most OO database systems
support the representation of both objects and literals (or values). Every object must have an immutable OID, whereas a literal value has no OID and simply stands for itself. As a result, a literal value is typically stored within an object and cannot be referenced from other objects. If necessary, complex structured literal values can be created in many systems without having a corresponding OID.

6.3 OBJECT-RELATIONAL FEATURES

An Object-relational model is a hybrid of an Object-oriented and Relational database model. As a result,
it enables objects, classes, inheritance, and other features similar to those found in Object Oriented
models, as well as data types, tabular structures, and other features similar to those found in Relational
data models.

One of the primary aims of the object-relational data model is to bridge the gap between relational
databases and the object-oriented approaches used in many programming languages such as C++, C#,
Java, and others.

Some of the key features of an ORDBMS are as follows (a brief SQL sketch follows this list):

 Abstract Data Types (ADTs) specified by the user: ADTs enable the definition of new data types with structures tailored to specific purposes.
 User-definable routines: These enable the creation of customized server functions with much of the
power and functionality expressible in C.
 Inheritance: Inheritance of user-defined types is supported in order to create new sub-types or sub-
classes that inherit all properties and methods from the parent type.
 Encapsulation: Encapsulation is a language method that limits access to part of an object’s
components.
 Object identification: Object identifiers (OIDs) are used to identify things.
 Support for large objects: These are disk-based objects that work similarly to random access files.
They are used by ADTs to store any data that does not fit inside a table row.
 Extensive and adaptable indexing: For multidimensional data, R-tree or Generalized Search Tree
(GiST) indexing allows for quick searching of certain ADTs in a database.
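As a brief sketch of the user-defined type feature mentioned above (syntax varies between products; the form below follows the SQL standard's CREATE TYPE, which systems such as PostgreSQL and Oracle support with their own variations, and the type, table, and column names are purely illustrative):

CREATE TYPE address_t AS (
    street   VARCHAR(40),
    city     VARCHAR(30),
    zip_code VARCHAR(10)
);

CREATE TABLE CUSTOMER (
    cust_id  INTEGER PRIMARY KEY,
    name     VARCHAR(50),
    address  address_t        -- a column whose type is the user-defined ADT
);

Individual components of such a structured column can then be referenced with dot-style notation in queries (for example, the city component of address), although the exact notation again differs from system to system.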

6.4 OBJECT DATABASE EXTENSIONS TO SQL

The object-relational extensions to SQL include row types, collection types, User-Defined Types (UDTs), typed tables, and reference types. The object-relational data model does not support all of the properties of the object-oriented data model.


Many aspects of software development, including database systems, are being influenced by object-oriented technology. The SQL:1999 and SQL:2003 standards added several object-oriented ideas to SQL, as well as support for XML. The data model used by these specifications is known as the object-relational data model. This model combines the traditional concepts of the relational data model with features that enable the construction of classes, objects, and class hierarchies.

The object-relational data model is not a complete object data model because the top-level entries in the database are confined to being tables. Tables can have objects as attribute values, while typed tables contain only objects. Oracle servers implement certain object-relational features and can be used when designing object-oriented application projects that have a database component.
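A minimal sketch of these extensions, using Oracle-style object-relational syntax (the names are illustrative, and the details differ between SQL:1999 implementations): an object type, a typed table whose rows are objects of that type, and a REF column that points to those row objects.

CREATE TYPE person_t AS OBJECT (
    ssn   VARCHAR2(11),
    name  VARCHAR2(50),
    dob   DATE
);

CREATE TABLE PERSON_TAB OF person_t;   -- a typed table: each row is an object with a system-generated OID

CREATE TABLE ACCOUNT (
    acct_no  NUMBER PRIMARY KEY,
    owner    REF person_t SCOPE IS PERSON_TAB   -- a reference to a row object in PERSON_TAB
);

Here the top-level entries in the database are still tables, but the typed table holds row objects that other tables can reference, which illustrates how the object-relational model stops short of a full object data model.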

6.5 OBJECT ORIENTED MODELS

An object data model is a data model based on object-oriented programming that associates methods (procedures) with objects and can benefit from class hierarchies. Thus, "objects" are levels of abstraction that include both attributes and behaviour. An object-oriented data model extends the individual program space towards persistent object management and shareability.

Data and their interactions are contained in a single structure in an Object Oriented Data Model, which
is referred to as an object. Real-world problems are represented as objects with various qualities in this
model. There are various relationships between every object. It is essentially a blend of Object Oriented
Programming and Relational Database Model.

6.5.1 Components of Object Oriented Data Model

The following points describe the components of the object-oriented data model:
 Object: In the object-oriented realm, an object is a real-world entity. Tangible things, such as a person or a car, and intangible entities, such as a project, are both examples of objects.
 Class: A class is an object representation. It describes a collection of items with comparable attributes
and predictable behavior.
 Abstraction: Complexity is reduced by abstraction. It results from noticing similarities between
items. Abstraction occurs when the system emphasizes details that are essential to the user. It
focuses mostly on the object’s outside. When data is abstracted, it is protected by a set of methods,
and only those methods have access to the object’s data.
 Encapsulation: Encapsulation is performed by concealing information or data in order to minimize
complexity and promote reusability. Although the user cannot see inside an object or a class, the object
may be reached by invoking the object’s methods. The concepts of abstraction and encapsulation
are compatible. In Abstraction, the system focuses on object behaviour and functionality. Execution
is the emphasis of encapsulation, which leads to action.


 Hierarchy: The hierarchy depicts the order in which objects in a system are arranged. It also
describes the link between the many components of a system. A class of the hierarchy is made up
of several characteristics and functions. A hierarchy class is made up of a base (parent) class and
derived classes (subclass). The properties of a parent class are passed down to a derived class. A
class can be made of inter-related sub-classes, which in turn can have sub-classes until the smallest
level of components is achieved.
 Modularity: Modularity refers to the division of a program into components or modules in order to
lessen the complexity of the problem. Modularity occurs in large and strong programs with several
classes. The modules aid in the management of complexity. Modularity emphasises implementation, which links it to encapsulation: modularity can be thought of as a way of breaking an encapsulated abstraction down into simpler components. This occurs after a system has been partitioned into modules.

6.5.2 Object Definition Language (ODL)

The ODL is independent of any programming language and is designed to support the semantic constructs
of the ODMG object model. Its primary function is to generate object specifications, such as classes
and interfaces. As a result, ODL is not a complete programming language. A database schema may be
specified in ODL irrespective of any programming language, and then particular language bindings
can be used to indicate how ODL constructs can be translated to constructs in certain programming
languages such as C++, Smalltalk, and Java.

ODL is an object-oriented data-description language. It is an extension of IDL (Interface Description Language), a component of CORBA (Common Object Request Broker Architecture), the standard for networked OO computing, and it was standardised by the ODMG (Object Data Management Group). Object identity (OID) is a departure from the E/R model; attributes, relationships, and methods are examples of ODL properties.

The EYEDB Object Definition Language (ODL) is a specification language for defining object type
specifications based on the ODMG ODL (but not compliant).

ODL is not meant to be a full-fledged programming language. It is a language for defining object
specifications. Traditionally, database management systems give tools for data definition (through a
Data Definition Language (DDL)). The Data Definition Language (DDL) allows users to define data types
and interfaces, while the Data Manipulation Language (DML) allows users to create, remove, read, and
update instances of the data types. ODL is a kind of DDL for objects. It defines type features, such as
attributes and operations. ODL merely specifies the signatures of C++ operations; it does not address
the definitions of the methods that implement those operations. The ODL can specify operations defined
in OQL.

The goal of ODL is to specify object types that may be implemented in a range of computer languages.
As a result, ODL is not bound by the syntax of any single programming language.


Several characteristics distinguish EYEDB ODL from ODMG ODL:
 ODMG ODL defines class attributes, relationships, method signatures, and keys. It supports nested classes, typedef constructs, constant definitions, and exception hints.
 The EYEDB ODL specification defines class attributes, relationships, method signatures, attribute constraints (not null, unique, collection cardinality), index specifications, and trigger declarations. Nested classes, typedef constructs, constant declarations, and exception hints are not supported.
 In EYEDB ODL, an instance of any type can be either a literal or an object. In ODMG ODL, this property is tied to the type: instances of interfaces and classes are objects, while instances of all fundamental types and user-defined structs are literals. In EYEDB ODL, an instance of any type, including the fundamental types, can be an object.
 Finally, using EYEDB ODL you can specify whether a method is run by the server or the client, as well as whether it is a class method or an instance method.

6.5.3 Object Database Conceptual Design

An algorithm can be used to convert a binary-relationship (BR) conceptual schema into an object-oriented (OO) database schema. The BR schema is a highly semantic diagram that represents the domain as objects, relationships, and constraints. It is simple to use and acts as a means of communication between users and designers. As a result, it can be built early in the system development process and then translated into a specific OO database schema. The transformation approach uses a multi-stage algorithm that first detects the main objects in the BR schema, along with their relationships and constraints. These are then mapped to object classes, attributes, and constraints while preserving the semantics and all forms of constraints from the conceptual schema.

Assuming this problem applies to the Relational Database Management System (RDBMS) paradigm, there are three common design stages, as follows:
 Design of an abstract database: This approach necessitates the creation of a data model of all of
the information in a system. The resultant entity-relationship diagram (ERDM) will reflect the real-
world domain that the database will describe, and it will be evaluated and examined for redundancy
to see whether it fits user/business needs.
 Design of a logical database: The design of the database from the ERDM is part of this procedure.
This is frequently accomplished through a process of data normalization (typically to the third
normal form) and entity modeling to construct the tables and relationships that will comprise the
database’s design (or schema). This step does not examine which implementation of the RDBMS
paradigm will be used, such as MySQL, SQL Server, DB2, and so on, or which hardware platform the
database would sit on.
 Design of the physical database: The database’s physical design describes the physical setup of
the database in the chosen RDBMS implementation. This stage entails defining the basic relations,
file organizations, indexes of primary or foreign keys in tables, entity, and referential integrity
requirements, and security measures, among other things.


A database’s conceptual design refers to the first step of the design process, which encompasses
design interactions, experience, and strategies.
Conceptual design involves more than merely expressing small portions of the application in terms of ER (Entity-Relationship) diagrams. For a large corporation, the design may require the efforts of more than one designer and cover data and application code used by a variety of user groups. Using a high-level, semantic data model such as ER diagrams for conceptual design in such an environment has the added benefit of allowing the high-level design to be represented diagrammatically and easily understood by the many individuals who must contribute input to the design process.
The approach used to structure the creation of the overall design and to guarantee that the design
takes into consideration all user needs and is consistent is an essential part of the design process.
A possible method would be to create a distinct conceptual schema for each user group and then merge these conceptual schemas. To integrate diverse conceptual schemas, we must establish correspondences between entities, relationships, and attributes, and resolve a variety of conflicts (e.g., naming conflicts, domain mismatches, and differences in measurement units).

6.5.4 Object Query Language (OQL)

The query language proposed for the ODMG object model is the OQL. It is intended to interact closely
with programming languages that have an ODMG binding, such as C++, Smalltalk, and Java. As a
result, an OQL query embedded in one of these programming languages can return objects that match
that language’s type system. Additionally, the code for class operation implementations in an ODMG
schema may be written in various programming languages. The OQL query syntax is similar to that of
the relational standard query language SQL, but it has extra capabilities for ODMG concepts such as
object identity, complex objects, operations, inheritance, polymorphism, and relationships.

The ODMG query language OQL was purposefully intended to have a syntax similar to SQL in order to
make it simple for SQL users to learn OQL. Let us begin with a query that finds pairings of movies and
theatres in which the movie is displayed and the theatre is showing more than one movie:
SELECT mname: M.movieName, tname: T.theaterName
FROM Movies M, M.shownAt T
WHERE T.numshowing() > 1
The SELECT clause specifies how we name the fields of the result; the two result fields are called mname and tname. The part of this query that differs from SQL is the FROM clause. The variable M is bound, in turn, to each movie in the collection Movies. For a given movie M, we bind the variable T, in turn, to each theatre in the collection M.shownAt. Thus, the path expression M.shownAt lets us easily express a nested query. The following query demonstrates the grouping construct in OQL:
SELECT T.ticketPrice,
avgNum: AVG(SELECT P.T.numshowing() FROM partition P)
FROM Theaters T
GROUP BY T.ticketPrice
For each ticket price, we create the collection of theatres with that ticket price. This collection of theatres is referred to as the partition for that ticket price and is denoted by the OQL keyword partition. In the SELECT clause,


we compute the average number of movies shown at theatres in the partition for each ticket price. OQL
has an intriguing version of the grouping operator that SQL does not have:
SELECT low, high,
avgNum: AVG(SELECT P.T.numshowing() FROM partition P)
FROM Theaters T
GROUP BY low: T.ticketPrice <5, high: T.ticketPrice >=5
The GROUP BY clause now produces only two partitions, called low and high. Each theatre object T is placed in one of these partitions depending on its ticket price. In the SELECT clause, low and high are boolean variables, exactly one of which is true in any given output tuple; partition is instantiated to the corresponding partition of theatre objects. In our example, we obtain two result tuples. One of them has low equal to true and avgNum equal to the average number of movies shown at low-priced theatres. The second tuple has high equal to true and avgNum equal to the average number of movies shown at high-priced theatres. The following query demonstrates OQL support for collections other than set and multiset:
(SELECT T.theaterName
FROM Theaters T
ORDER BY T.ticketPrice DESC) [0:4]
The ORDER BY clause produces a list of theatre names sorted in descending order of ticket price. The elements of a list can be referred to by position, starting with position 0. Thus, the expression [0:4] extracts a list of the names of the five theatres with the highest ticket prices.

6.6 LAB EXERCISE

The university database contains information about university students, courses, the semester in which
a student attended a specific course (as well as his internal and external marks if he passed), and the
degree programme in which each student is enrolled. The database is far from being suited for a huge
academic institution, but it does demonstrate intriguing relationships that can be queried, and it’s easy
to relate to while learning SQL. The requirements are explained next.

Take a look at the following list of requirements:

 One or more events are available at the university.
 One or more departments make up a programme.
 A student is required to enroll in an event.
 A student enrolls in the courses required for her events.
 An event has a name, identification, a total credit score requirement, and a beginning year.
 A course has a name, a unique identification, a credit point value, and a start year.
 Students have one or more given names, a surname, a student identification number, a DOB, and the year they first enrolled.
 When a student enrolls in a department, the year and semester in which he took it are noted. When
he completes the course, he receives marks (such as 60%).
 A year (for example, year 1) and a semester are assigned to each department in an event.


The ER diagram of the university database is shown in Figure 1:

[Figure content: an ER diagram with entities Student, Events, and Department; attributes include names, enrolment numbers, years, semesters, DOB, credit score, and internal and external marks; relationships include Enrollno and Attempts.]

Figure 1: University Database

The ER diagram above reflects the following points (a sketch of possible CREATE TABLE statements follows these points):

 Student is a strong entity with a unique identifier, student id, which was designed to be the primary key for distinguishing students (remember, we could have several students with the same name).
 The identifier event id is used as the primary key to distinguish amongst events, making Events a strong entity as well.
 Because each student must be enrolled in an event, the Student entity participates totally in the many-to-one Enrollno relationship with the Events entity. An event can exist without any students enrolled in it; hence its participation in this relationship is only partial.
 A department is a weak entity with id as a partial (weak) key, since it has meaning only in the context of an event. This means that a department can be identified only by its own id together with the id of the event that owns it.
 Department, as a weak entity, participates totally in the many-to-one identifying relationship with its owner event. The Year and Semester attributes record its position in the sequence.


 The many-to-many Attempts relationship links a student to a department; a department can exist without a student, and a student can enrol without attempting any department, so participation is not total on either side.
 Attributes capture the Year and Semester, as well as the Marks, when a student attempts a course.
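As a starting point for the exercise, one possible relational rendering of this design in SQL is sketched below (the table and column names are illustrative choices derived from the requirements above, not a prescribed solution):

CREATE TABLE STUDENT (
    student_id          INTEGER PRIMARY KEY,
    given_names         VARCHAR(60),
    surname             VARCHAR(30),
    dob                 DATE,
    first_enrolled_year INTEGER
);

CREATE TABLE EVENT (
    event_id     INTEGER PRIMARY KEY,
    event_name   VARCHAR(60),
    credit_score INTEGER,
    start_year   INTEGER
);

-- Department is a weak entity identified by the event that owns it
CREATE TABLE DEPARTMENT (
    event_id   INTEGER REFERENCES EVENT(event_id),
    dept_id    INTEGER,
    study_year INTEGER,
    semester   INTEGER,
    PRIMARY KEY (event_id, dept_id)
);

-- The many-to-many Attempts relationship, with year, semester, and marks
CREATE TABLE ATTEMPTS (
    student_id     INTEGER REFERENCES STUDENT(student_id),
    event_id       INTEGER,
    dept_id        INTEGER,
    study_year     INTEGER,
    semester       INTEGER,
    internal_marks NUMERIC(5,2),
    external_marks NUMERIC(5,2),
    PRIMARY KEY (student_id, event_id, dept_id, study_year, semester),
    FOREIGN KEY (event_id, dept_id) REFERENCES DEPARTMENT(event_id, dept_id)
);

A further ENROLLNO table linking STUDENT and EVENT (for example, with columns student_id, event_id, enrol_year, and semester) would capture the enrolment of students in events described by the total-participation constraint above.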

6.7 CONCLUSION

 A database management system (DBMS) that combines a relational database (RDBMS) and an
object-oriented database (OODBMS) is known as an object-relational database (ORD).
 A data model based on object-oriented programming that associates methods (procedures) with
objects that can benefit from class hierarchies is known as an object data model.
 Data and their interactions are contained in a single structure in an Object Oriented Data Model,
which is referred to as an object.
 Encapsulation is performed by concealing information or data in order to minimize complexity and
promote reusability.
 The ODL is independent of any programming language and is designed to support the semantic
constructs of the ODMG object model.
 The Data Definition Language (DDL) allows users to define data types and interfaces.
 Data Manipulation Language (DML) allows users to create, remove, read, and update instances of
the data types.
 The query language proposed for the ODMG object model is the object query language OQL.

6.8 GLOSSARY

 Object-Relational Database (ORD): A database management system (DBMS) that combines a relational database (RDBMS) and an object-oriented database (OODBMS)
 Object Data Model: A data model based on object-oriented programming that associates methods
(procedures) with objects that can benefit from class hierarchies.
 Object: Data and their interactions are contained in a single structure in an Object Oriented Data
Model, which is referred to as an object.
 Encapsulation: Concealing information or data in order to minimise complexity and promote reusability
 Object Definition Language (ODL): To support the semantic constructs of the ODMG object model.
 Data Definition Language (DDL): To define data types and interfaces.
 Data Manipulation Language (DML): To create, remove, read, and update instances of the data
types.
 Object Query Language (OQL): The query language proposed for the ODMG object model.


6.9 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions

1. Describe the concept of object relational database.
2. Data and their interactions are contained in a single structure in an object oriented data model.
Explain the significance of object oriented data models.
3. ODL is not a complete programming language. Discuss
4. The query language proposed for the ODMG object model is the object query language OQL. Evaluate
the concept of Object Query Language (OQL).
5. Explain the importance of object database extensions to SQL.

6.10 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions

1. A database management system (DBMS) that combines a relational database (RDBMS) and an
object-oriented database (OODBMS) is known as an object-relational database (ORD). Refer to
Section Introduction
2. A data model based on object-oriented programming that associates methods (procedures) with
objects that can benefit from class hierarchies is known as an object data model. Refer to Section
Object Oriented Models
3. The ODL is independent of any programming language and is designed to support the semantic
constructs of the ODMG object model. Refer to Section Object Oriented Models
4. It is intended to interact closely with programming languages that have an ODMG binding, such
as C++, Smalltalk, and Java. As a result, an OQL query embedded in one of these programming
languages can return objects that match that language’s type system. Refer to Section Object
Oriented Models
5. The object-relational extensions to SQL include row types, collection types, User-Defined Types (UDTs), typed tables, and reference types. Refer to Section Object Database Extensions to SQL


6.11 POST-UNIT READING MATERIAL

 https://asolanki.co.in/wp-content/uploads/2019/02/Fundamentals_of_Database_Systems_6th_
Edition-1.pdf
 https://vivadifferences.com/12-differences-between-object-oriented-database-and-object-
relational-database-plus-similarities/

6.12 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates the concept of Object Relational Database (ORD). Also,
discuss about the different types of object data models.

UNIT

07 Query Processing and Optimization

Names of Sub-Units

Introduction, Measures of Query Cost Selection Operation, Query Optimization Overview, Transformation of Relational Expressions, Estimating Statistics of Expression Results, Choice of Evaluation Plans

Overview
This unit begins by discussing the concept of query processing and optimization. Next, the unit discusses the measures of query cost, the selection operation, and an overview of query optimization. Further, the unit explains the transformation of relational expressions and the estimation of statistics of expression results. Towards the end, the unit discusses the choice of evaluation plans.

Learning Objectives

In this unit, you will learn to:

 Discuss the concept of query processing and optimization
 Explain the concept of query cost selection operation and query optimization overview
 Describe the transformation of relational expressions
 Explain the significance of estimating statistics of expression results
 Discuss the choice of evaluation plans

Learning Outcomes

At the end of this unit, you would:

 Evaluate the concept of query processing and optimization
 Assess the concept of query cost selection operation and query optimization overview
 Evaluate the importance of transformation of relational expressions
 Determine the significance of estimating statistics of expression results
 Explore the choice of evaluation plans

Pre-Unit Preparatory Material

 https://clas.iusb.edu/computer-science-informatics/research/reports/TR-20080105-1.pdf

7.1 INTRODUCTION
Query processing and optimization are essential components of every DBMS. To be used efficiently,
query results must be accessible in the timescale required by the submitting user, which might be a
human, a robotic assembly machine, or even another unique and independent DBMS.

The following are the steps involved in query processing and optimization:
 The query execution plan, also known as the query evaluation plan, is a set of primitive operations that can be used to evaluate a query.
 The query processing engine accepts a query evaluation plan, runs it and outputs the desired results. Different execution plans for a given query can have very different costs, depending, for example, on the number of disk accesses they require. It is the system's responsibility to construct a query evaluation plan that minimises the query evaluation cost; query optimization is the term for this process.
 A query expressed in a high-level query language is first scanned, parsed and validated. The scanner detects the SQL keywords, attribute names and relation names in the text of the query.
 The parser examines the syntax to check that the query is written in accordance with the query language's syntactic rules.
 The query is then validated by checking that all attribute and relation names are valid and semantically meaningful names in the database's schema.
 The query is next represented internally as a tree or a graph, referred to as the query tree or query graph.
 A query tree is the internal representation of a SQL query when it is translated into relational algebra; when tuple or domain relational calculus (TRC or DRC) is used, the internal representation is a query graph. A query may have many different execution strategies, and query optimization is the process of selecting the most cost-effective one.


Figure 1 depicts the query processing and optimization:

Figure 1: Query Processing and Optimization (a query in a high-level language is passed to the parser and translator, which produce an internal representation of the query; the query optimizer, consulting the DBMS catalog, produces an execution plan; the query evaluation engine executes the plan against the data and returns the result of the query)
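As an illustration of these stages, the following short Python sketch uses the standard sqlite3 module. The Employee table and the index name idx_emp_id are hypothetical examples introduced only for this sketch; SQLite's EXPLAIN QUERY PLAN statement is used to display the execution plan its optimizer chooses before and after an index is created:

import sqlite3

# In-memory database; the Employee table and index are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (Id INTEGER, Name TEXT, Designation TEXT)")
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(i, "Emp" + str(i), "Staff") for i in range(1, 1001)])

query = "SELECT Id, Name, Designation FROM Employee WHERE Id = 2"

# Plan before any index exists: the optimizer can only scan the whole table.
for row in cur.execute("EXPLAIN QUERY PLAN " + query):
    print("without index:", row)

# After an index is created, the optimizer switches to an index search.
cur.execute("CREATE INDEX idx_emp_id ON Employee(Id)")
for row in cur.execute("EXPLAIN QUERY PLAN " + query):
    print("with index:   ", row)

conn.close()

The detail column of each returned row indicates whether the optimizer chose a full table scan or an index search, which is exactly the kind of decision analysed in the rest of this unit.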

7.2 MEASURES OF QUERY COST SELECTION OPERATION


Query cost is an estimate that the optimizer calculates of how much time your query will take (relative to the overall batch time). The optimizer then tries to choose the best query plan by looking at your query and data statistics, experimenting with a few execution strategies, and selecting the most cost-effective of them.

The cost of a query in a database management system can be calculated by establishing a framework that generates multiple candidate plans for the query and analyses every candidate in terms of its estimated cost. The cost of each operation within a plan is estimated deterministically, and these costs are aggregated to obtain the net estimated cost of the query evaluation plan.

Selection Operation

In most cases, the file scan performs the selection operation. The search algorithms that are utilised to
locate and retrieve data are known as file scans. In query processing, it is the lowest-level operator.


Selection Operation Using File Scans and Indices

The file scan in an RDBMS (relational database management system) reads a relation only if the entire relation is stored in one file. When performing a selection operation on a relation whose tuples are contained in a single file, the following algorithm is used:
 Linear search: In a linear search, the system checks each record to see whether it meets the specified criteria. An initial seek is required to access the first block of the file; if the blocks of the file are not stored contiguously, additional seeks are required. Although linear search is the slowest search method, it can be employed in any situation: the nature of the selection, the presence of indices and the ordering of the file are irrelevant to this procedure, whereas the other algorithms are not applicable in all situations. A minimal sketch of a linear scan is shown below.

Selection Operation using Indexes

Search algorithms that use an index are referred to as index scans. Such index structures are called access paths; they provide a path through which the data in the file can be located and accessed. The index is used in query processing by the following algorithms (an equality lookup through a simple in-memory index is sketched after this list):
 Primary index, equality on a key: We use the index to retrieve the single record that satisfies the equality condition. The equality comparison is performed on the key attribute that carries the primary index.
 Primary index, equality on a non-key: The difference between equality on a key and equality on a non-key is that in this case we may retrieve multiple records. When the selection condition specifies an equality comparison on a non-key attribute, several records can be fetched using the primary index.
 Secondary index, equality on a key or non-key: A secondary index can be used for a selection that specifies an equality condition. Using the secondary index strategy, we obtain a single record when the equality is on a key, and several records when the equality is on a non-key. The time cost of retrieving a single record is equal to that of the primary index. In the case of multiple records, the records may be stored in different blocks, so each fetched record requires one I/O operation, which includes a seek and a block transfer.
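The sketch below contrasts the same equality selection performed through two simple in-memory stand-ins for access paths: a dictionary acting as a primary (unique) index and a dictionary of lists acting as a secondary index on a non-key attribute. The records and attribute names are illustrative only:

from collections import defaultdict

records = [
    {"Id": 1, "Name": "Asha",  "DeptId": 10},
    {"Id": 2, "Name": "Ravi",  "DeptId": 20},
    {"Id": 3, "Name": "Meena", "DeptId": 10},
]

# Primary index on the key attribute Id: equality on a key returns at most one record.
primary_index = {r["Id"]: r for r in records}

# Secondary index on the non-key attribute DeptId: equality may return several records.
secondary_index = defaultdict(list)
for r in records:
    secondary_index[r["DeptId"]].append(r)

print(primary_index.get(2))    # the single record with Id = 2
print(secondary_index[10])     # all records with DeptId = 10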

7.2.1 Sorting
Sorting is the method of arranging records in ascending or descending order on one or more columns. It is important because some queries require sorted records to be returned, and some operations, such as joins, can be implemented more efficiently when their inputs are sorted. By default, data is sorted on the primary key column, but we can also specify that the data be sorted on other columns if necessary. There are primarily two sorting scenarios, depending on the size of the relation.

The sorting method is used in query processing to perform various relational operations, such as joins, efficiently, since these operations require sorted input. We can sort a relation by building an index on the sort key and then using that index to read the relation in sorted order. However, an index orders the relation only logically rather than physically. As a result, sorting is handled separately for the following cases:
Case 1: Relations that are small or medium sized, i.e., smaller than the primary memory.
Case 2: Relations whose size is larger than the available memory.


In Case 1, the small or medium-sized relations do not exceed the primary memory, so they can be held entirely in memory. As a result, we may use standard in-memory sorting methods such as quicksort, merge sort and so on.

The conventional algorithms do not work effectively in Case 2. For relations whose size exceeds the memory limit, we employ the external sort-merge technique.

Sorting of relations that do not fit in memory because their size is greater than the memory size is known as external sorting, and the external sort-merge algorithm is the most appropriate option for it.

With the help of an example, Figure 2 shows how the external sort-merge algorithm works:

Figure 2: External Sorting using Sort-Merge (the initial relation is first cut into sorted runs, which are then combined in merge pass 1 and merge pass 2 to produce the sorted output)

Assume we’re executing an external sort-merge on a relation R. Suppose that each block can only carry
one tuple and that the memory can only hold three blocks. As a result, it will use two blocks as input and
one block as output while performing Stage 2, i.e. the merge stage.
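A compact Python sketch of the external sort-merge idea, assuming main memory can hold only a fixed number of records at a time. Sorted runs are written to temporary files and then merged with heapq.merge, which reads each run sequentially; the data, run sizes and file handling are illustrative assumptions:

import heapq
import tempfile

def create_runs(records, memory_size):
    # Stage 1: cut the input into memory-sized chunks, sort each chunk, write it out as a run.
    runs = []
    for start in range(0, len(records), memory_size):
        chunk = sorted(records[start:start + memory_size])
        run = tempfile.TemporaryFile(mode="w+t")
        run.writelines(str(value) + "\n" for value in chunk)
        run.seek(0)
        runs.append(run)
    return runs

def merge_runs(runs):
    # Stage 2: merge the sorted runs; heapq.merge keeps only one value per run in memory.
    streams = [(int(line) for line in run) for run in runs]
    return list(heapq.merge(*streams))

records = [24, 19, 31, 33, 14, 16, 21, 3, 7, 2]
runs = create_runs(records, memory_size=3)   # memory is assumed to hold 3 records at a time
print(merge_runs(runs))                      # the fully sorted output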

7.2.2 Join Operation


A join is a SQL operation that creates a relationship between two or more database tables by establishing
a connection between them based on matching columns. In a SQL database management system, join
commands are used in the majority of complicated queries.


The various types of join operations are as follows:


 Nested-loop join: In the nested-loop join algorithm, each tuple of the outer relation is compared with all the tuples of the inner relation, and only then is the next tuple of the outer relation examined. The result of the join includes all pairs of tuples that satisfy the join condition. (A sketch of the nested-loop and hash join algorithms in code is given at the end of this list.) Nested-loop join is further classified into two types:
  Block nested-loop join: All the tuples in a block of the outer relation are compared with all the tuples of the inner relation, and only then is the next block of the outer relation examined. The result of the join includes all pairs of tuples that satisfy the join condition.
  Indexed nested-loop join: Index lookups can be used in place of file scans if the join is an equi-join or natural join and an index is available on the join attribute of the inner relation; an index can even be created solely for the purpose of computing the join. For each tuple t_r in the outer relation r, the index is used to look up the tuples in s that satisfy the join condition with t_r. In the worst case, the buffer has room for only one page of r and an index lookup on s is performed for each tuple of r, so the cost is b_r + n_r × c block accesses, where c is the cost of traversing the index and retrieving all matching s tuples for a single tuple of r. Using the join condition, c may be approximated as the cost of a single selection on s. If indices are available on the join attributes of both r and s, use the relation with the fewer tuples as the outer relation.
 Merge-join: A merge join sorts both relations on their join attribute (if they are not already sorted on the join attributes) and then merges the sorted relations to join them. The merge step is analogous to the merge stage of the sort-merge algorithm; the main difference is the handling of duplicate values of the join attribute, since every pair of tuples with the same value on the join attribute must be matched. Figure 3 depicts a merge join:

Figure 3: Merge Join (two relations, one with attributes a1 and a2 and the other with attributes a1 and a3, are sorted on the join attribute a1 and scanned with pointers pr and ps, matching tuples with equal values of a1)


 Hash-join: Useful for equi-joins and natural joins. A hash function h is used to partition the tuples of both relations. h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
r_0, r_1, ..., r_n denote the partitions of the r tuples; each tuple t_r ∈ r is put in partition r_i, where i = h(t_r[JoinAttrs]).
s_0, s_1, ..., s_n denote the partitions of the s tuples; each tuple t_s ∈ s is put in partition s_i, where i = h(t_s[JoinAttrs]).
Note: In the textbook, r_i is denoted as H_ri, s_i is denoted as H_si and n is denoted as n_h. (The partitioning step is included in the code sketch at the end of this list.)
Figure 4 depicts hash join:

Figure 4: Hash Join (relation r is divided into partitions 0–4 of r and relation s into the corresponding partitions 0–4 of s using the same hash function)


The r tuples in r_i need only be compared with the s tuples in s_i; it is not necessary to compare r tuples with s tuples in any other partition, since an r tuple and an s tuple that satisfy the join condition have the same value for the join attributes. If that value is hashed to some value i, the r tuple must be in r_i and the s tuple in s_i.
 Complex joins: In SQL, a complex join is also known as an outer join. It is not always more difficult than an inner join; the term "complex" refers to the fact that SQL performs an inner join as well as acquiring additional data from one or more tables.
Join with a conjunctive condition: r ⋈_{θ1 ∧ θ2 ∧ ... ∧ θn} s. Either use nested loops/block nested loops, or compute the result of one of the simpler joins r ⋈_{θi} s; the final result is made up of those tuples in the intermediate result that satisfy the remaining conditions θ1 ∧ ... ∧ θi−1 ∧ θi+1 ∧ ... ∧ θn.
Join with a disjunctive condition: r ⋈_{θ1 ∨ θ2 ∨ ... ∨ θn} s. Either use nested loops/block nested loops, or compute the result as the union of the records in the individual joins r ⋈_{θi} s: (r ⋈_{θ1} s) ∪ (r ⋈_{θ2} s) ∪ ... ∪ (r ⋈_{θn} s).
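As promised in the list above, here is a small sketch of two of these join algorithms applied to in-memory lists of tuples. The relations, the join-attribute positions and the hash function (Python's built-in hash reduced modulo the number of partitions) are illustrative assumptions:

def nested_loop_join(r, s, r_attr, s_attr):
    # Compare every tuple of the outer relation r with every tuple of the inner relation s.
    return [tr + ts for tr in r for ts in s if tr[r_attr] == ts[s_attr]]

def hash_join(r, s, r_attr, s_attr, n=4):
    # Partition both relations with the same hash function h, then join matching partition pairs.
    h = lambda value: hash(value) % n
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for tr in r:
        r_parts[h(tr[r_attr])].append(tr)    # tuple t_r goes to partition r_i, i = h(t_r[JoinAttrs])
    for ts in s:
        s_parts[h(ts[s_attr])].append(ts)    # tuple t_s goes to partition s_i
    result = []
    for i in range(n):                       # r tuples in r_i are compared only with s tuples in s_i
        result += nested_loop_join(r_parts[i], s_parts[i], r_attr, s_attr)
    return result

# emp(id, dept_id) joined with dept(dept_id, name) on dept_id
emp = [(1, 10), (2, 20), (3, 10)]
dept = [(10, "Sales"), (20, "HR")]
print(nested_loop_join(emp, dept, 1, 0))
print(hash_join(emp, dept, 1, 0))

Both calls produce the same joined tuples; the hash join simply avoids comparing tuples that fall into different partitions.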

7.2.3 Other Operations and Evaluation of Expressions


Duplicate removal can be accomplished using sorting or hashing. When the relation is sorted, duplicates appear next to each other, and all but one of each set of duplicates can be eliminated; in external sort-merge, duplicates can be removed during run creation and at the intermediate merge phases. With hashing, identical tuples hash to the same bucket and are therefore grouped together. Projection is accomplished by projecting each tuple and then removing duplicates.

Aggregation may be implemented in the same way as duplicate removal: sorting or hashing is used to gather the tuples of the same group together, and the aggregate functions are then applied to each group. As an optimization, partial aggregate values can be computed to combine tuples of the same group during run generation and the intermediate merges.

Set operations (∪, ∩ and −) can employ either a variant of merge join after sorting or a variant of hash join. An outer join can be computed either as a join followed by the addition of null-padded non-participating tuples, or by modifying the join algorithms.
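A brief sketch of the hashing approach to duplicate elimination and grouped aggregation described above; a Python dictionary plays the role of the hash table, and the sample tuples are illustrative:

from collections import defaultdict

rows = [("Sales", 100), ("HR", 80), ("Sales", 100), ("Sales", 120)]

# Duplicate elimination: identical tuples hash to the same bucket, so one copy per bucket is kept.
distinct = list(dict.fromkeys(rows))
print(distinct)              # [('Sales', 100), ('HR', 80), ('Sales', 120)]

# Aggregation (SUM per group): tuples of the same group are collected together, then aggregated.
totals = defaultdict(int)
for dept, amount in rows:
    totals[dept] += amount   # a partial aggregate is maintained for each group
print(dict(totals))          # {'Sales': 320, 'HR': 80}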

7.3 QUERY OPTIMIZATION OVERVIEW


The process of selecting an effective execution plan for processing a query is known as query optimization.

The parsed query is then submitted to the query optimizer, which generates various execution plans, analyses them and chooses the plan with the lowest estimated cost.

The catalogue manager assists the optimizer in determining the optimum plan for executing queries and in calculating the cost of each query plan.

Query optimization is a technique for quickly accessing a database. It’s the art of getting the information
you want in a predictable, dependable, and timely manner. Query optimization is defined as the
process of changing a query into an equivalent form that can be assessed more quickly. The goal of
query optimization is to find an implementation plan that cuts down on the amount of time it takes to
process a query. We must complete two key tasks in order to attain this optimization target. The first is
to determine the optimal plan, and the second is to reduce the amount of time spent running the query
plan.

The following points describe the importance of query optimization:


 Query optimization allows for more efficient query processing.
 It has a lower cost per query.
 The database is less stressed as a result.
 It ensures the system’s high performance.
 It takes up less memory.


7.3.1 Types of Query Optimization


Query optimization is classified into two types, as follows:
 Cost-based optimization
 Heuristic-based optimization

Cost-Based Optimization

The optimizer assigns a numerical cost to each step of every feasible plan for a given query and environment, and then combines these values to obtain a cost estimate for each plan or possible strategy. After evaluating the costs of all feasible plans, the optimizer aims to find the plan with the lowest cost estimate; for this reason, it is frequently referred to as a cost-based optimizer. A small sketch of this plan-selection idea follows the list below.

Some of the features of cost-based optimization are as follows:


 The cost of the query to be optimised is the basis for cost-based optimization.
 Based on the value of indexes, accessible sorting algorithms, constraints, and other factors, the
query can take a variety of paths.
 The goal of query optimization is to select the most effective way of implementing the query at the lowest feasible cost.
 To choose the best algorithm for a given operation, the query optimizer must estimate the cost of executing each algorithm.
 The cardinality of the input also affects the cost of an algorithm.
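A toy sketch of the plan-selection idea mentioned above: each candidate plan is given a numeric cost estimate built from simple statistics, and the optimizer picks the minimum. The statistics (number of blocks b_r, number of records n_r, index traversal cost c) and the cost formulas are illustrative assumptions, not the formulas of any particular DBMS:

# Assumed statistics for a relation r and an equality selection on an indexed key.
stats = {"b_r": 400,    # blocks occupied by r
         "n_r": 10000,  # records in r
         "c": 4}        # assumed cost (block accesses) of one index traversal

candidate_plans = {
    "linear scan":        lambda s: s["b_r"],   # read every block of the relation
    "primary index scan": lambda s: s["c"],     # traverse the index and fetch one record
}

costs = {name: cost(stats) for name, cost in candidate_plans.items()}
best = min(costs, key=costs.get)
print(costs)                  # {'linear scan': 400, 'primary index scan': 4}
print("chosen plan:", best)   # the plan with the lowest estimated cost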

Heuristic-Based Optimization

Heuristic-based optimization is also known as rule-based optimization. It is based on equivalence rules for relational expressions, which reduce the number of query combinations that have to be considered and thereby reduce the query's cost.

Heuristic-based optimization applies rules to the components of a query to build an evaluation plan that, in theory, will result in a shorter execution time. These components are often the data items within an internal data structure, such as a query tree, that the query parser has created from a higher-level representation of the query (i.e., SQL).

A query tree's internal nodes represent the relational algebra operations to be performed on the relation(s) supplied to them from the child node(s) directly below, and the tree's leaves represent the relation(s) themselves. The tree is evaluated from the bottom up, yielding a distinct evaluation strategy. A query's tree may be built in a variety of equivalent ways, and in many cases at least one of these equivalent trees will generate a faster, "optimized" execution plan.
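A minimal sketch of one such heuristic rule, pushing a selection below a join in a tiny query-tree representation. The tree is modelled as nested tuples and only this single rewrite rule is implemented; everything here is an illustrative assumption rather than the internals of a real optimizer:

# A query tree as nested tuples:
#   ("select", condition, child)   -- a relational selection
#   ("join", left, right)          -- a join of two subtrees
#   ("relation", name)             -- a leaf relation
tree = ("select", "EMP.dept_id = 10",
        ("join", ("relation", "EMP"), ("relation", "DEPT")))

def push_selection_down(node):
    # Heuristic rule: apply a selection that mentions only one relation before the join.
    if node[0] == "select" and node[2][0] == "join":
        condition, (_, left, right) = node[1], node[2]
        if condition.startswith(left[1] + "."):      # the condition refers only to the left input
            return ("join", ("select", condition, left), right)
    return node

print(push_selection_down(tree))
# ('join', ('select', 'EMP.dept_id = 10', ('relation', 'EMP')), ('relation', 'DEPT'))

Applying the selection before the join shrinks the input of the join, which is usually the most expensive operation in the tree.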

Semantic-based query optimization is another method of query optimization. Many times, the data
inside and between relations contain “rules” and patterns based on “real-world” events that the DBMS
is unaware of. As an example, vehicles like the DeLorean were not made after 1990, so a query like
“Retrieve all vehicles with make equal to DeLorean and year > 2000” will produce zero records. Injecting
these types of semantic rules into a DBMS can thus further enhance a query’s execution time.


7.4 TRANSFORMATION OF RELATIONAL EXPRESSIONS


When a query is given to the database, it is the database's responsibility to determine an algorithm to evaluate it. The DBMS must choose a low-cost algorithm and the best path for evaluating the query. The query might be simple or complex; in either case, the DBMS must choose the better execution path based on cost, and this process is referred to as query optimization.

The number of entries in the database, the number of blocks assigned to each table, the number of
records in each block, the size of the record, duplicate records, the height of the B+ tree, constraints, and
indexes, and other parameters all have an impact on query speed. Let’s look at how to choose a better-
performing query. When a SQL query is submitted to the database, it can be assessed in several ways.
Consider the following example:
SELECT * FROM Employee;
SELECT Id, Name, Designation
FROM Employee
WHERE ID = 2
The second query above pulls the Id, Name and Designation from the Employee table for Id = 2. When it is passed to the DBMS, the DBMS breaks the query into tokens and examines how they might be reassembled to improve performance; the query optimizer is in charge of this. Rearranging the operations of the query must have no effect on the outcome: in every case the result should be the same, and the order of the records in the result is not fixed and is not significant. Queries related in this way are known as equivalent queries. A set of rules must be followed when rearranging the operations of a query; these are known as equivalence rules.
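To see that reordering the operations leaves the result unchanged, the sketch below evaluates two equivalent orderings of the selection and projection on a small in-memory Employee relation (the data is illustrative; sets are used in the comparison because the order of the records in the result is not significant):

employees = [
    {"Id": 1, "Name": "Asha", "Designation": "Clerk"},
    {"Id": 2, "Name": "Ravi", "Designation": "Manager"},
]

select = lambda rows, pred: [r for r in rows if pred(r)]
project = lambda rows, attrs: [{a: r[a] for a in attrs} for r in rows]

attrs = ["Id", "Name", "Designation"]
pred = lambda r: r["Id"] == 2

# Ordering 1: select first, then project.
plan1 = project(select(employees, pred), attrs)
# Ordering 2: project first, then select (valid here because Id is kept by the projection).
plan2 = select(project(employees, attrs), pred)

# Both orderings yield the same set of records, possibly in a different order.
assert {tuple(sorted(r.items())) for r in plan1} == {tuple(sorted(r.items())) for r in plan2}
print(plan1)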

7.5 ESTIMATING STATISTICS OF EXPRESSION RESULTS


The optimizer examines numerous statistics about the tables, held in the data dictionary, to identify the best strategy for evaluating the query. This information is gathered about tables when they are created and when various DDL/DML operations are performed on them. The optimizer looks in the data dictionary for the following values (a small numeric sketch follows this list):
 The total number of records in a table, n_r. This helps in determining which table should be visited first; smaller tables are often processed first in order to minimise the size of the intermediate results, so it is one of the critical elements to consider.
 The number of records in each block, f_r. This is the blocking factor, which is needed to assess whether or not the table fits in memory.
 The total number of blocks assigned to a table, b_r. This is also a crucial consideration when determining the number of records that can be stored in each block. If we have 100 records in a table and a total of 20 blocks, we may compute f_r as n_r/b_r = 100/20 = 5.
 The total length of the table's records, l_r. This is an essential consideration when the record length varies greatly between the tables in the query. There is no noticeable effect if the record length is fixed; however, when variable-length records are involved, either the average length or the actual length must be used, depending on the kind of operation.


 The number of distinct values of a column A, d_Ar. This is useful when a query uses a projection or an aggregation: it provides an estimate of the number of values produced by a projection on the column and, when an aggregation operation such as SUM, MAX, MIN or COUNT is performed, it can be used to determine the number of record groups.
 The number of index levels, x. This indicates whether a single-level index, such as a primary or secondary key index, or a multi-level index, such as a B+ tree index, is used. The number of index levels tells us how many block accesses are needed to reach the data.
 A column's selection cardinality, s_A. This is the average number of records in the table that share the same value of the column A, computed as n_r/d_Ar. Assume the EMP table contains 500 rows and DEPT_ID has five distinct values; then the selection cardinality of DEPT_ID in the EMP table is 500/5 = 100, which indicates that each department has 100 employees on average. This is useful in estimating the number of records that will meet a selection criterion.
 There are several more elements to consider, such as the index type, data file type, sorting order, sorting type and so on.
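The small numeric sketch promised above recomputes the examples from this list: the blocking factor f_r and the selection cardinality of DEPT_ID in an assumed EMP table. All figures are the illustrative ones used in the text:

# Example 1 (from the list above): 100 records stored in 20 blocks.
n_r, b_r = 100, 20
f_r = n_r // b_r                 # blocking factor: records per block
print("f_r =", f_r)              # 100 / 20 = 5

# Example 2 (from the list above): EMP has 500 rows and 5 distinct DEPT_ID values.
n_emp, d_dept = 500, 5
s_dept = n_emp // d_dept         # selection cardinality of DEPT_ID
print("s_DEPT_ID =", s_dept)     # 500 / 5 = 100 employees per department on average

# Such estimates feed directly into the cost formulas: an equality selection on DEPT_ID
# is expected to return about s_dept records.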

7.6 CHOICE OF EVALUATION PLANS


The query optimizer calculates the costs using the various methods we have covered so far. It consults the data dictionary for the operator, join type, indexes, number of records, selectivity of records, distinct values and so on, and after collecting all of this data it selects the optimum evaluation plan.

So far, we’ve seen how queries are parsed and navigated, how they’re evaluated using various ways,
and what the costs are when different methods are utilized. The key part of analyzing a query now is
determining which evaluation plan to use so that it can be traversed effectively. It compiles all statistics,
costs, access/evaluation pathways, relationship trees, and so on. It then analyses them and selects the
most appropriate assessment path.


7.7 LAB EXERCISE


Implement Query Optimizer with Relational Algebraic expression construction and execution plan
generation for choosing an efficient execution strategy for processing the given queries. (Use Employee
Database with relevant Tables).

The following SQL program is used to implement a query optimizer with relational algebraic expression construction and execution plan generation, choosing an efficient execution strategy for processing the given queries:


BEGIN TRANSACTION;
CREATE TABLE Employee(Id integer, Name VARCHAR, Designation VARCHAR,
Salary Integer);
INSERT INTO Employee VALUES(1,'Babita', 'Publisher', 80000);
INSERT INTO Employee VALUES(2,'Sumit', 'Sales Manager', 50000);
INSERT INTO Employee VALUES(3,'Rahul', 'Account Manager', 90000);
INSERT INTO Employee VALUES(4,'Rakesh', 'Book Editor', 70000);
INSERT INTO Employee VALUES(5,'Neeraj', 'Content Writer', 60000);
SELECT * FROM Employee;
SELECT Name
FROM Employee
WHERE id
IN (SELECT id
FROM Employee
WHERE id = 4);

SELECT Salary FROM Employee WHERE Salary >= 70000;
COMMIT;


The possible relational algebra expressions for the last query are:
π_Salary(σ_Salary≥70000(EMPLOYEE))
σ_Salary≥70000(π_Salary(EMPLOYEE))
The output of given SQL code is as follows:
1|Babita|Publisher|80000
2|Sumit|Sales Manager|50000
3|Rahul|Account Manager|90000
4|Rakesh|Book Editor|70000
5|Neeraj|Content Writer|60000
Rakesh
80000
90000
70000

7.8 CONCLUSION

 Query cost is an estimate that the optimizer calculates of how much time your query will take (relative to the overall batch time).
 The file scan in an RDBMS (relational database management system) reads a relation only if the entire relation is stored in one file.
 In a linear search, the system checks each record to see if it meets the specified criteria.
 A join is a SQL action that creates a relationship between two or more database tables by establishing
a connection between them based on matching columns.
 A merge join sorts both relations according to their join attribute (if not already sorted on the join attributes).
 To split tuples of both relations, a hash function h is used.


 The term “complicated” refers to the fact that SQL performs an inner join as well as acquiring
additional data from one or more tables.
 The process of selecting an effective execution plan for processing a query is known as query
optimization.

7.9 GLOSSARY

 Query cost: An estimate that the optimizer calculates of how much time a query will take (relative to the overall batch time).
 File scan: A search method that reads a relation only if the entire relation is stored in one file.
 Linear search: A search method in which the system checks each record to see whether it meets the specified criteria.
 Join operation: A SQL operation that creates a relationship between two or more database tables by establishing a connection between them based on matching columns.
 Merge join: An operation that sorts both relations according to their join attribute (if they are not already sorted on the join attributes) and then merges them.
 Hash join: A join method that uses a hash function to partition the tuples of both relations before matching them.
 Complex join: A SQL operation that performs an inner join as well as acquiring additional data from one or more tables.
 Query optimization: The process of selecting an effective execution plan for processing a query.

7.10 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Describe the concept of query optimization and processing.
2. The cost of a query in a database management system can be calculated by establishing a framework
that can generate multiple designs for a query. Explain the significance of query cost.
3. Query optimization is a technique for quickly accessing a database. Discuss.
4. Evaluate the concept of cost-based optimization.
5. Sorting is the method of storing records in one or more columns in ascending or descending order.
Describe the concept of sorting in query processing.

7.11 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. Query processing and optimization are essential components of every DBMS. To be used efficiently,
query results must be accessible in the timescale required by the submitting user, which might be
a human, a robotic assembly machine, or even another unique and independent DBMS. Refer to Section Introduction
2. Query cost is an estimate that the optimizer calculates of how much time your query will take (relative to the overall batch time). Refer to Section Measures of Query Cost Selection Operation
3. The process of selecting an effective execution plan for processing a query is known as query
optimization. Refer to Section Query Optimization Overview
4. The optimizer assigns a cost in numerical form for each step of a feasible plan for a given query and
environment, and then discovers these values together to get a cost estimate for the plan or possible
strategy. Refer to Section Query Optimization Overview
5. The sorting method is used in query processing to efficiently perform various relational operations
such as joins and so on. Refer to Section Measures of Query Cost Selection Operation

7.12 POST-UNIT READING MATERIAL

 https://www.serverwatch.com/guides/sql-query-optimization/
 https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL204

7.13 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates the concept of query processing. Also, discuss the importance of query optimization.

UNIT

08 Distributed Database

Names of Sub-Units

Introduction, Distributed Databases Principles, Transparencies in Distributed Databases, Transaction Control in Distributed Database, Query Processing in Distributed Database

Overview
This unit begins by discussing the concept of a distributed database. Next, the unit discusses distributed database principles and transparencies in distributed databases. Further, the unit explains transaction control in distributed databases. Towards the end, the unit discusses query processing in distributed databases.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of distributed database
 Explain the concept of distributed database principles
 Describe the transparencies in distributed database
 Explain the significance of transaction control in distributed database
 Discuss the importance of query processing in distributed database

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of distributed database
 Assess the concept of distributed database principles
 Evaluate the importance of transparencies in distributed database
 Determine the significance of transaction control in distributed database
 Explore the importance of query processing in distributed database

Pre-Unit Preparatory Material

 https://cs.uwaterloo.ca/~tozsu/publications/distdb/distdb.pdf

8.1 INTRODUCTION
A distributed database is a collection of numerous conceptually interconnected databases that are
scattered across a network. A distributed database system is made up of loosely connected sites that
have no physical components in common. Each site’s database systems are independent of one another.
Transactions may get access to data at one or more locations. In a distributed database system, data is
kept across several locations, and each site is generally maintained by a DBMS that runs independently
of the others. DDBMS is made up of a single huge logical database that is broken into many parts.
DDBMS applications are frequently classified into two categories:
 Local applications
 Global applications

Users should be allowed to ask queries without identifying the location of the referenced relations,
copies, or fragments of the relations.

Users should be able to write transactions that access and change data at several locations in the same
way that they would write transactions over exclusively local data. The effect of a transaction between
sites, in particular, should remain atomic.

8.2 DISTRIBUTED DATABASES PRINCIPLES


Distributed database systems (DBS) are database architectures in which data and related functions
that operate on that data are distributed among numerous computers linked by a network.

The distributed database management system (DDBMS) is a concept in which data and maybe certain
control components are distributed among numerous computers linked by a network. The client process
that accesses such a DDBMS may be located on the same computer as the database or on a separate
machine. This client-server approach has grown in popularity in large enterprises for both internal
business functions and Internet services such as e-commerce websites and online multiplayer games


(where millions of players around the world simultaneously connect to servers hosted by the provider
via an internet connection). These applications have created massive volumes of data that must be
stored someplace, accessed quickly, and modified in a variety of ways. Distributed databases provide
capabilities that centralized databases do not. They not only aid in the resolution of today’s data storage
issues, but they may also alter how programs operate on the internet and elsewhere.

Data are kept at one or more dedicated network locations that operate as servers and oversee the
retrieval, updating, and other management of that data under the client-server model. A file server
is one sort of server that maintains files saved on behalf of customers. A database server (DBMS) is
another type: it maintains enormous volumes of connected information such as customer records,
product descriptions, and so on. All of this information is physically kept on one or more huge mainframe
computers (known as databases). There are also clusters of specialized servers dedicated to certain tasks,
such as online search engines and e-commerce apps. To provide improved availability and faster access
to its information, database servers can be duplicated, for example, by keeping copies at geographically
distant sites.

Distributed database systems differ from centralized database management systems (DBMS) in that
they seek to preserve data consistency while updating sections of their data simultaneously. The
requirement to maintain a consistent view is the most significant element that distinguishes distributed
databases from other types of databases. This limitation imposes stringent constraints on how changes
must be handled and creates some challenging issues in concurrency control, transaction management,
and query processing. Distributed DBMSs should also offer availability by continuing to run despite
the failure or outage of one or more network locations; typically, this implies that the system should
continue to function if any of its nodes or links fail. Data can be replicated across different computers or partitioned ("sharded") among them, and the consistency of the copies can be maintained by using procedures such as two-phase commit.
Some of the fundamental principles of a distributed database system are as follows:
 A distributed system should appear to the user exactly like a non-distributed system.
 Local autonomy
 No reliance on a central site
 Location independence/transparency
 Fragmentation independence (vertical, horizontal, or both)
 Replication independence
 Hardware independence
 Operating system independence
 Network independence
 DBMS independence

8.2.1 Architectures
The distributed database system can use applications from both local and remote databases. The
architecture of a system affects its structure, and distributed database management systems can be
implemented in a homogeneous or heterogeneous manner.


The architecture will help in the comprehension of the system’s components as well as the definition of
their functions. The architecture will also specify how the various components interact with one another.

There are three types of distributed database system architectures, as follows:


 Homogenous distributed database systems
 Heterogeneous distributed database systems
 Client/server database architecture

Homogenous Distributed Database System

A homogenous distributed database system (HDDBS) is a network of two or more Oracle databases that reside on one or more machines. Figure 1 shows a distributed system with three databases: headquarters, manufacturing and sales. In a distributed environment, an application can access or change data in multiple databases at the same time. A single query from a manufacturing client on the local database mfg, for example, can return data from both the products table in the local database and the department table in the remote hq database.

A client application has no knowledge of the database’s location or platform. You can create synonyms
for distant items in a distributed system so that users are able to access them using the same syntax
as local objects. If you’re connected to database mfg but need to access data from database hq, for
example, you can run the following query by creating a synonym for the remote dept table on mfg:
SELECT * FROM dept;
In this approach, a distributed system seems to have native data access. Users of mfg are not required
to be aware that the data they access is stored in distant databases. Figure 1 depicts the homogenous
distribution database:

Figure 1: Homogenous Distributed Database (three Oracle databases — mfg.acme.com at the Manufacturing site, hq.acme.com at Headquarters and sales.acme.com at the Sales site — linked as a single distributed database)


Oracle distributed database systems can contain Oracle databases of different versions; a distributed database system can use any supported release of Oracle Database. However, applications that interact with the distributed database must be aware of the capabilities offered at each node in the system. For example, a distributed database application cannot expect an Oracle7 database to recognise SQL extensions that are accessible exclusively with Oracle Database.

Heterogeneous Distributed Database Systems

Different schemas, operating systems, DDBMS, and data models are used in a heterogeneous distributed
database.

A particular site in a heterogeneous distributed database may be completely uninformed of other sites,
resulting in limited cooperation in processing user requests. Translations are required to establish
communication across sites due to the constraint.

In a heterogeneous distributed database system, at least one database is not an Oracle database. The
application sees the heterogeneous distributed database system as a single, local Oracle Database. The
dispersion and heterogeneity of the data are hidden by the local Oracle Database server.

The Oracle Database server connects to the non-Oracle Database system through Oracle Heterogeneous
Services and an agent. If you use an Oracle Transparent Gateway to access a non-Oracle Database data
store, the agent is a system-specific program. If you include a Sybase database in an Oracle Database
distributed system, for example, you must get a Sybase-specific transparent gateway so that the Oracle
Database in the system can interact with it. Alternatively, if the non-Oracle Database system supports
the ODBC or OLE DB protocols, you can utilize a generic connection to access non - Oracle Database data
stores.

Client/Server Database Architecture

Client-server architecture is a network architecture in which several clients (remote processors) request and receive services from a centralised server (host computer). Client computers provide an
interface that allows a computer user to request server services and view the results returned by the
server. Servers wait for requests from clients to arrive before responding. Clients should not need to
be aware of the specifics of the system (i.e., the hardware and software) that is providing the service if
a server provides a standardised transparent interface. Clients are typically found at workstations or
on personal computers, while servers are typically found elsewhere on the network, on more powerful
devices. This computer model is particularly useful when clients and servers each have their own set
of responsibilities to complete on a regular basis. In hospital data processing, for example, a client
computer may run an application programme for entering patient data, while the server computer
runs another programme to administer the database where the data is permanently stored. Several
clients can simultaneously access the server’s information, and a client computer can conduct other
functions, such as sending e-mail. The client-server concept differs from the old “mainframe” model,
in which a centralised mainframe computer handled all activities for its associated “dumb” terminals,
because both client and server computers are considered intelligent entities.


Figure 2 depicts client/server database architecture:

Figure 2: Client/Server Database Architecture (several clients connected to a central server over TCP connections)

8.2.2 Design
A distributed database system’s design is a difficult task. As a result, a careful analysis of the strategies
and objectives is required. The following are some of the general strategies and objectives found in most
DBS designs:
 Data replication
 Data allocation
 Data fragmentation

Data Replication

The storing of data copies at several network locations is known as data replication. Fragment copies
can be kept in many locations, increasing data availability and reaction speed. A mutual consistency
rule applies to replicated data. To maintain data consistency throughout all replications, this rule
demands that all copies of the data pieces be identical.

Although data replication improves availability and reaction times, it might be difficult to maintain
the replications. If data is replicated across many sites, for example, the DDBMS must select which copy
to access. The nearest copy is all that is necessary to satisfy a transaction for a query operation. If the
operation is an update, however, all copies must be picked and modified in order to meet the mutual
consistency criterion.

A database can be either fully replicated, partially replicated or unreplicated as shown in Table 1:

Table 1: Types of Database Replication
 Full replication: Stores a copy of each database fragment at multiple sites. Because of the amount of overhead this places on the system, fully replicated databases can be unfeasible.
 Partial replication: Stores copies of some database fragments at multiple sites. Most DDBMSs can handle this form of replication quite effectively.
 No replication: Each database fragment is stored at a single site; there are no duplicates.

Data replication is very effective when the frequency of use of distant data is frequent and the database
is somewhat large. Another advantage of data replication is the ability to restore lost data at a specific
location.

Data Allocation

The process of determining where to store data is known as data allocation. It also entails deciding which data should be kept where. Data may be allocated in a centralised, partitioned or replicated fashion, as shown in Table 2:

Table 2: Allocation of Data
 Centralised: The complete database is kept at a single site; there is no distribution.
 Partitioned: The database is divided into fragments that are kept at several sites.
 Replicated: Copies of one or more database fragments are kept at several sites.

Data Fragmentation

Data fragmentation is the process of dividing a single object into two or more segments. A user’s
database, a system database, or a table could be the object. Over a computer network, any fragment can
be stored at any location. The distributed data catalog (DDC) stores information on data fragmentation,
which is retrieved by the TP to process user requests. At the table level, data fragmentation strategies
are based on dividing a table into logical fragments. The three types of data fragmentation are as follows (a short sketch in code follows this list):
 Horizontal fragmentation: The division of a relation into subsets (fragments) of tuples (rows) is referred to as horizontal fragmentation. Each fragment is stored at its own node and has its own set of rows, but all the rows have the same attributes (columns). In a nutshell, each fragment is equivalent to a SELECT statement with a WHERE clause on a single attribute.
 Vertical fragmentation: The division of a relation into attribute (column) subsets is referred to as
vertical fragmentation. Each subset (fragment) is stored at a separate node and it has its own set of
columns, with the exception of the key column, which is shared by all fragments. This is similar to
the PROJECT statement in SQL.
 Mixed fragmentation: A mix of horizontal and vertical fragmentation strategies is referred to as
mixed fragmentation. To put it another way, a table can be broken into numerous horizontal subsets
(rows), each with its own set of properties (columns).
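As mentioned above, here is a small sketch of horizontal and vertical fragmentation applied to an in-memory table of customer records; the fragmentation conditions, attribute lists and data are illustrative only:

customers = [
    {"cust_id": 1, "name": "Asha",  "city": "Delhi",  "balance": 1200},
    {"cust_id": 2, "name": "Ravi",  "city": "Mumbai", "balance": 300},
    {"cust_id": 3, "name": "Meena", "city": "Delhi",  "balance": 900},
]

# Horizontal fragmentation: subsets of rows, chosen by a condition (like SELECT ... WHERE).
site_delhi  = [r for r in customers if r["city"] == "Delhi"]
site_mumbai = [r for r in customers if r["city"] == "Mumbai"]

# Vertical fragmentation: subsets of columns; every fragment keeps the key column cust_id
# (like a PROJECT in relational algebra).
frag_contact = [{"cust_id": r["cust_id"], "name": r["name"], "city": r["city"]} for r in customers]
frag_account = [{"cust_id": r["cust_id"], "balance": r["balance"]} for r in customers]

print(site_delhi)      # rows stored at the Delhi site
print(frag_account)    # the account fragment, reconstructible via the shared key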

8.2.3 Implementation
A distributed database is made up of several databases that are geographically distributed throughout
the world. A distributed database management system (DBMS) manages the distributed database and
makes it appear to users as a single database.

Advanced Data Management Techniques

A distributed database is a collection of many interconnected databases that are physically scattered
across multiple places and interact over a computer network.

Features in Distributed Database Implementation

The databases in the collection are logically related to one another and frequently represent a single logical database. Data is physically stored at numerous sites, and each site's data can be maintained by a DBMS that is independent of the other sites. The processors at the sites are linked via a network; a distributed database does not require a multiprocessor architecture. A distributed database integrates transaction processing, but it is not synonymous with a transaction processing system.

8.3 TRANSPARENCIES IN DISTRIBUTED DATABASES


The feature of distributed databases that conceals the internal specifics of the distribution from users
is known as distribution transparency. The DDBMS designer has the option of fragmenting tables,
replicating the pieces, and storing them at separate locations. However, because users are unaware of
these intricacies, they regard the distributed database to be as simple to use as any centralized database.

There are four different types of transparencies in a DDBMS:


 Transaction transparency
 Performance transparency
 DBMS transparency
 Distribution transparency

Figure 3 depicts the four types of transparencies in DDBMS:

Figure 3: Transparencies in DDBMS (transaction transparency, performance transparency, DBMS transparency and distribution transparency)

Transaction Transparency

Transaction transparency ensures that all distributed transactions maintain the integrity and consistency of the distributed database. A distributed transaction accesses data stored at different locations. Another point to keep in mind is that the DDBMS is responsible for maintaining the atomicity of each sub-transaction (meaning that the entire transaction either occurs completely or not at all). Due to the use of fragmentation, allocation and replication structures in the DDBMS, this is extremely complicated.


Performance Transparency

This transparency requires the DDBMS to perform in a way comparable to a centralised database management system: even though the system is distributed, there should be no performance loss. In a DDBMS, a distributed query processor must be able to transform a data request into an ordered sequence of operations on the local databases. Another factor to consider is the database management system's fragmentation, replication and allocation structure.

DBMS Transparency

This transparency is only applicable to heterogeneous forms of DDBMS because it hides the fact that the
local DBMS may be different (databases that have different sites and use different operating systems,
products, and data models). This is one of the most difficult transparency generalisations to use.

Distribution Transparency

The user does not need to know that the data is fragmented when a DDBMS has distribution data
transparency since it lets them recognise the database as a single object or logical unit. Distribution
transparency is further divided into three types:
 Location transparency: Location transparency guarantees that any table(s) or fragment(s) of a
table may be queried as if they were stored locally on the user’s site. The end-user should be fully
unaware that the table or its parts are kept at a remote site in the distributed database system. The
remote site(s) address (es) and access procedures are entirely concealed.
To include location transparency, DDBMS should have access to an up-to-date and accurate data
dictionary, as well as a DDBMS directory that contains information about data locations.
 Fragmentation transparency: Users can query any table as if it were unfragmented thanks to
fragmentation transparency. As a result, it conceals the fact that the table on which the user is
searching is actually a fragment or union of some fragments. It also conceals the fact that the
fragments are scattered across multiple sites.
This is analogous to SQL views, where the user may be unaware that they are using a view of a table
rather than the table itself.
 Replication transparency: Replication transparency guarantees that database replication is
concealed from users. It allows users to query a table as though there is just one copy of the table.
Transparency in replication is linked to concurrency and failure transparency. When a user makes
a change to a data item, the change is mirrored in all copies of the table. However, the user should
not be aware of this procedure.
This is known as concurrency transparency. In addition, if a site fails, the user can continue with his requests using the replicated copies without being aware of the failure; this is known as failure transparency.

8.4 TRANSACTION CONTROL IN DISTRIBUTED DATABASE


Transaction is a unit of database operation. It can be a read operation, a write operation, or a read/
write operation. We can say that a transaction is a sequence of many actions grouped together and considered as an atomic unit of work. A query written in Structured Query Language (SQL) may involve more than one transaction that reads or writes data from various tables. Every day, database systems deal with such a large number of transactions that they are often referred to as transaction processing systems.

A transaction can involve four types of operations, which are described as follows:
 READ (A): Reads data from the database tables that are already stored in the memory, where A
signifies the record that is to be read.
 WRITE (A): Writes the record A that may be the result of a transaction back to the main memory or
to the disk.
 COMMIT: Makes sure that all the changes made by the transactions are made permanent.
 ROLLBACK: Terminates the transaction in case of an unsuccessful transaction and undoes any
changes made in the database.

Let’s take an example to clear the idea about how transaction takes place in case of a library management
system when a user is about to borrow a book B from the library online. Let’s assume that the account
number for the customer is A.
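The following minimal Python sketch, written with the standard sqlite3 module, plays out this library scenario under illustrative, assumed table names and columns: the book record B and the account record A are read, both are updated, and the transaction is committed only if every step succeeds; otherwise it is rolled back:

import sqlite3

# isolation_level=None disables the module's implicit transactions, so BEGIN, COMMIT and
# ROLLBACK can be issued explicitly, mirroring the operations described above.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE account (acc_no INTEGER PRIMARY KEY, books_borrowed INTEGER)")
cur.execute("CREATE TABLE book (book_id INTEGER PRIMARY KEY, copies_available INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 0)")   # customer account A
cur.execute("INSERT INTO book VALUES (7, 1)")      # book B with one copy on the shelf

try:
    cur.execute("BEGIN")                           # start the transaction
    # READ (B) and READ (A)
    copies = cur.execute("SELECT copies_available FROM book WHERE book_id = 7").fetchone()[0]
    borrowed = cur.execute("SELECT books_borrowed FROM account WHERE acc_no = 1").fetchone()[0]
    if copies < 1:
        raise ValueError("no copy of B is available")
    # WRITE (B) and WRITE (A)
    cur.execute("UPDATE book SET copies_available = ? WHERE book_id = 7", (copies - 1,))
    cur.execute("UPDATE account SET books_borrowed = ? WHERE acc_no = 1", (borrowed + 1,))
    cur.execute("COMMIT")                          # make both changes permanent together
except Exception:
    cur.execute("ROLLBACK")                        # undo every change made by the transaction

print(cur.execute("SELECT * FROM book").fetchall())
print(cur.execute("SELECT * FROM account").fetchall())
conn.close()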

ACID Property

All transactions in a database follow certain properties, which help them to remain efficient. These
properties are called Atomicity-Consistency-Isolation-Durability properties or in short ACID properties.
We will define these properties in detail as follows:
 Atomicity: Ensures that the whole transaction is performed as a single unit. We can say that a transaction should be done completely or not at all: the transaction must be performed completely; otherwise, the whole transaction should be rolled back. The concept of atomicity imposed on every transaction also helps in maintaining a log of every transaction.
 Consistency: Means if we consider the database on which a transaction is about to be performed is
in State_1, then after the successful completion of the transaction, the state of the database should
be moved to State_2. This defines the consistency property, which ensures that the transaction
commit operation always transforms a database from one consistent state to another consistent
state. When a transaction fails, this property ensures that the database should return to its previous
consistent state.
 Isolation: Refers to a very important property since it helps to execute transactions without interference or error. This property ensures that the changes made by a transaction are not visible to other transactions until the transaction is committed. Isolation guarantees that the concurrent execution of transactions on the same database and on the same table does not affect the other transactions.
For example, in the library management system, if a user A places a request to buy book B, and at
the same time, the administer places a request that changes the book name B to B1, then this change
(whether fail or commit) should not reflect to the first transaction done by user A. Only the state that
has been read by transaction 1 should determine the result of the transaction.
 Durability: Confirms that the changes made by the successful execution should be made permanent.
Thus, the changes should never be lost due to any kind of failure. We can say that transaction is

a basic unit of recovery. In a database, a log has been maintained that keeps the record of every
committed transaction.

State of Transaction

A transaction in its life cycle can have many states. When a transaction first hits a system, it starts as a
new program and waits for the CPU execution. When the CPU becomes free to execute the transaction,
its state changes from Wait to Execute. In the execution phase, a database goes to an inconsistent state
since the transaction changes the data values of the database. Here, if any internal error occurs, the
transaction may move into the Abort state; otherwise, for a successful execution of the transaction, it
moves to a consistent state known as Commit state, where all the changes made by the transaction are
made permanent. In case of any error during the execution, the transaction may go to the Rollback
state, where all the changes made by the transaction in the database are undone and the database is
moved to a consistent state that it had before the transaction. From here, the same transaction can be
executed as a new transaction back again.

Figure 4 shows the life cycle of a transaction:

Figure 4: Transaction Life Cycle (a transaction moves from the Wait state to Execute, and from Execute to Commit, Abort or Rollback)

The main difference between the Abort and Rollback states shown in Figure 4 is that in the Abort state,
no changes are made to the database at all; thus, there is no need to undo the changes, the database
remains in its consistent state. On the other hand, in the Rollback state, the changes made by the
transaction should reflect on the database and these changes should be undone to keep the database in
its consistent state. In the Abort state, the transaction cannot be executed again as a new transaction,
whereas in the Rollback state, it can be executed again as a new transaction, which means that the
transaction goes into transaction queue.

To support distributed transactions, a new component called the global transaction manager is added. The transaction's originating site may temporarily assume the role of global transaction manager, coordinating database operations with the transaction managers at the other sites. Transaction managers expose their capabilities to application programs via an interface.


Each transaction’s bookkeeping information, such as a unique identity, originating site, name, and
so on, is stored by the manager. If valid and available, it returns a local copy for READ operations. It
ensures that updates are visible across all sites that hold copies (replicas) of the data item for WRITE
operations. The manager ensures that no transaction consequences are reflected in any site of the
distributed database for ABORT procedures. It ensures that the consequences of a write are persistently
recorded on all databases having copies of the data item for COMMIT actions. The two-phase commit
technique is often used to implement atomic termination (COMMIT/ ABORT) of distributed transactions.

The database operation and its associated information are passed from the transaction manager to
the concurrency controller. The controller is in charge of acquiring and releasing the associated locks.
If the transaction requires access to a locked resource, it is postponed until the lock is acquired. Once
the lock is obtained, the operation is passed to the runtime processor, which handles the actual execution
of the database operation. When the operation is complete, the locks are released and the transaction
manager is updated with the result.

8.4.1 Timestamp-Based Protocol


The drawbacks of the 2PL protocol discussed earlier can be overcome in the timestamp-based protocol
by using timestamps. Before explaining the protocol in detail, we need to know what a timestamp is and
how it is assigned to a transaction.

Defining Timestamp

A timestamp is a unique value assigned to a transaction when it first enters the system, and it is used to
identify that transaction. We must make sure that no two transactions have the same timestamp. The
timestamp is taken as the start time of a transaction. There are different ways to assign a timestamp to
a transaction, which are as follows:
i. Whenever the first transaction enters the system, assign it the timestamp value 1 and maintain a
counter that is incremented by 1 for every new transaction; the counter value becomes the timestamp
of the next transaction. Thus, the timestamps of the subsequent transactions will be 2, 3, 4, etc.
ii. Whenever a transaction enters the system, assign the system clock time as its timestamp. In this
case, the system must ensure that no two transactions enter the system at exactly the same time.
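The two assignment schemes above can be sketched in a few lines of Python. The class names and the microsecond tie-breaker used for the clock-based scheme are illustrative assumptions; any mechanism that guarantees unique, increasing values would do.

import itertools
import time

class CounterTimestamper:
    def __init__(self):
        self._counter = itertools.count(1)   # first transaction gets timestamp 1

    def next_timestamp(self):
        return next(self._counter)

class ClockTimestamper:
    def __init__(self):
        self._last = 0.0

    def next_timestamp(self):
        ts = time.time()
        # Ensure no two transactions receive the same timestamp, even if they
        # arrive within the same clock tick.
        if ts <= self._last:
            ts = self._last + 1e-6
        self._last = ts
        return ts

counter_ts = CounterTimestamper()
print(counter_ts.next_timestamp(), counter_ts.next_timestamp())   # 1 2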

To describe the timestamp-based protocol, we denote the timestamp of a transaction T as
TIMESTAMP (T). If Ti is the first transaction to enter the system and Tj is the second, then the timestamp
of Ti must be smaller than that of Tj; that is, TIMESTAMP (Tj) > TIMESTAMP (Ti). Since timestamps are
used to maintain serializability, the protocol must order the conflicting operations so that the resulting
schedule is equivalent to a serial schedule in which the transactions execute in timestamp order.

We can think of maintaining two types of timestamps depending on the types of operation in a
transaction. The types of timestamps are as follows:
 WRITE_TIMESTAMP (A): Represents the WRITE timestamp of data item A, which is the largest
timestamp among all the transactions that have successfully written A in the database. Thus, if T is
the youngest transaction that has written A, then TIMESTAMP (T) = WRITE_TIMESTAMP (A).
 READ_TIMESTAMP (A): Represents the READ timestamp of data item A, which is the largest
timestamp among all the transactions that have successfully read A in the database. Thus, if T is the
youngest transaction that has read A, then TIMESTAMP (T) = READ_TIMESTAMP (A).

The timestamp ordering (timestamp-based) protocol can be divided into three variations: the basic
timestamp-based protocol, the strict timestamp-based protocol and Thomas’s write rule.

8.4.2 Basic Timestamp-Based Protocol


Every schedule contains conflicting read and write operations, and concurrency control protocols have
been developed to keep concurrent transactions consistent in spite of these conflicts. In the timestamp
ordering protocol, timestamps are used to resolve the conflicting read and write operations. Every
protocol defines rules that the system needs to follow. The rules for the basic timestamp-based protocol
are as follows:
1. The transaction T issues a READ (A) operation. There are two cases in this rule, which are as follows:
Case 1: If TIMESTAMP (T) < WRITE_TIMESTAMP (A), then the value of A that T wants to read has
already been overwritten by a younger transaction, i.e., one with a larger timestamp than T. Thus,
the READ operation issued by transaction T must be rejected and T must be rolled back.
Case 2: If TIMESTAMP (T) >= WRITE_TIMESTAMP (A), then execute the READ operation and set the
value of READ_TIMESTAMP (A) to the maximum of READ_TIMESTAMP (A) and TIMESTAMP (T).
2. The transaction T issues a WRITE (A) operation. In this rule also, there are two cases, which are as
follows:
Case 1: If TIMESTAMP (T) < READ_TIMESTAMP (A) or TIMESTAMP (T) < WRITE_TIMESTAMP (A),
then a younger transaction, with a timestamp greater than that of T, has already read or written
the value of A, and hence the WRITE operation issued by transaction T must be rejected and T must
be rolled back.
Case 2: If neither of the preceding conditions is satisfied, then execute the WRITE operation and set
the value of WRITE_TIMESTAMP (A) to the maximum of WRITE_TIMESTAMP (A) and TIMESTAMP (T).

Thus, the basic timestamp-based protocol ensures conflict serializability by aborting any transaction
that violates these rules. Its advantage over 2PL is that deadlock cannot occur, since transactions never
wait for locks. However, starvation may occur if a transaction is continually aborted and restarted.
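The two rules can be captured in a short sketch. The following Python snippet is a minimal illustration, assuming a simple in-memory DataItem record and an Abort exception to signal rollback; a real DBMS would also restart the aborted transaction with a new timestamp.

class Abort(Exception):
    pass

class DataItem:
    def __init__(self):
        self.read_ts = 0    # READ_TIMESTAMP(A)
        self.write_ts = 0   # WRITE_TIMESTAMP(A)
        self.value = None

def read(item, ts):
    # Rule 1, Case 1: A was already overwritten by a younger transaction.
    if ts < item.write_ts:
        raise Abort(f"READ by T(ts={ts}) rejected: WRITE_TS={item.write_ts}")
    # Rule 1, Case 2: execute the read and advance READ_TIMESTAMP(A).
    item.read_ts = max(item.read_ts, ts)
    return item.value

def write(item, ts, value):
    # Rule 2, Case 1: a younger transaction has already read or written A.
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort(f"WRITE by T(ts={ts}) rejected")
    # Rule 2, Case 2: execute the write and advance WRITE_TIMESTAMP(A).
    item.write_ts = max(item.write_ts, ts)
    item.value = value

A = DataItem()
write(A, ts=2, value=10)       # succeeds, WRITE_TIMESTAMP(A) becomes 2
try:
    read(A, ts=1)              # an older transaction reading a newer value: rejected
except Abort as e:
    print(e)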

8.4.3 Strict Timestamp-Based Protocol


Strict timestamp-based protocol is a variation of the basic timestamp-based protocol that ensures that
schedules are both strict and serializable. In this protocol, if a transaction T issues a READ (A) or
WRITE (A) operation and TIMESTAMP (T) > WRITE_TIMESTAMP (A), then the operation is delayed until
the transaction that last wrote A has committed or aborted. To implement this protocol, a simulated
locking mechanism is used along with the timestamp technique, which ensures that the delayed
operations can proceed only after the corresponding transaction terminates and its effective locks are
released. This protocol also avoids deadlock, because a transaction waits only for older transactions.

8.4.4 Thomas’s Write Rule


Thomas’s write rule is another variation of the basic timestamp-based protocol, and it does not ensure
conflict serializability. It modifies the rules for the WRITE operation as follows:
If a transaction T issues a WRITE operation on the data item A, then the following conditions hold:
 If TIMESTAMP (T) < READ_TIMESTAMP (A), then a younger transaction has already read the value
of A, so the value that T wants to write would arrive too late to be seen by that reader. The WRITE
operation issued by transaction T is therefore rejected and T is rolled back.
 If TIMESTAMP (T) < WRITE_TIMESTAMP (A), then a younger transaction has already written a more
recent value of A, so the value that T wants to write is obsolete. In this case, the WRITE operation is
simply ignored and transaction T continues; this is the key difference from the basic protocol, which
would roll T back.
 If none of the preceding conditions is true, then execute the WRITE operation and set the value of
WRITE_TIMESTAMP (A) to the maximum of WRITE_TIMESTAMP (A) and TIMESTAMP (T).
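Compared with the basic protocol sketched earlier, only the WRITE rule changes. The snippet below is a minimal, self-contained illustration (it re-declares the hypothetical DataItem and Abort helpers); the key point is that an obsolete write is skipped rather than causing a rollback.

class Abort(Exception):
    pass

class DataItem:
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0
        self.value = None

def thomas_write(item, ts, value):
    if ts < item.read_ts:
        # A younger transaction has already read A: T must be rolled back.
        raise Abort(f"WRITE by T(ts={ts}) rejected: READ_TS={item.read_ts}")
    if ts < item.write_ts:
        # Obsolete write: silently ignore it and let T continue.
        return "ignored"
    item.write_ts = max(item.write_ts, ts)
    item.value = value
    return "written"

A = DataItem()
print(thomas_write(A, ts=5, value=10))   # written, WRITE_TIMESTAMP(A) becomes 5
print(thomas_write(A, ts=3, value=7))    # ignored: obsolete write, but no rollback
print(A.value)                           # 10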

8.4.5 Optimistic Concurrency Control


In the protocols discussed so far, a transaction either needs to lock the data items or needs to carry a
timestamp value before it executes; in a multi-version scheme, we additionally need to keep multiple
versions of the same data item just to make READ operations concurrent. All of these protocols involve
overhead, since they must maintain lock information, timestamps or several versions. The optimistic
concurrency control technique does not rely on these mechanisms; instead, it checks for serializability
only at the end of the transaction. Each transaction maintains a local copy of the data items on which
its operations are performed, and all update operations are applied to this local copy. At the end of the
transaction, a validation procedure is performed. If the validation succeeds, the updated values are
written to the database and the transaction commits. The optimistic concurrency control technique
comprises the following three phases:
 READ phase: If transaction T issues a READ operation, it reads the data item and stores it in its
private workspace.
 VALIDATION phase: In this phase, it is checked whether the value of a data item A that transaction
T has read was changed by some other transaction while T was busy updating its local copy of A.
This check ensures serializability of the transaction. If the validation succeeds, the updated data in
the local copy is written to the database and the transaction commits.
 WRITE phase: This phase writes the data from the local copy to the database.
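A much simplified sketch of this idea is shown below. It stands in for the validation test with a per-item version counter kept in an in-memory dictionary; these helpers are illustrative assumptions, and a full optimistic scheme would compare the read and write sets of concurrently committing transactions instead.

database = {"A": {"value": 100, "version": 0}}

def occ_transaction(db, key, update_fn):
    # READ phase: copy the item and its version into a private workspace.
    snapshot_value = db[key]["value"]
    snapshot_version = db[key]["version"]
    local_value = update_fn(snapshot_value)     # all updates happen locally

    # VALIDATION phase: has another transaction changed the item meanwhile?
    if db[key]["version"] != snapshot_version:
        return False                            # validation fails; restart later

    # WRITE phase: copy the local value into the database and commit.
    db[key]["value"] = local_value
    db[key]["version"] += 1
    return True

print(occ_transaction(database, "A", lambda v: v + 50))   # True, A becomes 150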


8.4.6 Distributed Timestamp Concurrency Control


The physical clock reading determines the timestamp of any transaction in a centralized system.
However, because local physical/logical clock readings are not globally unique, they cannot be utilized
as global timestamps in a distributed system. As a result, a timestamp is made up of the site ID and the
clock reading for that site.
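A minimal sketch of such a composite timestamp is shown below, assuming an illustrative SiteTimestamper class that pairs the local clock reading with the site ID; comparing the pairs lexicographically yields a total order across sites.

import time

class SiteTimestamper:
    def __init__(self, site_id):
        self.site_id = site_id
        self._last_clock = 0.0

    def next_timestamp(self):
        clock = time.time()
        if clock <= self._last_clock:        # keep the local component increasing
            clock = self._last_clock + 1e-6
        self._last_clock = clock
        # (clock, site_id) pairs compare by clock reading first, then by site ID,
        # so timestamps generated at different sites are never equal.
        return (clock, self.site_id)

ts_a = SiteTimestamper(site_id=1).next_timestamp()
ts_b = SiteTimestamper(site_id=2).next_timestamp()
print(ts_a != ts_b)   # True: a total order always exists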

Each site has a scheduler that maintains a separate queue for each transaction manager in order
to perform timestamp ordering techniques. A transaction manager sends a lock request to the site’s
scheduler during the transaction. The scheduler routes the request to the appropriate queue in ascending
timestamp order. Requests are handled from the front of the queues in the order of their timestamps,
i.e., the oldest requests are processed first.

8.4.7 Conflict Graphs


Another approach is to draw conflict graphs. Transaction classes are defined for this purpose. A
transaction class has two data sets, known as its read set and its write set. A transaction belongs to a
particular class if its read set is a subset of the class’s read set and its write set is a subset of the class’s
write set. During the read phase, each transaction issues read requests for the data items in its read set;
during the write phase, it sends out its write requests.

A conflict graph is generated for the classes that contain active transactions. The graph has vertical,
horizontal and diagonal edges. A vertical edge links two nodes within a class and signals conflicts within
that class. A horizontal edge joins two nodes from different classes and indicates a write-write conflict
between them. A diagonal edge links two nodes from different classes and indicates a write-read or
read-write conflict between the two classes.

The conflict graphs are examined to determine whether two transactions within the same class, or two
transactions from separate classes, may be executed concurrently.
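The edge classification between two classes can be sketched from their read and write sets. The class dictionaries and function below are illustrative assumptions; vertical (intra-class) edges are not shown.

def edge_type(class1, class2):
    ww = class1["write"] & class2["write"]
    wr = (class1["write"] & class2["read"]) | (class1["read"] & class2["write"])
    if ww:
        return "horizontal (write-write conflict)"
    if wr:
        return "diagonal (write-read / read-write conflict)"
    return "no inter-class conflict edge"

c1 = {"read": {"A", "B"}, "write": {"A"}}
c2 = {"read": {"C"}, "write": {"B"}}
print(edge_type(c1, c2))   # diagonal (write-read / read-write conflict)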

8.4.8 Distributed Optimistic Concurrency Control Algorithm


The optimistic concurrency control algorithm is extended by the distributed optimistic concurrency
control algorithm. Two rules are used for this extension:
 Rule 1: states that when a transaction runs, it must be verified locally at all locations. A transaction
is aborted if it is discovered to be invalid at any point. Local validation ensures that the transaction
remains serializable at the locations where it has been performed. A transaction is globally validated
once it has passed the local validation test.
 Rule 2: states that if a transaction passes the local validation test, it should be globally verified.
Global validation guarantees that if two conflicting transactions occur concurrently at more than
one site, they commit in the same relative order at all of them. This may need a transaction waiting
for the other conflicting transaction to complete before committing. Because a transaction may
not be able to commit as soon as it is validated at a site, this requirement makes the algorithm less
optimistic.


8.5 QUERY PROCESSING IN DISTRIBUTED DATABASE


In a distributed database management system, query processing requires data transmission across a
network of computers. In such a system, a distribution strategy for a query is the ordering of data
transmissions and local data processing. A query in a distributed DBMS typically requires data from
multiple sites, and this movement of data between sites results in communication costs. Because of the
communication cost of transferring data over the network, query processing in a distributed DBMS
differs from query processing in a centralised DBMS. When sites are connected via high-speed networks,
the transmission cost is modest, but it is considerable in other networks. A distributed database query
is processed in the following stages:
 Query mapping: The input query on distributed data is formally specified using a query language.
It is then translated into an algebraic query on the global relations. This translation is performed
by referring to the global conceptual schema and does not take data distribution and replication
into account; as a result, it is nearly identical to the translation performed in a centralised DBMS.
The query is first normalised, then checked for semantic errors, simplified, and finally restructured
into an algebraic query.
 Regionalization (data localization): Fragmentation in a distributed database causes relations to be
stored at different sites, with certain fragments possibly replicated. Using data distribution and
replication information, this stage converts the distributed query on the global schema into separate
queries on individual fragments.
 Global query optimization: Optimization is the process of picking, from a list of alternatives, a
strategy that is closest to optimal. Permuting the sequence of operations within the fragment query
produced by the previous stage yields a list of candidate queries. Time is the preferred unit of cost
measurement, and the total cost is a weighted combination of costs such as CPU, I/O and
communication costs. Because distributed databases are linked via a network, communication
expenses over the network are frequently the most significant. This is especially true if the sites are
linked by a wide area network (WAN).

8.5.1 Costs of Distributed Query Processing


We have discussed the challenges that arise while processing and optimizing a query in a centralized
DBMS. Several additional considerations complicate query processing in a distributed system. The first
is the cost of transferring data over the network. This data comprises both intermediate files that are
transferred to other sites for additional processing and final result files that may need to be transferred
to the site where the query result is required. Although these costs may not be significant if the sites are
linked by a high-performance local area network, they can be considerable in other types of networks.
As a result, DDBMS query optimization algorithms use the reduction of data transfer as an optimization
criterion when selecting a distributed query execution strategy. We illustrate this with a simple example
query. Assume that the EMPLOYEE and DEPARTMENT relations are stored at two different sites and that
neither relation is fragmented. The size of the EMPLOYEE relation is 100 * 10,000 = 10^6 bytes, and the
size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes. Consider the query Q: for each employee,
retrieve the employee name and the name of the department for which the employee works. This can be
stated as follows in relational algebra:
Q: πFname, Lname, Dname (EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)
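As a rough, back-of-the-envelope illustration of why data transfer dominates the choice of strategy for Q, the sketch below compares shipping one relation to the other relation's site before performing the join there. The relation sizes are the ones given above; treating the number of transferred bytes as the only cost is a simplifying assumption.

EMPLOYEE_SIZE = 100 * 10_000     # 10**6 bytes, stored at one site
DEPARTMENT_SIZE = 35 * 100       # 3,500 bytes, stored at the other site

# Strategy 1: ship EMPLOYEE to the DEPARTMENT site and join there.
cost_ship_employee = EMPLOYEE_SIZE

# Strategy 2: ship DEPARTMENT to the EMPLOYEE site and join there.
cost_ship_department = DEPARTMENT_SIZE

print(cost_ship_employee, cost_ship_department)   # 1000000 versus 3500
# Shipping the small DEPARTMENT relation is cheaper by a factor of roughly 285;
# this is the kind of comparison a distributed query optimizer makes.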


8.6 CONCLUSION

 A distributed database is a collection of numerous conceptually interconnected databases that are
scattered across a network.
 The distributed database management system (DDBMS) is a concept in which data and possibly
certain control components are distributed among numerous computers linked by a network.
 The architecture of a system affects its structure, and distributed database management systems
can be implemented in a homogeneous or heterogeneous manner.
 An HDDBS is a network of two or more Oracle Databases that run on one or more PCs.
 A node is a machine in a network that can host one or more databases.
 A database server is the Oracle software that manages a database.
 A client is an application that requests data from a server.
 The hq database host acts as a database server when a statement is issued against its local data.
 The storing of data copies at several network locations is known as data replication.
 The process of determining where to store data is known as data allocation.
 Data fragmentation is the process of dividing a single object into two or more segments.
 The feature of distributed databases that conceals the internal specifics of the distribution from
users is known as distribution transparency.
 Transaction is a unit of database operation. It can be a read operation, a write operation, or a read/
write operation.
 Strict timestamp-based protocol is a variation of basic timestamp-based protocol that ensures a
schedule to be strict and serializable.
 Thomas’s write rule is another variation of basic timestamp-based protocol and it does not ensure
conflict serializability.

8.7 GLOSSARY

 Distributed database: A collection of numerous conceptually interconnected databases that are
scattered across a network.
 Distributed Database Management System (DDBMS): A concept in which data and possibly certain
control components are distributed among numerous computers linked by a network.
 Architecture: The overall structure of a system; distributed database management systems can be
implemented in a homogeneous or heterogeneous manner.
 HDDBS: A network of two or more Oracle Databases that run on one or more PCs.
 Node: A machine in a network that can host one or more databases.
 Database server: The Oracle software that manages a database.


 Client: An application that requests data from a server.
 Data replication: The storing of data copies at several network locations.
 Data allocation: The process of determining where to store data.
 Data fragmentation: The process of dividing a single object into two or more segments.
 Distribution transparency: The feature of distributed databases that conceals the internal specifics
of the distribution from users.
 Transaction: A unit of database operation. It can be a read operation, a write operation, or a read/
write operation.
 Strict timestamp-based protocol: A variation of the basic timestamp-based protocol that ensures
schedules are strict and serializable.
 Thomas’s write rule: A variation of the basic timestamp-based protocol that does not ensure conflict
serializability.

8.8 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. A distributed database system is made up of loosely connected sites that have no physical components
in common. Describe the concept of distributed database.
2. Explain the significance of data replication.
3. Transaction is a unit of database operation. Discuss.
4. Evaluate the importance of homogeneous distributed database systems.
5. Each node in a distributed database system can function as a client, a server, or both. Discuss.

8.9 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. A distributed database is a collection of numerous conceptually interconnected databases that are
scattered across a network. Refer to Section Introduction
2. The storing of data copies at several network locations is known as data replication. Refer to Section
Distributed Databases Principles
3. A transaction is a sequence of many actions grouped together and considered as an atomic unit of
work. Refer to Section Transaction Control in Distributed Database
4. An HDDBS is a network of two or more Oracle Databases that run on one or more PCs. Refer to
Section Distributed Databases Principles
5. A database server is the Oracle software that manages a database, whereas a client is an application
that requests data from a server. Refer to Section Distributed Databases Principles


8.10 POST-UNIT READING MATERIAL

 https://vulms.vu.edu.pk/Courses/CS712/Downloads/Principles%20of%20Distributed%20
Database%20Systems.pdf
 https://www.google.co.in/books/edition/Principles_of _Distributed_Database_Syste/
TOBaLQMuNV4C?hl=en&gbpv=1&dq=distributed+database&printsec=frontcover

8.11 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates the concept of distributed database. Also, discuss the
importance of distributed database.

UNIT

09 Data Warehousing

Names of Sub-Units

Introduction to Data Warehousing, Need for Data Warehousing, Features of a Data Warehouse, Benefits
of Data Warehousing, Data Warehouse, Common Architecture for Data Warehouse, Data Warehousing
Design, Data Warehouse Design Strategies, Data Warehousing Life Cycle, Data Warehouse and Data
Marts

Overview

The unit begins by discussing the concept of data warehouse. Next, the unit explains the need and
features of data warehouse. Further, the unit explains the data warehouse architecture and the
common architecture for data warehouse. Then, the unit discusses the data warehousing design. The
unit also explains the data warehouse design strategies. Towards the end, the unit discusses the data
warehouse and data marts.

Learning Objectives

In this unit, you will learn to:


 Describe data warehouse and its features
 Explain the data warehouse architecture and the common architecture for data warehouse
 Define the data warehousing design
 Explain the data warehouse design strategies
 Describe the data warehouse and data marts

Learning Outcomes

At the end of this unit, you would:


 Assess your knowledge of data warehouses and their features
 Examine the data warehouse architecture and the common architecture for data warehouse
 Analyse the data warehousing design
 Evaluate the concept of data warehouse design strategies
 Assess your knowledge of data warehouses and data marts

Pre-Unit Preparatory Material

 https://docs.oracle.com/en/database/oracle/oracle-database/21/dwhsg/database-data-
warehousing-guide.pdf

9.1 INTRODUCTION
A data warehouse is a database management system that facilitates online analytical processing by
allowing the data to be viewed in different dimensions or perspectives, to provide business intelligence.

Data warehouse can also be referred as a collection of data designed to support decision-making process
of the management. Data warehouses contain various data that present a clear picture of the business
conditions at a single instant of time. The development of a data warehouse includes the development of
systems that are used to extract data from operating systems. The installation of a warehouse database
provides the managers with access to the data in a flexible manner. Data warehousing generally refers
to the combination of several different databases across a whole enterprise.

In 1980, Bill Inmon coined the term ‘data warehouse’. According to him, “A data warehouse is a subject-
oriented, integrated, time-variant and non-volatile collection of data in support of management’s
decision-making process.” In this definition, it is very important to emphasise words like subject-oriented,
integrated, time-variant and non-volatile.

“Subject-oriented” signifies that the data points to critical subjects such as sales, inventory, etc.
“Integrated” signifies that the data is generated from various sources. “Time-variant” implies that the
data is stored in such a way that all the updated data is stored with the time it is updated. “Non-volatile”
implies that data is never removed, i.e., historical data is also kept. So, whenever a change takes place
in any of the field values of a table, the previous values also need to be kept in addition to the present
values.

9.2 THE NEED FOR DATA WAREHOUSING


In today’s competitive business environment, information is power, and to make a place in such an
environment, decision makers need to take right decisions for which they need the right information
at the right time. RDBMS (Relational Database Management System) is extensively used by all


organisations to obtain routine management information, which is required to take strategic decisions.
This information is collected through a number of complex queries.

For instance, the database of a TV manufacturing company contains a lot of information about the
profile of purchasers, time of purchase, model purchased, etc. The management may like to know the
answer to complex questions such as “which is the fastest selling model of TVs, in which period the
highest number of sets of this model have been sold, and what is the profile of the customers buying this
model?” This information is important to plan the production schedule and also to change the marketing
strategy so that the marketing team can focus on a particular market segment during a particular
period of time in a year. Such ad hoc complex information that needs to be obtained from the historical
data stored in the database, after necessary analytical processing, is called business intelligence. The
systems that provide business intelligence are now known as ‘data warehouses’.

Data warehousing in coordination with business intelligence tools and techniques facilitates decision
making by taking care of the following:
 Risk management
 Cost control
 Revenues or business results
 Customer satisfaction

Each decision can significantly impact organisations by reducing costs and increasing revenue through
the use of DWH and BI tools, for the following:
 Avoiding problems, such as poor credit risks
 Negotiating improvement in supply
 Dropping unprofitable products
 Reducing waste due to poor quality
 Understanding and providing better customer service
 Focusing on the most profitable products
 Improving cross-selling to customers
 Capitalising on trends
 Capturing growing marketing opportunities

Besides these, the need for data warehousing also arises due to the following reasons:
 Increasing demand for strategic information
 Inability of past decision support system

9.3 FEATURES OF A DATA WAREHOUSE


We have aforementioned that a data warehouse is a collection of data, which has certain features. The
features of a data warehouse are as follows:
 Subject-oriented: Data warehouses are designed to help us examine data for supporting decision-
making process of the management. For example, learning about sales data of a particular


organisation, a warehouse can be built that concentrates on sales. With the help of this warehouse,
we can know about the best customers for a particular item in the previous year. Since we can
define a data warehouse by a specific subject matter (in this case, sales), a data warehouse is subject-
oriented.
 Integrated: Integration is closely related to subject orientation, as both are concerned with the
design of the data warehouse. A data warehouse must put data from different sources into a
consistent format. Problems such as naming conflicts and inconsistencies among units of measure
must be resolved; once they are resolved, the data is said to be integrated.
 Non-volatile: Non-volatile expresses that the data should not change once they have entered into
the data warehouse. This is logical as the aim of a warehouse is to provide us with the ability to
examine what has happened.
 Time-variant: To find general developments and changes in business, analysts require huge amount
of data. This is highly in contrast to Online Transaction Processing (OLTP) systems, where moving
of historical data to an archive is a requirement. The term ‘time-variant’ means a data warehouse’s
attention to changes taking place over a period of time. Typically, on a monthly, weekly, or daily
basis, data flows from one or more OLTP databases into a data warehouse. A staging file is used to
process data, which flows in before adding them to the data warehouse.
 Data granularity: It refers to the level of detail. The finer the data granularity, the more detailed
the data that is stored. In operational systems, data is stored at the lowest (most detailed) level.
However, we may require data that is an aggregation of this lowest level of detail. For example,
sales details may be stored on a daily basis in the database; however, for analysis, we may require
them on a monthly, quarterly or yearly basis. In a data warehouse, we can keep data at different
levels of granularity depending on the queries to be executed.

9.4 BENEFITS OF DATA WAREHOUSING


The successful implementation of a data warehouse brings various benefits to the organisation. These
benefits are as follows:
 Allows access to a wide variety of consistent and integrated data drawn from a number of different
sources
 Enables analysis of the data available in the organisation, which can be used to improve the
business by observing previous trends and predicting the future, so that appropriate measures can
be taken
 Presents results in different forms, e.g., reports, graphs, etc.
 Helps to provide better enterprise intelligence and enhanced customer service

9.5 DATA WAREHOUSE ARCHITECTURE


A data warehouse obtains data from a number of operational database systems, which can be based
on RDBMS or ERP packages. The data from various data sources is converted into a form suitable to be
stored in a data warehouse. This process is called Extract, Transform and Load (ETL) of data into the


target database. In addition to the target database, there is another database to store the metadata,
called metadata repository or metadata database. This database contains data about data—description
of source data, target data and how the source data has been modified into target data. The basic
architecture of a data warehouse is shown in Figure 1:

Figure 1: Data Warehouse Architecture (data sources connected over a LAN/WAN to the target database
holding the ETL’d data, along with a metadata database, an admin console and end-user BI applications)

Figure 1 describes the architecture of a data warehouse. In this figure, the client software is used to
generate the business intelligence reports. The data from the data sources is subjected to the ETL
process and stored in the target database (Data ETL). The stored data is then processed and the outcome
is generated on the basis of the users’ queries. The user interface can be based on a custom-built GUI
or it can be a browser interface. The administration console is used to carry out data warehouse
administration including the periodic ETL process to update the data into the data warehouse from the
data sources. We will describe each of these components in the next sub-sections.

9.5.1 Data Sources


An organisation interested in developing a data warehouse has to first study where the existing
information is located, i.e., what the available data sources are. Some organisations use databases that
were developed decades ago, while others use advanced systems such as ERP, Customer Relationship
Management (CRM) and Supply Chain Management (SCM) packages. Data sources can thus be:
 Flat files, which contain the data with one record per line in a proprietary format.
 Relational databases based on Oracle, MS SQL, Sybase, Informix, IBM DB2, etc. with Open Database
Connectivity (ODBC).
 ERP and CRM applications such as Oracle Applications, JD Edwards, PeopleSoft, SAP, etc.
 Mainframe databases such as DB2, VSAM

If an organisation has offices located in different places (perhaps in different countries), it is likely that
the database systems were also developed without any standardisation of the field names and their


attributes. While taking/using data from various data sources, a data warehouse developer must keep
the following points in mind:
 The character representation of data taken from different data sources may be different (most
computers use ASCII code; the mainframes use EBCDIC format).
 The table names may be different (customer details table can be customer_details in one database
and cust_details in another database).
 The field names may be different (emp_name and Employee_name; both mean the same thing for
us but for computers, they are different).
 The date may be stored in different formats in different databases (16 Sept. 15 or 16/9/15 or 9/16/2015).
 Units of measure may be different (liters or gallons).
 Currency values may be different (price field may contain dollars but in one database, US dollars
and in another database, Singapore dollars).
 In some databases, the fields may be NULL.

Data warehouse specialists need to study the minute details of these data sources before starting with
the development of the warehouse.

9.5.2 Data Staging


Data staging is the process of transferring data from various data sources (operational systems)
into the target database of the data warehouse using the Extract, Transform and Load (ETL) process.
Figure 2 shows the data staging process:

Figure 2: Data Staging (data from operational systems passes through extract, cleaning, transform and
load steps, with quality assurance, into the database server)

In other words, you can say that data staging is the process of data extraction from the source database,
carrying out the necessary transformations and loading into the target database as well as carrying
out the data quality assurance before a formal release of the data warehouse.


9.5.3 ETL Process


As shown in Figure 2, the ETL process can be divided into the following stages:
 Extracting data from various data sources
 Cleaning of data in which it is ensured that important data is not missing, field names are changed
appropriately for good readability, etc.
 Transforming source data in a form suitable for the data warehouse. Transformation of the data
from source databases involves a number of modifications to be done on the data to make it suitable
to be stored in a data warehouse. Various types of transformations that can be carried out include:
 Joining two or more tables
 Dimensionalising data, i.e., converting Entity Relationship (ER) model to multi-dimensional
model
 Normalising data
 Sorting records
 Ranking records
 Aggregating or summarising data
 Changing data types
 Loading data into the target database
 Performing quality assurance checks at every stage of ETL to ensure accuracy, consistency, and
reliability of data. This process is also known as data scrubbing.
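A minimal sketch of a few of these transformations is shown below: renaming a field, standardising a date format and converting units before loading rows into a target table. The field names, formats and conversion factor are illustrative assumptions rather than details of any particular source system.

from datetime import datetime

def transform(record):
    return {
        "employee_name": record["emp_name"],                         # rename field
        "sale_date": datetime.strptime(record["date"], "%d/%m/%y")   # standardise date
                             .strftime("%Y-%m-%d"),
        "volume_litres": round(record["volume_gallons"] * 3.785, 2)  # unit conversion
    }

source_rows = [{"emp_name": "A. Rao", "date": "16/9/15", "volume_gallons": 12.0}]
target_table = [transform(row) for row in source_rows]               # load step
print(target_table)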

The entire ETL process is generally carried out through a GUI. The GUI facilitates:
 Selecting data sources (flat files, RDBMS, ERP, etc.)
 Carrying out the necessary transformations
 Loading the data on the target database
 Creating metadata repository
 Administering the ETL process, monitoring the ETL process, scheduling the ETL jobs, and managing
users in a multi-user environment
 Providing the facility to either clean the data (such as range checks, alerting on the missing data,
etc.) or to integrate third party tools for data quality analysis
 Permitting rollback in case major errors are encountered in the ETL process
 Debugging capability
 Facility of error logging
 Generating business intelligence reports using OLAP and query analysis
 Analysing data from different dimensions
 Generating ad hoc queries


 Mining data
 Performing drill-down and drill-up operations
 Performing what-if analysis
 Generating a standard report and presenting it in graphical and tabular forms.
 Setting alerts to give indications to the managers on the occurrence of important events (e.g., the
sales target for the month has been achieved)

9.5.4 Data Aggregation


Data stored in summarised form is called aggregate and the process of summarising is called
aggregation. In DBMS, the transaction data is stored, which serves as the source data for a data
warehouse. Online analysis of large transaction data takes a lot of time and hence, the queries cannot
be processed fast. To make online analytical processing faster, some important data is stored in the
summarised or aggregated form. Another advantage of using aggregates is that the storage space can
be reduced considerably.

As an example, OLTP systems may store each and every transaction of an ATM. An aggregate can store
the total transactions made by each individual in a month in a summarised form. Alternatively, an
aggregate may store only the total transaction amount on a daily basis without keeping individual
information. How much aggregation is required and what data has to be aggregated is decided at the
time of designing by keeping in mind the needs of business intelligence.

Consider the example of the sales of a company. Assume that for each product, there are three models.
Separate databases are maintained at different showrooms to track the daily transactions and take
the necessary reports at the end of the day and end of the month. Now, the company wants to develop
a data mart and decides to discontinue a particular brand and introduce a new brand in each of the
three products. The manager wants information periodically so that he can take the right decision at
the right time. A simple mechanism for this purpose is to use aggregates from which month-wise sales
of each product can be taken and summarised.

However, note that the manager’s requirements cannot be predicted. To start with, he may first ask
for the summarised data, and then he may go in for details. To change the view of the data to analyse
it in greater detail is called “drill-down”. To change the view of the data to view only the aggregates or
summaries is called “drill-up”.
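A minimal sketch of this kind of aggregation is shown below: daily transaction records are rolled up (drilled up) to monthly totals per model. The sample records and the month key are illustrative assumptions.

from collections import defaultdict

daily_sales = [
    {"date": "2015-09-16", "model": "TV-A", "amount": 12000},
    {"date": "2015-09-17", "model": "TV-A", "amount": 18000},
    {"date": "2015-10-02", "model": "TV-B", "amount": 9000},
]

# Drill-up: summarise daily transactions to (month, model) aggregates.
monthly = defaultdict(int)
for row in daily_sales:
    month = row["date"][:7]                 # e.g. "2015-09"
    monthly[(month, row["model"])] += row["amount"]

print(dict(monthly))
# {('2015-09', 'TV-A'): 30000, ('2015-10', 'TV-B'): 9000}
# Drill-down is the reverse: going back to the detailed daily rows behind a
# selected aggregate.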

9.6 COMMON ARCHITECTURE OF DATA WAREHOUSE


There are various data warehouses and corresponding architectures, depending on the exact details of
the situation of an organisation. Three common architectures are as follows:
 Data warehouse architecture (Basic)
 Data warehouse architecture with a staging area
 Data warehouse architecture with a staging area and data marts


9.6.1 Data Warehouse Architecture (Basic)


Data going into the data warehouse comes from the data sources, and end users can directly access the
data, which is derived from several sources. Figure 6 shows the simple data warehouse architecture:

Figure 6: Data Warehouse Architecture (Basic): data sources (operational systems and flat files) feed the
warehouse, which holds metadata, summary data and raw data and serves users performing analysis,
reporting and mining

9.6.2 Data Warehouse Architecture with a Staging Area


The operational data need to be cleaned and processed before storing them in the warehouse. These can
be done programmatically; however, most data warehouses use a staging area that simplifies building
summaries and general warehouse management. Figure 7 shows a data warehouse architecture with
a staging area:

Figure 7: Data Warehouse Architecture with a Staging Area (data from operational systems and flat files
passes through a staging area into the warehouse, which holds metadata, summary data and raw data
and serves users performing analysis, reporting and mining)


Data going into the data warehouse comes from the data sources (operational systems and flat files)
and is used by users for analysis, reporting and mining. The operational data from the data sources
first goes to a staging area.

9.6.3 Data Warehouse Architecture with a Staging Area and Data Marts
Data marts are built to focus on particular functions of business for a specific community within an
organisation. The data warehouse architecture with a staging area and data marts is shown in Figure 8:

Figure 8: Data Warehouse Architecture with a Staging Area and Data Marts (data from operational
systems and flat files passes through a staging area into the warehouse and then into data marts such as
Purchasing, Sales and Inventory, which serve users performing analysis, reporting and mining)

Besides the above-mentioned three types of architecture, data warehouse architecture can also be
categorised into the following:
 Centralised data warehouse: In this type of architecture, queries and applications access data from
the central repository.
 Independent data marts: In this type of architecture, different departments of an organisation
create data mart for their usage independently.
 Federated: In this type of architecture, information sharing and interaction take place between a
number of different systems. Data is either logically or physically integrated.
 Hub and spoke: This type of architecture is based on the top-down approach in which all the
requirements of the organisation are analysed first and then the development takes place at an
enterprise-level.
 Data mart bus: This type of architecture is based on the bottom-up approach in which individual
process requirements are analysed such as orders placed, deliveries to be done, customer calls or
billing.

9.7 DATA WAREHOUSING DESIGN


A data warehouse is a single data repository that integrates records from many data sources for online
analytical processing (OLAP). This means that a data warehouse must suit the demands of all


business stages across the whole organisation. As a result, data warehouse design is a very complicated,
time-consuming, and hence error-prone process. Furthermore, business analytical functions evolve
with time, resulting in changes in system requirements. As a result, data warehouses and OLAP systems
are dynamic, with a constant design process.

In industry, a data warehouse is not viewed merely as a set of materialised views; it is considered a
database system with specific requirements, such as answering management-related queries. The goal
of the design is to determine how records from various data sources should be extracted, transformed
and loaded (ETL) and arranged in a database known as the data warehouse.

9.8 DATA WAREHOUSING DESIGN STRATEGIES


The design and development process of a data warehouse is to be implemented by keeping in mind the
organisation’s needs and users’ requirements regarding reporting and analysing data. There are two
approaches to the development of a data warehouse:
 Top-down approach
 Bottom-up approach

9.8.1 Top-down Approach


The top-down approach is based on the normalised enterprise data model and involves complete design
and planning up front. It starts with the complete organisational picture and follows a generic approach
to handling data. The disadvantages of using the top-down approach are as follows:
 It involves high cost
 Analysing and combining all relevant data sources is very difficult

9.8.2 Bottom-up Approach


The bottom-up approach to design a data warehouse was designed by Ralph Kimball. The basic idea
behind bottom-up approach is to define and code the basic parts first and then these developed parts
are linked together to form a whole system. In this approach, data marts are considered as basic
entities that store data in the form of tables/relations such as dimension tables and fact tables. The
data warehouse is formed by integrating these data marts.

As this approach starts by building small prototypes first, it involves less effort and cost. A single
data mart usually models a particular business area of a department, such as “Purchases” or “Student
Admissions” in an institute. Data warehouses are then developed by associating these data marts. The
integration of data marts into a data warehouse is organised around the conformed dimensions, which
define the potential coordination “points” between data marts. The actual integration of two or more
data marts is then carried out by a procedure known as “drill across”. Drill across works by aggregating
or summarising data along the keys of the conformed dimensions of every fact participating in it,
followed by a join on the keys of these summarised facts.


The most important management task is to ensure that the dimensions shared among data marts are
consistent and can be trusted. Consider a situation in which data marts are integrated within an
organisation. Suppose we have created a Sales data mart in the Sales department. When the Sales data
mart is completed, the business might choose to extend the warehousing activities to the Production
department, which creates the need to develop a Production data mart. For better coordination
between the Sales and Production departments, the Sales data mart and the Production data mart need
to be integrated.

9.9 DATA WAREHOUSING LIFE CYCLE


Data warehousing is a process that gathers and manages structured and unstructured data from
many sources in a centralised repository so that meaningful business decisions can be made. When all
of the data is in one place, it is simpler to undertake analysis and reporting and to uncover significant
insights at entirely different levels of aggregation. A data warehouse environment contains an
extraction, transformation and loading (ETL) solution, an OLAP engine, client analysis tools, and several
applications that govern how data is acquired and provided to the business. The phrase data
warehouse life cycle refers to the phases that a data warehouse system goes through between the time it
is constructed and the time it is decommissioned. The data warehousing life cycle is shown in Figure 4:

Figure 4: Data Warehouse Life Cycle (phases: Requirement Specification, Data Modelling, ETL Design and
Development, OLAP Cubes, UI Development, Maintenance, Test and Deployment)

The description of these phases is as follows:


 Requirement Specification: Business analysts perform the initial phase in the building of the Data
Warehouse. In this stage, business analysts develop business requirement specification documents.
More than half of the requirements are gathered from the customer, and it takes 3-4 months to get
all of the requirements. After gathering the requirements, the data modeler begins identifying the dimensions, facts and


combinations depending on the needs. We may call this the data warehouse’s overarching plan.
However, this phase is mainly concerned with establishing business requirements and storing them
in the data warehouse.
 Data Modelling: This is the second stage in the Data Warehouse’s development. Data modelling is
the process of visualising how data is distributed and designing the databases so that the data can
be transformed into a format suitable for storage in the data warehouse. For example, when we
begin building a house, we place everything in the exact position given in the plan; data modelling
plays the same role for a data warehouse. Data modelling aids the organisation of data, the creation
of links between data sets, and the establishment of data compliance and security in line with the
data warehousing aims. This is the most difficult stage of data warehouse creation, and corporations
use various data modelling approaches for warehouse design. Data modelling is often done at the
data mart level and then branches out into the data warehouse; it is the logic of how data is stored
in relation to other data. For data warehouses, there are three data models to choose from (a
minimal star-schema sketch is given after this list):
 Star Schema
 Snowflake Schema
 Galaxy Schema
 ETL Design and Development: This is the third phase in the Data Warehouse’s development. An ETL
tool, which stands for Extract, Transform and Load, may extract data from numerous source systems
and store it in a data lake. An ETL procedure may extract data from the lake, convert it, and put
it into a data warehouse for reporting purposes. ETL tools are required for good performance, good
visualisation, and the ability to establish simple, reproducible and consistent data pipelines between
all current architecture and the new data warehouse. This is where ETL solutions such as SAS Data
Management, IBM Information Server, Hive, and others come into play. A solid ETL process may aid
in the creation of a basic yet functional data warehouse that is used at all levels of the company.
 OLAP Cubes: This is the fourth phase in the Data Warehouse’s development. An OLAP cube, also
known as a multidimensional cube or hypercube, is a data format that enables quick data analysis
based on the various dimensions that constitute a business challenge. A data warehouse would
extract information from a variety of data sources and formats, such as text files, excel spreadsheets,
multimedia files, and so on. The retrieved data is cleaned and converted before being placed into an
OLAP server (or OLAP cube) and pre-processed for further analysis.
Data processing and analysis are often carried out using a basic spreadsheet, with data values
organised in a row and column format. This is best suited for two-dimensional data. OLAP, on
the other hand, comprises multidimensional data, which is often gathered from disparate and
unconnected sources. Using a spreadsheet is not the best option. The cube will logically and orderly
store and evaluate multidimensional data. Data warehouses are now available as a fully constructed
solution that is flexible and capable of staging a variety of data kinds. OLAP cubes are becoming
obsolete because they cannot provide real-time analysis and reporting, and organisations today
need something with great performance.


 UI Development: This is the fifth phase in the Data Warehouse’s development. So far, the processes
outlined have occurred on the backend. There is a requirement for a user interface, i.e., the input
devices and software through which a user and a computer system interact, to quickly access the
data warehouse for analysis and report generation. The primary goal of a UI is to allow a user
to effectively control the device or system with which they are interacting. There are several tools
available on the market to assist with UI creation. For individuals that use BigQuery, BI solutions
such as Tableau or PowerBI are excellent options.
 Maintenance: This is the sixth phase in the Data Warehouse’s development. During this phase, we
can update or alter the schema as well as the application domain or needs of the data warehouse.
Data warehouse maintenance solutions must also provide a way to track schema changes, such as
updates. We may execute Insertion operations and alter dimensions and categories at the schema
level. Adding or removing user-defined properties is an example of a change.
 Test and Deployment: This is frequently the final stage in the Data Warehouse development cycle.
Businesses and organisations test data warehouses to see whether the required business
problems are effectively addressed. Warehouse testing necessitates the examination of massive
amounts of data, and the data to be compared comes from a variety of sources, including relational
databases, flat files, operational data, and so on. The overall testing of a data warehouse project
includes the following steps: data completeness, data transformation, data loading using ETL
tools, data integrity, and so on. The data warehouse is deployed after testing so that users can
quickly access and analyse the data. Essentially, at this phase, the data warehouse is activated and
made available to the users. The majority of the data warehouse’s functions are implemented at the
time of deployment. The data warehouse may be installed in-house or on the cloud.
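As noted in the Data Modelling phase above, the following is a minimal sketch of a star schema: a central fact table whose foreign keys reference dimension tables, queried by joining facts to dimensions. The table and column names are illustrative assumptions, and SQLite is used only for convenience.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales  (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold  INTEGER,
    amount      REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2015-09-16', '2015-09')")
conn.execute("INSERT INTO dim_product VALUES (10, 'TV-A')")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 3, 36000.0)")

# A typical analytical query joins the fact table to its dimensions.
for row in conn.execute("""
    SELECT d.month, p.product_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.product_name
"""):
    print(row)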

9.10 DATA WAREHOUSE AND DATA MARTS


While studying data warehouses, you must have come across the term data mart. Most people who have
little knowledge about data warehousing use the terms data warehouse and data mart synonymously.
Even some authors and vendors often make the same mistake. Let us first discuss these terms.

A data mart is a collection of datasets that support departments in making specific decisions. For
example, in an organisation, both marketing and sales departments have their own data marts. Each
department of an organisation completely owns the software, hardware, and other components that
form its data mart. As the data marts of every department are different from each other, it becomes
difficult even to manage and organise data across different departments of the same organisation.

Unlike a data mart, a data warehouse spans the entire organisation and does not belong to a particular
department. Data warehouses store more information at a finer level of granularity, whereas data
marts hold more summarised data at a coarser granularity.

Data marts that store data required for specific business processes or departments are built from
the data warehouse. According to Bill Inmon, a data warehouse is in the middle of the “Corporate
Information Factory” (CIF), which provides a logical framework for providing business intelligence and
business management skills.


According to many vendors, data warehouses are not easy to build and are expensive. They make
you believe that creating data warehouses is just a waste of time. However, this is not accurate. These
vendors believe that data warehouses are obstacles that stop them from earning profits. They would tell
you about all the drawbacks of a data warehouse that you may encounter while implementing it. Some
vendors might also suggest that you build a data warehouse by building a few data marts and letting
them grow.

9.10.1 Types of Data Marts


As discussed earlier, the terms ‘data warehouse’ and ‘data mart’ have often been used synonymously;
however, there are marked differences between the two.

Depending on the functionality, there are different types of data marts:


 Stand-alone Data Mart: A data mart that does not interact with other data marts is called a stand-
alone data mart.
 Multi-source Data Mart: A data mart for which the input data is obtained from multiple sources is
called a multi-source data mart.
 Personal Data Mart: A data mart that is created to be used by specific individuals, such as Chief
Executive Officer (CEO), Chief Technology Officer (CTO) or Chief Financial Officer (CFO) is called a
personal data mart.

Apart from data marts, there is also an operational data store that provides data to a data warehouse.
Operational data store is a database system that obtains data from different sources, consolidates it,
and keeps it at a single location. However, it does not have the sophistication of a data mart or a data
warehouse for analytical processing. It is generally developed as a precursor for ERP systems.

9.11 CONCLUSION

 A data warehouse is a database management system that facilitates online analytical processing
by allowing the data to be viewed in different dimensions or perspectives, to provide business
intelligence.
 Data warehouse can also be referred as a collection of data designed to support decision-making
process of the management.
 Data warehouses are designed to help us examine data for supporting the decision-making process
of the management.
 A data warehouse obtains data from a number of operational database systems, which can be based
on RDBMS or ERP packages.
 The data from various data sources is converted into a form suitable to be stored in a data warehouse.
This process is called Extract, Transform and Load (ETL) of data into the target database.
 Data staging is the process of transferring data from various data sources (operational systems)
into the target database of the data warehouse using the ETL process.
 Data stored in a summarised form is called aggregate and the process of summarising is called
aggregation.


 Online Transaction Processing (OLTP) systems may store each and every transaction of an ATM.
 A data warehouse is a single data repository that integrates records from many data sources for
online analytical processing (OLAP).
 The design and development process of a data warehouse is to be implemented by keeping in mind
the organisation’s needs and users’ requirements regarding reporting and analysing data.
 The top-down approach starts from a normalised enterprise data model and involves complete
design and planning up front.
 The basic idea behind bottom-up approach is to define and code the basic parts first and then these
developed parts are linked together to form a whole system.
 Data warehousing is the process of gathering and managing structured and unstructured data
from many sources into a centralised repository to support meaningful business decisions.
 A data mart is a collection of datasets that support departments in making specific decisions.

9.12 GLOSSARY

 Data warehouse: A database management system that facilitates online analytical processing
by allowing the data to be viewed in different dimensions or perspectives, to provide business
intelligence.
 ETL: The process in which the data from various data sources is converted into a form suitable to be
stored in a data warehouse
 Data staging: The process of transferring data from various data sources (operational systems) into
the target database of the data warehouse using the ETL process
 Data mart: A collection of datasets that support departments in making specific decisions
 Multi-source data mart: A data mart for which the input data is obtained from multiple sources

9.13 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions

1. Discuss the concept of a data warehouse.


2. Explain the different types of data marts.
3. Discuss the concept of data warehouse architecture.
4. List down the common architecture of data warehouse.
5. Explain the features of the data warehouse.


9.14 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. A data warehouse is a database management system that facilitates online analytical processing
by allowing the data to be viewed in different dimensions or perspectives, to provide business
intelligence. A data warehouse can also be referred to as a collection of data designed to support the
decision-making process of the management. Refer to Section Introduction
2. Depending on the functionality, there are different types of data marts:
 Stand-alone Data Mart: A data mart that does not interact with other data marts is called a
stand-alone data mart.
Refer to Section Data Warehouse and Data Marts
3. A data warehouse obtains data from a number of operational database systems, which can be based
on RDBMS or ERP packages. The data from various data sources is converted into a form suitable to
be stored in a data warehouse. This process is called Extract, Transform and Load (ETL) of data into
the target database. Refer to Section Data Warehouse Architecture
4. There are various data warehouses and corresponding architectures, depending on the exact details
of the situation of an organisation. Three common architectures are as follows:
 Data warehouse architecture (Basic)
 Data warehouse architecture with a staging area
 Data warehouse architecture with a staging area and data marts
Refer to Section Common Architecture of Data Warehouse
5. The features of a data warehouse are as follows:
 Subject-oriented: Data warehouses are designed to help us examine data to support the decision-
making process of the management. For example, to learn about the sales data of a particular
organisation, a warehouse can be built that concentrates on sales.
Refer to Section Features of a Data Warehouse

9.15 POST-UNIT READING MATERIAL

 https://www.oracle.com/in/database/what-is-a-data-warehouse/#:~:text=A%20data%20warehouse%20is%20a,large%20amounts%20of%20historical%20data.
 https://cloud.google.com/learn/what-is-a-data-warehouse


9.16 TOPICS FOR DISCUSSION FORUMS

 Discuss with your classmates the concept of a data warehouse. Also, discuss the various
architectures of a data warehouse.

UNIT

10 Dimensional Modelling

Names of Sub-Units

Dimensional Model, The Star Schema, Query Execution Process, The Snowflake Schema, Fact Tables
and Dimension Tables, Factless Fact Table, Updates to Dimension Tables, Primary Keys, Surrogate
Keys and Foreign Keys, Aggregate Tables, Fact Constellation Schema or Families of Star

Overview
The unit begins by discussing the concept of dimensional model. Next, the unit explains the star
schema, query execution process, and the snowflake schema. Further, the unit discusses the fact
tables, dimension tables, and factless fact table. The unit also discusses the updates to dimension
tables. The unit then discusses the primary keys, surrogate keys and foreign keys. Towards the end, the
unit explains the aggregate tables as well as the fact constellation schema or families of star.

Learning Objectives

In this unit, you will learn to:


 Describe the dimensional model and the query execution process
 Explain fact tables, dimension tables and factless fact tables
 Define how updates are applied to dimension tables
 Explain primary keys, surrogate keys and foreign keys
 Describe aggregate tables as well as the fact constellation schema or families of stars

Learning Outcomes

At the end of this unit, you would:


 Assess knowledge of the dimensional model and the query execution process
 Examine fact tables, dimension tables and factless fact tables
 Analyse how updates are applied to dimension tables
 Evaluate the concepts of primary keys, surrogate keys and foreign keys
 Assess knowledge of aggregate tables as well as the fact constellation schema or families
of stars

Pre-Unit Preparatory Material

 https://www.sciencedirect.com/topics/computer-science/dimensional-model

10.1 INTRODUCTION
Dimensional modelling (DM) is a data structuring approach that is optimised for data storage in a
data warehouse. The goal of dimensional modelling is to optimise the database for quicker data retrieval.
Ralph Kimball created the dimensional modelling idea, which comprises “fact” and “dimension” tables.

A dimensional model for a data warehouse is used to analyse, consolidate and assess numeric
data. In a real-time online transaction system, however, relational models are suitable for inserting,
modifying and removing records. These dimensional and relational models each have their own approach
to storing data, which offers distinct advantages. In the relational model, for example, normalisation
and ER models eliminate data redundancy. A dimensional model in a data warehouse, on the other
hand, organises data in such a way that it is easy to access information and produce reports.

As a result, dimensional models are utilised in data warehouse systems and are not well suited to
transactional (OLTP) systems.

10.2 DIMENSIONAL MODEL


Dimensional modelling represents data as cubes, which makes OLAP data management well suited
to logical data representation. Ralph Kimball created the dimensional modelling concept, which
consists of “fact” and “dimension” tables.

In data warehouses, the dimensional method includes the use of fact and dimension tables to keep track
of past data. Different data modelling strategies are better suited to specific purposes. Normalised
ER models are designed to remove data redundancy, to conduct add, modify and remove actions
rapidly and to get data into a database.


10.2.1 Dimensional Model Vs. ER Model


Table 1 shows the difference between the dimensional model and the ER model:

Table 1: Dimensional Model Vs ER Model

Dimensional Model | ER Model
It is used for dimensional modelling. | It is used for relational modelling.
Data is stored in RDBMS or multidimensional databases. | Data is stored in RDBMS.
Cubes are units of storage. | Tables are units of storage.
Data is de-normalised and used in data warehouses and data marts. | Data is normalised and used for OLTP.
Optimised for OLAP. | Optimised for OLTP.
Few tables; fact tables are connected to dimension tables. | Several tables and chains of relationships among them.
Non-volatile and time invariant. | Volatile (several updates) and time variant.
Summary of bulky transactional data (aggregates and measures) used in business decisions. | Detailed level of transactional data.
MDX is used to manipulate data. | SQL is used to manipulate data.
Used for user-friendly, interactive, drag-and-drop multidimensional OLAP reports. | Used for normal reports.
One fact table for data organisation. | One table per entity.
Maximise understandability. | Minimise data redundancy.
Optimised for retrieval. | Optimised for update.
The data warehousing model. | The transaction processing model.

10.3 THE STAR SCHEMA


The star schema is the most basic kind of Data Warehouse schema. It is referred to as a star schema
because its structure resembles a star. The core of the star in the Star schema can have one fact table
and a number of linked dimension tables. It is also known as the Star Join Schema, and it is designed to
query massive data sets.


Figure 1 shows the star schema:

[Figure: a central fact table (Dealer_ID, Model_ID, Branch_ID, Date_ID, Units_Sold, Revenue) linked to four
dimension tables: Dealer, Date Dim, Branch Dim and Product]

Figure 1: Star Schema

For example, as shown in the figure above, the fact table is in the centre and has a key to each dimension
table, such as Dealer_ID, Model_ID, Date_ID, Product_ID and Branch_ID, along with measure attributes
such as Units_Sold and Revenue.
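
A minimal SQL sketch of this star schema, assuming hypothetical table and column names based on Figure 1
(only two of the four dimension tables are shown), could look like the following. The fact table carries the
foreign keys and the measures, while each dimension table is a single, denormalised table:

CREATE TABLE Dealer_Dim (
    Dealer_ID INT NOT NULL PRIMARY KEY,
    Location_ID INT,
    Country_ID INT,              -- no separate Country lookup table: the dimension is denormalised
    Dealer_NM VARCHAR(50),
    Dealer_CNTCT VARCHAR(50));

CREATE TABLE Date_Dim (
    Date_ID INT NOT NULL PRIMARY KEY,
    Calendar_Year INT,
    Calendar_Quarter INT,
    Calendar_Month INT,
    Full_Date DATE);

CREATE TABLE Revenue_Fact (
    Dealer_ID INT NOT NULL REFERENCES Dealer_Dim(Dealer_ID),   -- foreign keys to the dimensions
    Date_ID INT NOT NULL REFERENCES Date_Dim(Date_ID),
    Model_ID INT NOT NULL,
    Branch_ID INT NOT NULL,
    Units_Sold INT,                                            -- measure
    Revenue DECIMAL(12, 2));                                   -- measure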

The characteristics of the star schema are as follows:


 A single-dimension table is used to represent each dimension in a star schema.
 The dimension table must include the collection of characteristics.
 A foreign key connects the dimension table to the fact table.
 The dimension tables are not connected to one another.
 The fact table would include keys and measurements.
 The Star structure is simple to grasp and optimises disc consumption.
 There is no normalisation of the dimension tables. For example, in Figure 1, Country_ID in the Dealer
dimension does not refer to a separate Country lookup table, as would be expected in an OLTP architecture.
 BI Tools commonly support the schema.


10.4 THE SNOWFLAKE SCHEMA


A snowflake schema is an expansion of a star schema in which the dimension tables are further
normalised, so the data is divided into additional tables. It is named a snowflake because the diagram
looks like a snowflake. In the following example, the country attribute is further normalised into an
individual table. Figure 2 shows the snowflake schema:

[Figure: the same fact table as in Figure 1, but with the Dealer dimension normalised into separate Location
and Country tables and the Product dimension normalised into separate Product and Variant tables]

Figure 2: Snowflake Schema

The characteristics of the snowflake schema are as follows:


 The major advantage of snowflake architecture is that it requires less disc space.
 It is easier to add a new dimension to the schema.
 Because of the many tables, query performance suffers.
 The biggest problem you will have while adopting the Snowflake Schema is that you will need to
undertake additional maintenance efforts due to the increased number of lookup tables.
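
For comparison, here is a minimal SQL sketch of the snowflaked Dealer dimension from Figure 2, using
hypothetical names and independent of the star-schema sketch above. The Location and Country
attributes are normalised out into their own lookup tables:

CREATE TABLE Country_Dim (
    Country_ID INT NOT NULL PRIMARY KEY,
    Country_Name VARCHAR(50));

CREATE TABLE Location_Dim (
    Location_ID INT NOT NULL PRIMARY KEY,
    Region VARCHAR(50),
    Country_ID INT NOT NULL REFERENCES Country_Dim(Country_ID));

-- The Dealer dimension now keeps only a reference to Location,
-- instead of repeating the region and country in every row.
CREATE TABLE Dealer_Dim (
    Dealer_ID INT NOT NULL PRIMARY KEY,
    Location_ID INT NOT NULL REFERENCES Location_Dim(Location_ID),
    Dealer_NM VARCHAR(50),
    Dealer_CNTCT VARCHAR(50));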

10.5 QUERY EXECUTION PROCESS


SQL Server’s Relational Engine handles query processing. It is the process of turning the T-SQL
statements you create into something that can make requests to the Storage Engine and receive the
required results.

To process a query, SQL Server goes through three steps: parsing, optimising and execution. In the
third phase, the optimised plan is scheduled for execution, during which calls are made to the Storage
Engine to obtain the data that forms the results of the query you are running.


Figure 3 shows the query execution process:

[Figure: an SQL query passes through a parser/translator and an optimiser, which consults the database
catalog, before the execution engine retrieves the data]

Figure 3: Query Execution Process

Different stages of the query execution process are as follows:


 Parsing: SQL Server conducts basic checks on the query text during the parsing step. After turning
the query into relational algebra, the database runs the following tests during the parse call:
 Syntax check: the statement is verified to be syntactically valid SQL.
 Semantic check: assesses whether the statement is meaningful, for example, whether the
referenced objects and columns exist.
 Shared pool check: a hash code is generated for each query. This check determines whether that
hash code already exists in the shared pool; if it does, the database does not perform any extra
optimisation or processing steps.
 Optimiser: During the optimisation stage, the database must do a hard parse for at least one unique
Data Manipulation Language (DML) statement and optimise it. The database does not optimise DDL
unless it contains a DML element that requires optimisation, such as a subquery. Optimisation is a
procedure in which numerous query execution plans for fulfilling a request are compared and the
most efficient query plan is chosen for execution.
 Execution engine: Finally, the query is executed and the desired result is returned (a small
illustration follows this list).
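
As a small illustration of these phases, assuming a SQL Server environment and the hypothetical
Revenue_Fact table used earlier, the plan chosen by the optimiser can be inspected without actually
running the query:

-- Return the estimated execution plan as text instead of executing the statement
SET SHOWPLAN_TEXT ON;
GO

-- This statement is parsed and optimised; SQL Server returns the chosen plan
SELECT Branch_ID, SUM(Revenue) AS Total_Revenue
FROM Revenue_Fact
GROUP BY Branch_ID;
GO

SET SHOWPLAN_TEXT OFF;
GO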

10.6 FACT TABLES


A fact table is a core table in a data warehouse’s star structure. A fact table holds quantitative data for
analysis and is frequently denormalised. A fact table can be used in conjunction with dimension tables.
A fact table stores the data to be studied, whereas a dimension table keeps the information about how
the data in the fact table may be evaluated. As a result, the fact table has two sorts of columns: the
foreign key columns connect the fact table with the dimension tables, while the measure columns store
the data under investigation.


Assume a company provides services to customers. Every sale is a fact that happens, and all these facts
are documented in the fact table. For example:

Time ID | Product ID | Customer ID | Unit Sold
4 | 17 | 2 | 1
8 | 21 | 3 | 2
8 | 4 | 1 | 1

Now, we can add a dimensional table about customers:

Customer ID | Name | Gender | Income | Education | Region
1 | Brian Edge | M | 2 | 3 | 4
2 | Fred Smith | M | 3 | 5 | 1
3 | Sally Jones | F | 1 | 7 | 3

In this case, the fact table’s customer ID column is the foreign key that connects to the dimension table.
Following the linkages reveals that row 2 of the fact table captures the information that customer 3,
Sally Jones, purchased two goods on day 8. The corporation would also have a product table and a
timeline to establish what Sally purchased and when she purchased it.
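
A short SQL sketch, using hypothetical Sales_Fact and Customer_Dim tables that mirror the example
above, shows how the foreign key is followed from the fact table to the dimension table:

-- Total units sold per customer, obtained by joining the fact table to its dimension
SELECT c.Name,
       SUM(f.Unit_Sold) AS Total_Units
FROM Sales_Fact AS f
JOIN Customer_Dim AS c
  ON c.Customer_ID = f.Customer_ID      -- foreign key held in the fact table
GROUP BY c.Name;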

There are physical and data constraints when creating fact tables. The eventual size of the item, as
well as access pathways, should be taken into account. Adding indexes can assist with both of these
issues. However, there should be no constraints in terms of logical design. Tables should be constructed
with current and future needs in mind, with as much flexibility as possible built in to allow for future
expansions without having to recreate the data.

10.7 DIMENSIONAL TABLE


A dimension table is a table that maintains the attributes, or dimensions, that describe the objects in a fact
table. In dimension tables, these descriptive attributes are organised as columns in a data warehouse.

For example, the attributes of a customer dimension may contain first and last name, birth date,
gender and so on, whereas the attributes of a website dimension would comprise the site name and URL.
A primary key column in a dimension table uniquely identifies each dimension record (row).

Customer ID | Name | Gender | Income | Education | Region
1 | Brian Edge | M | 2 | 3 | 4
2 | Fred Smith | M | 3 | 5 | 1
3 | Sally Jones | F | 1 | 7 | 3

Fact tables use keys to relate to dimension tables. A system-generated key is used to uniquely identify a
row in a dimension table when it is created in a data warehouse. This key is sometimes referred to as a
surrogate key.


In the dimension table, the surrogate key serves as the primary key. The surrogate key is assigned to
the fact table, and a foreign key is established between the two tables. When the data is connected, it is
joined in the same way as any other join in the database.

10.8 FACTLESS FACT TABLE


Factless fact tables are fact tables that contain no measures; they are used solely to record relationships
between different dimensions. Such tables may also be used to describe events and coverage, including
the information that nothing has transpired. A factless fact table is frequently used to represent
many-to-many connections.

The only columns these tables contain are foreign keys, yet they still capture events of interest that are
identified by the combination of dimension keys. Figure 4 shows the factless fact table:

[Figure: a factless fact table containing only keys (Date Key, Employee Key, Customer Key, Account Key,
Loan Key), each linking to its corresponding dimension table]

Figure 4: Factless Fact Table

Factless tables are classified into two types:


 Tables for Event Tracking: A factless fact table can be used to track events of importance to the
organisation. Attendance at a cultural event, for example, may be monitored by establishing a fact
table with foreign keys (i.e., links to dimension tables) such as speaker/entertainer identifier,
participant identifier, event type and date. The information in this table may then be queried to
answer questions such as which cultural programme or type of programme is the most popular. In
the same way, a factless fact table can record each time a student attends a course, answering
questions such as which class has the highest attendance or what the average attendance of a
certain course is. All such questions are COUNT()-based queries with GROUP BY clauses, so we may
start with counts and subsequently apply other aggregate functions such as AVG, MAX and MIN
(a minimal SQL sketch appears after the examples below).
 Tables of Coverage: Ralph Kimball refers to the second form of factless fact table as a coverage table.
It is employed to support negative analysis reports, that is, reports about what did not happen. To
build a report stating that a store did not sell a product for a given period of time, for example, you
need a fact table that captures all potential combinations. You can then figure out what is missing.


Examples of factless fact tables:

 Logging visitors to an office.
 Recording online clicks made by listed users.
 Keeping track of student attendance and registration activities.
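
A minimal sketch of such an event-tracking factless fact table, using hypothetical table and column
names, together with a typical COUNT-based query:

-- Only foreign keys, no measure columns
CREATE TABLE Attendance_Fact (
    Date_ID INT NOT NULL,
    Student_ID INT NOT NULL,
    Course_ID INT NOT NULL);

-- Which course has the highest attendance?
SELECT Course_ID, COUNT(*) AS Attendance_Count
FROM Attendance_Fact
GROUP BY Course_ID
ORDER BY Attendance_Count DESC;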

10.9 UPDATES TO DIMENSION TABLES


As more and more sales occur each day, more and more rows are added to the fact table, whereas updates
to existing fact rows occur quite seldom. Compared to fact tables, dimension tables are more stable: a
dimension table changes as a result of attribute modifications rather than an increase in the number of
rows.

The following are the ways to update the dimensional tables:


 Slowly Changing Dimensions: Dimensions are usually stable throughout time, but if they aren’t,
they vary slowly. The record’s customer ID remains the same, but the client’s marital status or
location may change over time.
When such a change in attribute values occurs in an OLTP system, the new values simply overwrite
the old ones. However, in a data warehouse, overwriting attributes is not a viable option because
historical data is constantly needed for analysis.
There are three types of changes that may be made to attributes:
 Type 1 Changes: This method replaces an existing value with a new one and does not keep track
of previous values.
 Type 2 Changes: This approach creates a new row for the updated value while keeping the old
row for reporting and historical purposes (see the SQL sketch after this list).
 Type 3 Changes: This method adds a new current value column to an existing record while
keeping the old one.
 Large Dimension Tables: Large dimension tables are both deep and broad, with deep implying a
large number of rows and wide implying a large number of characteristics or columns.
To deal with enormous dimensions, one can extract several small dimensions from a larger
dimension based on the users' interests. Each of these mini-dimensions can be seen as its own star
schema. In the order analysis indicated earlier, for example, a mini-dimension of interest to the
marketing department of a manufacturing company forms such a star schema.
Customers and products are often enormous in size. Because of their bulk, large dimensions are often
sluggish and inefficient. They frequently use numerous hierarchies to conduct OLAP operations
such as drill-down and roll-up.
 Rapidly Changing or Large Slowly Changing Dimensions: For type 2 updates, a new row is produced
with the altered attribute’s new value. This retains the history of characteristics or their former
values.
If some attribute is changed again, a new dimension table row is formed with the new value.


This is viable if the dimension changes only occasionally, such as once or twice a year. For example, the
product dimension, which includes thousands of rows, changes seldom, making it manageable. Even in
the case of customer dimensions, where the number of rows is in the millions, type 2 updates remain
workable and not very complex as long as changes are infrequent. Type 2 modifications become
problematic and difficult when customer dimensions change rapidly.
If the dimension table is rapidly changing and huge, divide it into one or more smaller dimension tables.
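
As a minimal sketch of a type 2 change, assuming a hypothetical Customer_Dim table that carries a
system-generated surrogate key, effective-date columns and a current-row flag, the old row is closed
and a new row is inserted instead of overwriting the attribute:

-- Close the current row for customer 1023, whose marital status has changed
UPDATE Customer_Dim
SET Row_End_Date = '2024-06-30',
    Current_Flag = 0
WHERE Customer_ID = 1023
  AND Current_Flag = 1;

-- Insert a new row carrying the new attribute value; a new surrogate key is generated automatically
INSERT INTO Customer_Dim (Customer_ID, Customer_Name, Marital_Status,
                          Row_Start_Date, Row_End_Date, Current_Flag)
VALUES (1023, 'Sally Jones', 'Married', '2024-07-01', NULL, 1);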

10.10 PRIMARY KEYS


A primary key is a column in a database table that uniquely identifies each row/record. Primary keys
must have distinct values. NULL values are not permitted in a primary key column.

A table can have only one primary key, which may consist of a single field or several fields. When
multiple fields are used as the primary key, it is called a composite key.

If a table has a primary key declared on any field(s), no two records can have the same value in that field
(s). Keep in mind that you will apply these principles while designing database tables.

The following example shows the use of the primary key:


CREATE TABLE Student (
Student_ID int NOT NULL PRIMARY KEY,
Student_Name varchar(20) NOT NULL,
Student_Father_Name VARCHAR(20) NOT NULL);

10.11 SURROGATE KEYS AND FOREIGN KEYS


A surrogate key, also known as a synthetic primary key, is a value that the database generates
automatically when a new record is inserted into a table and that may be defined as the primary key of
that table. It is typically a sequential number with no business meaning: it may be exposed to the user
and the application, or it may exist in the database without being visible to the user or the application.

If a table lacks a natural primary key, we must generate one artificially to uniquely identify a row in the
database; this key is known as the surrogate key or synthetic primary key of the table. The surrogate key,
however, is not always the primary key: if numerous items in the database are linked to the same
surrogate key value, there is a many-to-one relationship between the surrogate key and those items, and
in that case the surrogate key cannot be used as the primary key.

The features of the surrogate key:


 It is created automatically by the system.
 It is used to store an anonymous integer.
 It has a unique value for each record in the table.
 The user or program can never change the value.
 It is sometimes referred to as a factless key, since it is added solely for the purpose of identifying
rows uniquely and includes no meaningful fact (or information) that is valuable for the table.
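
A minimal sketch of a surrogate key, assuming SQL Server's IDENTITY property is used so that the system
generates the value automatically (table and column names are illustrative):

CREATE TABLE Customer_Dim (
    Customer_Key INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key generated by the system
    Customer_ID VARCHAR(20) NOT NULL,             -- natural (business) key from the source system
    Customer_Name VARCHAR(50),
    Marital_Status VARCHAR(10));

-- The surrogate key is never supplied by the user; the system assigns it on insert
INSERT INTO Customer_Dim (Customer_ID, Customer_Name, Marital_Status)
VALUES ('C-1023', 'Sally Jones', 'Single');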


A foreign key is a key that connects two tables. This is also known as a reference key. A foreign key is a
column or set of columns whose values match those of a Primary Key in another table. The connection
between the two tables aligns the Primary Key in one table with the Foreign Key in the other. If a table
has a primary key declared on any field(s), no two records can have the same value in that field (s).

To understand the concept of the foreign key, first create the following table:
CREATE TABLE EMP (
Emp_ID INT NOT NULL PRIMARY KEY,
Emp_Name VARCHAR (20),
Emp_Salary VARCHAR (20));
The following example shows the use of the foreign key:
CREATE TABLE DEPT(
Dept_ID INT NOT NULL PRIMARY KEY,
Dept_Name VARCHAR(20),
Emp_ID INT NOT NULL,
FOREIGN KEY(Emp_ID) REFERENCES EMP(Emp_ID));

10.12 AGGREGATE TABLES


In general, aggregate tables are simply database tables that hold aggregated values. They hold existing
warehouse data that has been aggregated over a set of dimensions. It is easier to access data from an
aggregate table than from the original, far more detailed tables. Such a table decreases the burden on
the database server and improves query processing.

Data is summarised in the aggregate table, and aggregate tables often make up a materialised view. For
example, in sales we may only have transactions at the date level. Suppose we want to produce a report
that shows annual sales by product. In such circumstances, we group the data values by week, month,
quarter and year, and aggregate functions are used to populate these tables.

It provides aggregated data that can be computed using various aggregated functions such as count,
avg, min, max and so on. In an OLAP database, aggregated tables are the most commonly utilised tables.
Aggregate functions are functions that combine the values of a table or column into a single value.

The information is placed into the aggregate table with the help of aggregate functions. Aggregate
tables are used for scalability, and the information is returned quickly. Aggregations implemented
at the database level enhance query performance because queries hit the aggregate table and fetch
data from it instead of going to the detailed table directly.

For example, if a table includes data from the years 2020 and 2021, and the customer needs exact
record counts periodically, such as weekly, monthly and annually, we must establish an aggregate table
that stores the monthly count. For quarterly figures, we must create a separate table and load the
quarterly count into it; we then use those tables in reports so that report performance improves
dramatically.
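
A minimal SQL sketch of building such a monthly aggregate table from a detailed fact table (Sales_Fact
and Monthly_Sales_Agg are hypothetical names, and Sales_Fact is assumed to carry a Sale_Date column):

CREATE TABLE Monthly_Sales_Agg (
    Sales_Year INT,
    Sales_Month INT,
    Product_ID INT,
    Total_Units INT,
    Total_Revenue DECIMAL(14, 2));

-- Summarise the detailed fact rows once, so that reports can hit the smaller aggregate table
INSERT INTO Monthly_Sales_Agg (Sales_Year, Sales_Month, Product_ID, Total_Units, Total_Revenue)
SELECT YEAR(Sale_Date), MONTH(Sale_Date), Product_ID,
       SUM(Units_Sold), SUM(Revenue)
FROM Sales_Fact
GROUP BY YEAR(Sale_Date), MONTH(Sale_Date), Product_ID;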


10.13 FACT CONSTELLATION SCHEMA OR FAMILIES OF STAR


A fact constellation is made up of two or more fact tables that share one or more dimensions. It is also
known as the Galaxy schema.

The logical structure of a data warehouse or data mart is described by the Fact Constellation Schema.
Fact Constellation Schema may be used to create designs using a collection of de-normalised FACT,
Shared and Conformed Dimension tables.

Figure 5 shows the fact constellation schema:

[Figure: two fact tables, Business Results (Product, Quarter, Region, Revenue) and Business Forecast
(Product, Future_Qtr, Region, Projected_Revenue), sharing the Product dimension table (Prod_no,
Prod_name, Prod_descr, Prod_style, Prod_line)]

Figure 5: Fact Constellation Schema

The fact constellation schema is a complex database design, which can make it difficult to summarise
information. It may be used to relate aggregate fact tables to one another or to decompose a complicated
fact table into several simpler fact tables.
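
A minimal SQL sketch of the constellation in Figure 5, with hypothetical names, shows two fact tables
sharing the same (conformed) Product dimension:

CREATE TABLE Product_Dim (
    Prod_No INT NOT NULL PRIMARY KEY,
    Prod_Name VARCHAR(50),
    Prod_Line VARCHAR(30));

-- Fact table I: actual business results
CREATE TABLE Business_Results_Fact (
    Prod_No INT NOT NULL REFERENCES Product_Dim(Prod_No),
    Sales_Quarter VARCHAR(6),
    Region VARCHAR(30),
    Revenue DECIMAL(14, 2));

-- Fact table II: forecasts, sharing the same Product dimension
CREATE TABLE Business_Forecast_Fact (
    Prod_No INT NOT NULL REFERENCES Product_Dim(Prod_No),
    Future_Qtr VARCHAR(6),
    Region VARCHAR(30),
    Projected_Revenue DECIMAL(14, 2));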

10.13.1 Families of Stars


When you look at a single STAR schema with its fact table and surrounding dimension tables, you
can see that it does not represent the full scope of a data warehouse. STAR schema structures may be
found in almost all data warehouses. Each STAR has a distinct function in tracking the measurements
provided in the fact table. When you have a group of connected STAR schemas, you may refer to the
group as a family of STARS. STARS families form for a variety of reasons. To support the aggregates,
you may create a family by simply adding aggregate fact tables and derived dimension tables. You may
occasionally construct a core fact table containing information of relevance to the majority of users and
personalise fact tables for individual user groups. Many variables contribute to the existence of STARS
families.

Dimension tables are shared by the fact tables of the STARS in a family. The time dimension is usually
shared by the majority of the fact tables in the group; it is likely to be shared by all three fact tables in
the example shown in Figure 6. In the other direction, a fact table of a single STAR may share dimension
tables with many other STARS.


Figure 6 shows the families of stars:

[Figure: three fact tables, each linked to its own dimension tables, with several dimension tables shared
between the fact tables]

Figure 6: Families of Stars

It makes sense to collect individual transactions as well as snapshots at defined intervals if you are in a
business such as banking or telephone services. You may then utilise a family of STARS made up of these
related STAR schemas.

10.14 CONCLUSION

 Dimensional modelling (DM) is a data structuring approach that is optimised for data storage
in a data warehouse.
 A dimensional model for a data warehouse has been used to analyse, consolidate, and assess
numeric data.
 A dimensional model in a data warehouse organises data in such a way that it is easy to access
information and produce reports.
 Dimensional modelling uses a cube procedure to represent data, rendering OLAP data management
appropriate for logical data representation.
 In data warehouses, the dimensional method includes the use of fact and dimension tables to keep
track of past data.
 The star schema is the most basic kind of Data Warehouse schema. It is referred to as a star schema
because its structure resembles a star.


 A snowflake schema is an expansion of a star schema in which the dimension tables are further
normalised. It is named a snowflake because the diagram looks like a snowflake.
 A fact table is a core table in a data warehouse’s star structure. A fact table holds quantitative data
for analysis and is frequently denormalised.
 A dimension table is a table that maintains the attributes, or dimensions, that describe the objects in a
fact table.
 Factless fact tables contain no measures; they record only relationships between dimensions.
 Updating owing to a change in the fact table occurs quite seldom. When compared to fact tables,
dimension tables are more stable.
 A primary key is a column in a database table that uniquely identifies each row/record. Primary
keys must have distinct values.
 A surrogate key, also known as a synthetic primary key, is a value generated automatically by the
database when a new record is inserted into a table; it may be defined as the primary key of that table.
 A foreign key is a key that connects two tables. This is also known as a reference key.
 In general, aggregate tables are just database tables that hold aggregated values.
 Aggregate tables are tables that hold existing warehouse data that has been aggregated by a set of
dimensions.
 A fact constellation is made up of two or more fact tables that share one or more dimensions. It is
also known as the Galaxy schema.

10.15 GLOSSARY

 Dimensional modelling (DM): A data structuring approach that is optimised for data storage
in a data warehouse
 Star schema: The most basic kind of data warehouse schema, so called because its structure resembles a star
 Snowflake schema: An expansion of a star schema in which the dimension tables are further normalised
 Fact table: It holds quantitative data for analysis and is frequently denormalised
 Dimension table: A table that maintains the attributes, or dimensions, that describe the objects in a fact
table
 Primary key: A column in a database table that uniquely identifies each row/record
 Surrogate key: A system-generated key that the database assigns automatically when a new record
is inserted into a table and that may serve as the primary key of that table
 Aggregate tables: Tables that hold existing warehouse data that has been aggregated by a set of
dimensions


10.16 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions

1. What exactly is dimensional modelling?


2. What is the snowflake schema?
3. What do you understand by the aggregate table?
4. Explain the concept of dimensional table.
5. What is the difference between the dimensional model and ER model?

10.17 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. Dimensional modelling represents data as cubes, which makes OLAP data management well suited to
logical data representation. Ralph Kimball created the dimensional modelling concept, which consists
of “fact” and “dimension” tables. Refers to Section Dimensional Model
2. A snowflake schema is an expansion of a Star Schema that adds dimension. It’s named a snowflake
because the diagram looks like a snowflake. The dimension tables have been standardised, resulting
in the data being divided into extra tables. Refers to Section The Snowflake Schema
3. In general, aggregate tables are just database tables that hold aggregated values. Aggregate tables
are tables that hold existing warehouse data that has been aggregated by a set of dimensions.
It is easier to access data from an aggregate table than from the original, far more detailed tables. Such
a table decreases the burden on the database server and improves query processing.
Refers to Section Aggregate Tables
4. A dimension table is a table that maintains the attributes, or dimensions, that describe the objects in a
fact table. In dimension tables, these descriptive attributes are organised as columns in a data warehouse.
Refers to Section Dimensional Table
5. The difference between the dimensional model and the ER model is shown in the following table:

Dimensional Model | ER Model
It is used for dimensional modelling | It is used for relational modelling
Data is stored in RDBMS or multidimensional databases | Data is stored in RDBMS
Cubes are units of storage | Tables are units of storage

Refers to Section Dimensional Model


10.18 POST-UNIT READING MATERIAL

 https://www.astera.com/type/blog/dimensional-modeling-guide/
 https://www.holistics.io/books/setup-analytics/kimball-s-dimensional-data-modeling/

10.19 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends about dimensional modelling. Also, discuss the real-life examples of
dimensional modelling.

UNIT

11 Online Analytical Processing – OLAP

Names of Sub-Units

Need for OLAP, OLTP vs. OLAP, OLAP Operations: Roll-up, Drilldown, Slice, Dice, Pivot, OLAP Models:
ROLAP, MOLAP, HOLAP, and DOLAP

Overview

This unit begins by discussing the concept of online analytical processing (OLAP). Then, the unit
discusses the need for OLAP. Further, the unit explains the difference between OLTP and OLAP. Next,
the unit discusses the OLAP operations. Towards the end, the unit discusses the OLAP models.

Learning Objectives

In this unit, you will learn to:


 Describe the concept of online analytical processing (OLAP)
 Explain the need for OLAP
 Define the difference between OLTP and OLAP
 Explain the OLAP operations
 Describe the OLAP models

Learning Outcomes

At the end of this unit, you would:


 Assess knowledge of online analytical processing (OLAP)
 Examine the need for OLAP
 Analyse the difference between OLTP and OLAP
 Evaluate the operations of OLAP
 Assess knowledge of OLAP models

Pre-Unit Preparatory Material

 http://web.mit.edu/profit/PDFS/SlaughterA.pdf

11.1 INTRODUCTION
To attain informed business decisions and take the required actions, business analysts and executives
often need to analyse business data from different aspects. Consider the case of a business organisation
that sells different household products across different regions of the country. To arrive at an appropriate
decision about increasing the sale of a particular product, business executives need to analyse the data
according to the market trends and popularity of the product in various regions. The business executives
may also need to compare the marketing statistics of a product during different seasons of a year or a
particular period in different years. To perform the comparison, the business data needs to be stored in
multidimensional databases. The selective extraction and analysis of the data are then accomplished by
using Online Analytical Processing (OLAP) method.

OLAP is a processing approach that performs multidimensional analysis of business data. It also
facilitates the performance of complex calculations, trend analysis and complicated data modeling.
OLAP is different from Online Transaction Processing (OLTP), which is another processing approach
characterised by various online transactions, including INSERT, UPDATE and DELETE. OLTP processing
approach gives very fast query processing and helps in maintaining data integrity in multi-access
environments.

Before starting the detailed study of OLAP, we must first understand the need for it. The unit starts on
this note and then clearly indicates the difference between OLAP and Online Transaction Processing
(OLTP). It introduces how OLAP supports efficient multidimensional analysis and talks about hypercubes.
Finally, the unit talks about various popular OLAP tools in use.


11.2 NEED FOR ONLINE ANALYTICAL PROCESSING


OLAP is a popular analytical data processing approach used in every industry. As market data grows
ever larger, the need for data analysis arises. Since every transaction gets recorded in a computer, the
rate of data production keeps increasing. A typical OLTP system may record thousands of transactions
in a day, so the data produced in a month or a year is huge. If a company is storing data for a
long duration, then there must be some way to make use of this stored data; hence the requirement for
OLAP, i.e., data analysis.

Users also require an interactive interface for complex analysis and aggregate calculations, and
multidimensional analysis has to be performed effectively. Hence, the need for OLAP arose.

The following points depict the need for OLAP in organisations:


 Tool for analysis of huge data i.e., billions of rows or petabytes of data was in demand. The analysis
must be fast and should be interactive.
 There was a need for multidimensional analysis as data was coming from various parts of an
enterprise or data marts. For example, data was getting generated from the purchasing department,
sales department, stores etc. Every department has its way of storing the data and some important
attributes related to it. Thus, we need a combined analysis with respect to all departments or some
departments with their important attributes.
 While performing multidimensional analysis, traditional tools like report writers, query products,
spreadsheets and language interfaces fail to match the user’s expectations. Complex calculations
are not satisfactorily performed by these tools.
 Standard tools can be used for analysis; but when data grows in size, the performance gets hampered.
It requires more time for complex analysis over a set of varied attributes from different dimensions.
 Complex analysis using OLAP is useful to senior management of various companies to forecast
future trends.

11.3 OLTP VS. OLAP


The architecture of OLAP for multidimensional cubes differs significantly from that of OLTP (Online
Transactional Processing) for databases. To facilitate everyday processing in an organisation, OLTP is
implemented in a relational database. The primary role of an OLTP system is to collect data and store it
in computers. OLTP enables effective data manipulation and storage for everyday activities, resulting
in a massive amount of transactional data. Organisations construct several OLTP systems to manage
massive amounts of daily operations transactional data in a short period of time.

OLAP is intended for data access and analysis to aid management users in strategic decision-making.
The goal of OLAP technology is to aggregate datasets into multidimensional displays while maintaining
system performance. OLTP systems are “customer-oriented,” whereas OLAP systems are “market-
oriented.”


Table 1 summarises the difference between OLTP and OLAP:

Table 1: Difference between OLTP and OLAP

Differences | OLTP | OLAP
Characteristics | Can handle a high volume of small online transactions. | Handles large volumes of data.
Query | Simple queries, such as inserting, deleting and updating data. | Complex queries which require aggregations.
Database Design | Normalised, with a lot of tables. | Typically, fewer tables are used, and star or snowflake schemas can be used.
Method | Traditional DBMS is used. | Uses data warehouses.
Sources | The OLTP system and its associated transactions are the data sources. | OLAP data is derived from various OLTP databases.
Data Quality | A significant amount of effort has been expended to ensure that the data is ACID-compliant. | The data may not be as well organised, but what is truly important is the ability to traverse across the data's dimensions.
Functionality | Online database which modifies a system by controlling and running essential business tasks in real time. | Online database query management system that allows users to discover hidden insights, support planning decisions and solve problems.
Speed | Processing time is typically very short. | Depends on the amount of data. Creating indexes can boost query speed.
Backup and Recovery | Data loss can result in monetary loss as well as legal difficulties if backups are not performed on a regular basis. | Backups are required on a regular basis, and lost data may be reloaded from the OLTP database as needed.

It is difficult to combine OLAP and OLTP into a single centralised database system. The dimensional data
design model utilised in OLAP is far superior to the relational database query employed in the OLTP system
for querying. OLAP may utilise a single central database as a data source, but OLTP may use several data
sources from separate database locations. The dimensional architecture of OLAP is unsuitable for OLTP
systems, owing to redundancy and the loss of data referential integrity. The firm opts for two distinct
information systems, one OLTP and one OLAP. We may infer that OLTP systems are designed to bring data
into computers, whereas OLAP systems are designed to get data or information out of computers.
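
The contrast is easy to see in the kind of statement each system typically runs; here is a minimal sketch
with hypothetical Orders and Sales_Fact tables:

-- Typical OLTP work: record a single transaction, touching one row
INSERT INTO Orders (Order_ID, Customer_ID, Product_ID, Quantity, Order_Date)
VALUES (98231, 1023, 17, 2, '2024-07-01');

-- Typical OLAP work: aggregate a large number of rows across dimensions
SELECT Region, Product_Line, SUM(Revenue) AS Total_Revenue
FROM Sales_Fact
GROUP BY Region, Product_Line;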

11.4 OLAP OPERATIONS IN A CUBE


The data in the multidimensional model are grouped into many dimensions, with each dimension
including numerous levels of abstraction defined by concept hierarchies. This arrangement provides users
with the ability to see data from a variety of angles. A variety of OLAP data cube operations are
available to show these various perspectives, allowing interactive querying and analysis of the data at
hand. As a result, OLAP facilitates a user-friendly environment for interactive data analysis.
Consider the OLAP operations that will be carried out on multidimensional data. The figures below depict
data cubes for a shop's sales. The dimensions location, time and item are all aggregated in the cube,


with the location aggregated by city values, the time aggregated by quarters, and the item aggregated
by item kinds.

11.4.1 Roll-up
The roll-up operation (also known as the drill-up or aggregation operation) aggregates data in a data
cube either by climbing up a concept hierarchy or by dimension reduction. Roll-up is analogous to zooming
out on a data cube. The figure depicts the outcome of a roll-up operation on the location dimension. The
location hierarchy is defined by the order street < city < province or state < country. The roll-up procedure
aggregates the data by ascending the location hierarchy from the city level to the country level.

When dimension reduction is used to perform a roll-up, one or more dimensions are removed from the
cube. Consider a sales data cube with two dimensions, location and time. Roll-up can be accomplished by
removing the time dimension, which results in an aggregate of total sales by location rather than by
location and time.
Figure 1 shows the roll-up operation:

[Figure: a sales cube with dimensions time (quarters), item (types) and location; rolling up on location
aggregates the city-level values (Chicago, New York, Toronto, Vancouver) to the country level (USA, Canada)]

Figure 1: Roll-up Operation


In Figure 1, we are performing the roll-up operation on the hierarchy of the location dimension (from
cities to countries).
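
In a relational (ROLAP) setting, the same roll-up corresponds to changing the level of the GROUP BY
clause; a minimal sketch, assuming hypothetical Sales_Fact and Location_Dim tables:

-- City-level view
SELECT l.City, SUM(f.Revenue) AS Revenue
FROM Sales_Fact f
JOIN Location_Dim l ON l.Location_ID = f.Location_ID
GROUP BY l.City;

-- Roll-up: ascend the location hierarchy from city to country
SELECT l.Country, SUM(f.Revenue) AS Revenue
FROM Sales_Fact f
JOIN Location_Dim l ON l.Location_ID = f.Location_ID
GROUP BY l.Country;
-- Drill-down is the reverse: group by a more detailed level, e.g. GROUP BY l.Country, l.City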

11.4.2 Drilldown
The drill-down operation (also known as roll-down) is the inverse of the roll-up procedure. Drilling down
is similar to zooming in on a data cube. It moves from less detailed to more detailed data. Drilling down
may be accomplished by either stepping down a concept hierarchy for a dimension or by introducing
new dimensions.

Figure 2 shows the drill-down operation:

[Figure: the same sales cube, drilling down on the time dimension from quarter-level values (Q1-Q4) to
month-level values (Jan-Dec)]

Figure 2: Drill-Down Operation


Figure 2 depicts a drill-down operation on the time dimension conducted by stepping down a concept
hierarchy specified as day, month, quarter and year. Drill-down occurs by descending the time
hierarchy from the quarter level to the more detailed month level.

A drill-down may also be achieved by adding a new dimension to a cube since it adds more information
to the supplied data.

11.4.3 Slice
A slice is a subset of the cube corresponding to a single value for one or more dimension members. A
slice operation is used when a client wishes to make a selection on one dimension of a three-dimensional
cube, resulting in a two-dimensional view. In other words, the slice operation selects one dimension of the
given cube, resulting in a sub-cube. Here, we get details along the remaining dimensions by keeping the
selected dimension at a particular value. Figure 3 shows the slice operation:

[Figure: a time x product x location cube sliced on location = Maharashtra, yielding a two-dimensional
time x product view]

Figure 3: Slice Operation

In Figure 3, we consider only the location dimension with the value Maharashtra.

11.4.4 Dice
The logical partitioning of a cube considering more than one dimension is called dice operation. Here,
we get a sub-cube by selecting two or more dimensions. Figure 4 shows dice operation:

[Figure: a dice operation selecting two values on the location dimension (Maharashtra and M.P.) and two
values on the time dimension (2010 and 2011), producing a sub-cube]

Figure 4: Dice Operation

Figure 4 shows a dice operation considering the time in 2010 and 2011 and locations as Maharashtra
and M.P.
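
In SQL terms, slice and dice correspond to fixing dimension values in the WHERE clause; a minimal
sketch, assuming hypothetical Sales_Fact, Location_Dim and Date_Dim tables:

-- Slice: fix a single dimension member (location = 'Maharashtra')
SELECT d.Calendar_Year, f.Product_ID, SUM(f.Revenue) AS Revenue
FROM Sales_Fact f
JOIN Location_Dim l ON l.Location_ID = f.Location_ID
JOIN Date_Dim d ON d.Date_ID = f.Date_ID
WHERE l.State = 'Maharashtra'
GROUP BY d.Calendar_Year, f.Product_ID;

-- Dice: restrict two or more dimensions at once (two states, two years)
SELECT d.Calendar_Year, l.State, f.Product_ID, SUM(f.Revenue) AS Revenue
FROM Sales_Fact f
JOIN Location_Dim l ON l.Location_ID = f.Location_ID
JOIN Date_Dim d ON d.Date_ID = f.Date_ID
WHERE l.State IN ('Maharashtra', 'M.P.')
  AND d.Calendar_Year IN (2010, 2011)
GROUP BY d.Calendar_Year, l.State, f.Product_ID;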


11.4.5 Pivot
A pivot action is also known as a rotation. Pivot is a visualisation procedure that rotates the data axes in
view to show the data in a different way. It might include shifting the rows and columns or transferring
one of the row measurements to the column dimensions. Pivot is a way of visualising a cube from
different dimensional perspectives. In pivot, an analyst can rotate a cube to see its different faces. It is a
visualisation operation. Figure 5 shows pivot operation:

[Figure: a pivot (rotation) of the cube, swapping the product and time axes so the data is viewed from a
different dimensional perspective]

Figure 5: Pivot Operation

11.5 OLAP MODELS


OLAP models provide options for implementing OLAP servers in various ways. There are four main
types of OLAP servers: ROLAP, MOLAP, HOLAP and DOLAP, although other types are also gaining
popularity. Let us look at the first three major types and then the others.

11.5.1 ROLAP (Relational OLAP)


Relational OLAP is the preferred technology when the database size is large, i.e., greater than 100 GB.
Here, the data will not be in a summarised form. Its response time is poor, minutes to hours, depending
on the query type. Figure 6 shows the architecture of ROLAP:

[Figure: the client sends a complex request to the analytical (ROLAP) server, which issues SQL against the
RDBMS server holding the data warehouse and builds dynamic data cubes from the results]

Figure 6: Architecture of ROLAP


As the name implies, ROLAP systems are based on the relational data model. There are ROLAP clients
and a database server that is based on RDBMS. The OLAP server sends the request to the database
server and obtains the response. The multidimensional cubes are generated dynamically as per the
requirement sent by the user. To hide the structure from the user, a layer of metadata is created. This
layer supports mapping between the relational model and business dimensions.


In this architecture, the user interacts with the database frequently. The query is processed by the OLAP
server and/or database server.

This method allows for the creation of several multidimensional views of two-dimensional relational
databases, removing the need to structure data around the desired view. Few products in this category
have robust SQL engines to handle the challenge of multidimensional analysis.

The performance of a ROLAP is determined by the size of the data being processed; it might be sluggish
when the data being processed is huge and quick when the data is little. Although any SQL tool may
access ROLAP, these tools have limitations since SQL statements do not match all of the demands of all
users, especially when doing complicated computations. This includes writing several SQL statements
to manage user requests and being able to write SQL statements based on the optimiser of the DBMS
engine. While ROLAP’s flexibility is a novel feature, there are some products that need the usage of a
denormalised database design.

The advantages of the ROLAP architecture are:


 Provides flexibility as it can address ad hoc queries also. Moreover, there is no limit on the number
of dimensions.
 The tools are less expensive.
 RDBMS requires less storage.
 The network bandwidth requirements will be low because only the results are sent to the client over
the network.
 Tools are simple and easy to use.

The disadvantages of the ROLAP architecture are:


 Response time is poor.
 Iterative analysis and drill-down are very slow.

11.5.2 MOLAP (Multidimensional OLAP)


As ROLAP works on RDBMS, executing complex queries hampers the overall performance. To overcome
this drawback, the MOLAP architecture was proposed with the idea that multidimensional cubes must
be pre-computed. Figure 7 shows the architecture of MOLAP:

[Figure: data from the data warehouse (RDBMS server) is pre-computed into summarised data cubes
stored in the MDBMS (MOLAP) server, which the client queries through a proprietary language]

Figure 7: Architecture of MOLAP


The architecture of MOLAP system is shown in Figure 7. The system consists of:
 OLAP client that provides the front-end GUI for giving the queries and obtaining the reports.
 OLAP server is also known as Multidimensional Database Management System (MDBMS) server.
This is a proprietary database management system which stores the multidimensional data in
‘multidimensional cubes’ and contains the data in summarised form, based on the type of reports
required.
 A machine that carries out the data staging, which converts the data from RDBMS format to MDBMS
and sends the ‘multidimensional cube’ data to OLAP server.

The multidimensional database uses large multidimensional arrays as storage structures.


Multidimensional database management systems are not openly available; they are proprietary
software systems. Consolidation and construction of summarised cubes are provided by these systems.

The main advantages of MOLAP are:


 It provides excellent performance in terms of response time to queries as the data is already stored
in the summarised form.
 Only the results are sent to the client machine and hence the network bandwidth requirements are
low.
 Drill-down is very fast.
 It provides an interface that is effective and easy to use.
 Complex analysis can be done using MOLAP servers.
 MOLAP is capable of managing, analysing and storing large volumes of multidimensional data.
 Improved query performance as a result of better storage, indexing and caching.
 Data sizes are less as compared to a relational database.
 Data at a higher level of aggregates are computed automatically.
 Assist users in analysing bigger, less-defined data sets.
 MOLAP is simpler for the user; hence it is a good model for beginner users.
 MOLAP cubes are designed for quick data retrieval and are best suited for slicing and dicing
activities.

The disadvantages of MOLAP are:


 MDBMS are proprietary to the vendor, who sells the data warehouse development tools.
 The cubes require high storage.
 Data cubes have to be updated periodically as the summaries keep changing.
 MOLAP tools are costlier as compared to ROLAP tools.
 MOLAP lacks flexibility because the data cubes in the server have to be predefined keeping in view
the requirements analysis. Hence, drill-up is very difficult.
 MOLAP solutions may be time-consuming, especially when dealing with huge amounts of data.


 MOLAP products may encounter problems when updating and querying models with more than ten
dimensions.
 MOLAP is not well suited to storing detailed, transaction-level data.
 If the data set is sparse, storage utilisation may be poor.

Hence, multidimensional OLAP is used when the database size is small, say less than 100 GB and the
data is available in the summarised form.

11.5.3 HOLAP (Hybrid OLAP)


Hybrid OLAP is an amalgamation of the ROLAP and MOLAP models. This system tries to accommodate
the advantages of both models. ROLAP has a good database structure and simple queries can be handled
efficiently. On the other hand, MOLAP can handle complex aggregate queries faster. However, MOLAP is
computationally costlier.

So, we can have a midway. In HOLAP, the relational database structure is preserved to handle simple and
user-required queries. Instead of computing all the cubes and storing them in MOLAP server, HOLAP
server stores only some important and partially computed cubes or aggregates so that when we require
higher scalability and faster computation, the required aggregates can be computed efficiently. Thus,
HOLAP possesses the advantages of both ROLAP and MOLAP. Figure 8 shows the architecture of HOLAP:

[Figure: the client sends simple SQL queries to the RDBMS server holding the data warehouse and complex
cube queries to the MDBMS server holding summarised data cubes; a MOLAP engine coordinates the two]

Figure 8: HOLAP Architecture

In a HOLAP system, the user may access data from both the relational and multidimensional
databases by utilising basic queries for relational databases and complicated aggregate searches for
multidimensional databases. Because the querying process is complex and the same queries may be
performed on a multidimensional system, it is more effective to design a cached data server just for
aggregates and data that is repeatedly requested from the relational database.

The users, typically analysts or reporting experts, query the HOLAP system to retrieve the relevant data.
Querying encompasses both the relational and the multidimensional databases. Simple queries against the
relational database pass their data through the cache system, while complicated queries, together with the
data they retrieve, are kept in the cache for the current flow. The data retrieved from both systems is then
offered to the user for analysis or for creating reports that support corporate decision-making.


Because the multidimensional system is complicated, its cache must be maintained on a continuous
basis, that is, refreshed every time an update is made to the MDDB system. Owing to its simpler structure,
the RDBMS itself does not require caching; the cache mechanism is instead used to integrate the relational
results with those obtained from the MDDB. Because the relational database handles the bulk of the data
while the MDDB handles its aggregation and analysis, this division allows for a more efficient data flow.
When aggregates are cached, the corresponding queries are saved so that the work of constructing and
applying the same queries is not repeated.

This improves the speed of the HOLAP system while consuming a large amount of storage space. Because
HOLAP can benefit from both ROLAP and MOLAP, the loss of storage capacity is unavoidable.

The advantages of HOLAP are as follows:


 Better accessibility as compared to ROLAP and MOLAP models.
 Faster querying thanks to the Caching feature, which reduces the amount of time spent going back
and forth.
 Query performance is moderate, faster than ROLAP but slower than MOLAP.
 Because only summarised data is stored in the cubes while the detailed data remains in the relational
store, the cubes are smaller than in MOLAP.
 When data volume is likely to grow over time, this is the ideal choice.
 When compared to ROLAP and MOLAP systems, it has a higher processing capability.
 Because HOLAP contains a relational database, it can hold a large amount of data.

The disadvantages of HOLAP are as follows:


 The architecture is more complex than either ROLAP or MOLAP, since both a relational database and
a multidimensional cube store have to be maintained and kept synchronised.
 The cache must be maintained continuously, i.e., refreshed every time the MDDB is updated, which
adds processing overhead.
 Although caching speeds up queries, it consumes a large amount of additional storage space; because
HOLAP draws on both ROLAP and MOLAP, this loss of storage capacity is unavoidable.

11.5.4 DOLAP
When the OLAP tool communicates with desktop databases to retrieve the data, it is called Desktop OLAP
(DOLAP). It is a technology by which the user is able to download a part of the OLAP data locally and then
perform the required processing locally. DOLAP performs the multidimensional analysis on the client
machine, which has already obtained the data from the multidimensional database servers.
DOLAP is a single-tier technology that enables users to download a small hypercube on their desktop
machine from a central point or server to perform multidimensional analyses. This central point can be
a data mart or data warehouse. Note that in the case of DOLAP, during the process of multidimensional
analyses the client machine remains disconnected from the server. The downloaded cubes, generally,
contain summarised data or are organised in a fixed structure of dimensions.


The advantages of DOLAP are as follows:


 Enables the user to modify or calculate data locally from the result set obtained from the server
 Provides good query performance in collecting, aggregating, and calculating the data during
multidimensional analysis
 Useful for mobile users as they cannot always stay connected with the data warehouse.

The only disadvantage of DOLAP is that it provides limited functionality and data capacity.

11.6 LAB EXERCISE


Perform various OLAP operations such as slice, dice, roll up, drill down and pivot on the warehouse of
the hospital.

Ans. In the scenario of the hospital with three dimensions Doctor, Patient, Time and measures Count
and Fees, the basic OLAP cube is shown as follows:

Fact: Count (number of times a patient visited a doctor), shown on the visible face of the cube with
dimensions Time (Q1–Q4), Patient (P1–P3) and Doctor (forming the depth of the cube):

        P1   P2   P3
  Q1    04   01   01
  Q2    07   10   01
  Q3    02   02   03
  Q4    01   02   03

Fact: Fees (sum of fees paid by patients), shown on the visible face of the same cube:

        P1     P2     P3
  Q1    1000   200    100
  Q2    2000   4000   100
  Q3    500    500    400
  Q4    100    600    300


And, there are basically five OLAP operations, which are as follows:
 Slice: The following figure shows the Slice operation:

Slice of the Fees cube for Patient = P1:

  Q1    1000
  Q2    2000
  Q3    500
  Q4    100

 Dice: The following figure shows Dice operation:

Dice of the Fees cube for Patients P1, P2 and Quarters Q1, Q2:

        P1     P2
  Q1    1000   200
  Q2    2000   4000

 Drill down: The following figure shows the Drill-down operation:

Drill down for fees: the Time dimension is drilled down from the quarter level (Q1) to the month level:

         P1    P2    P3
  Jan    200   100   00
  Feb    400   00    00
  March  400   100   100


 Rollup: The following figure shows Rollup operation:

Roll up for fees: the Time dimension is rolled up from the quarter level to the year level (years 2010,
2011 and 2012 on the Time axis); the visible face shows yearly totals of 3700, 5300 and 900 for
patients P1, P2 and P3.

 Pivot: The following figure shows Pivot operation:

Pivot for fees: the cube is rotated so that the Doctor dimension appears on a visible axis; the rotated
face shows 500, 1000, 100 for doctor D1 and 1000, 200, 100 for doctor D2 against patients P1–P3,
while the Time axis (Q1–Q4) moves to the depth of the cube.
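The operations above can also be related to a relational (ROLAP) implementation. The following sketch
assumes a hypothetical star schema fees_fact(patient_id, doctor_id, time_id, fees) with a time dimension
time_dim(time_id, month_name, quarter, year); the table and column names are illustrative assumptions
and are not part of the exercise itself.

-- Slice: fix one dimension member (Patient = 'P1')
SELECT t.quarter, SUM(f.fees) AS total_fees
FROM fees_fact f JOIN time_dim t ON f.time_id = t.time_id
WHERE f.patient_id = 'P1'
GROUP BY t.quarter;

-- Dice: restrict several dimensions at once (Patients P1, P2 and Quarters Q1, Q2)
SELECT f.patient_id, t.quarter, SUM(f.fees) AS total_fees
FROM fees_fact f JOIN time_dim t ON f.time_id = t.time_id
WHERE f.patient_id IN ('P1', 'P2') AND t.quarter IN ('Q1', 'Q2')
GROUP BY f.patient_id, t.quarter;

-- Roll-up: climb the Time hierarchy from quarter to year
SELECT f.patient_id, t.year, SUM(f.fees) AS total_fees
FROM fees_fact f JOIN time_dim t ON f.time_id = t.time_id
GROUP BY f.patient_id, t.year;

-- Drill-down: descend the Time hierarchy from quarter to month (here for Q1)
SELECT f.patient_id, t.month_name, SUM(f.fees) AS total_fees
FROM fees_fact f JOIN time_dim t ON f.time_id = t.time_id
WHERE t.quarter = 'Q1'
GROUP BY f.patient_id, t.month_name;

A pivot is purely a presentation step (swapping the axes of the result set), so it is normally handled by
the reporting tool rather than by a separate SQL query.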

11.7 CONCLUSION

 OLAP is a processing approach that performs multidimensional analysis of business data.


 OLAP is different from Online Transaction Processing (OLTP), which is another processing approach
characterised by various online transactions, including INSERT, UPDATE and DELETE.
 OLTP processing approach gives very fast query processing and helps in maintaining data integrity
in multi-access environments.
 The architecture of OLAP for multidimensional cubes differs significantly from that of OLTP (Online
Transactional Processing) for databases.
 OLAP is intended for data access and analysis to aid management users in strategic decision-
making.
 The data in the multidimensional model are grouped into many dimensions, with each dimension
including numerous levels of abstraction defined by concept hierarchies.
 The roll-up operation (also known as the drill-up or aggregation operation) aggregates data on a
data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
 The drill-down operation (also known as roll-down) is the inverse of the roll-up procedure. Drilling
down is similar to zooming in on a data cube. It moves from less detailed to more detailed data.
 A slice is a subset of cubes that correspond to a single value for one or more-dimension members.
 The logical partitioning of a cube considering more than one dimension is called dice operation.
 Pivot is a visualisation procedure that rotates the data axes in view to show the data in a different
way.
 Relational OLAP is the preferred technology when the database size is large, i.e., greater than 100
GB.
 MOLAP architecture was proposed with the idea that multidimensional cubes must be pre-computed.
 In HOLAP, the relational database structure is preserved to handle simple and user-required queries.
 DOLAP is a technology by which the user is able to download a part of the OLAP locally and then
may perform the required processing locally.

11.8 GLOSSARY

 Online Analytical Processing (OLAP): A processing approach that performs multidimensional


analysis of business data
 Online Transaction Processing (OLTP): A processing approach characterised by various online
transactions, including INSERT, UPDATE and DELETE
 Roll-up operation: It aggregates data on a data cube either by climbing up a concept hierarchy for a
dimension or by dimension reduction
 Slice: A subset of cubes that correspond to a single value for one or more-dimension members
 Dice: The logical partitioning of a cube considering more than one dimension
 Pivot: A visualisation procedure that rotates the data axes in view to show the data in a different
way

11.9 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Explain briefly the concept of OLAP.
2. Explain briefly the operations of OLAP.
3. Explain the concept of MOLAP model.
4. Discuss the need of OLAP.
5. List down the difference between OLTP and OLAP.

11.10 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS


A. Hints for Essay Type Questions


1. OLAP is a processing approach that performs multidimensional analysis of business data. It also
facilitates the performance of complex calculations, trend analysis and complicated data modeling.
OLAP is different from Online Transaction Processing (OLTP), which is another processing approach
characterised by various online transactions, including INSERT, UPDATE and DELETE. Refer to
Section Introduction
2. The data in the multidimensional model are grouped into many dimensions, with each dimension
including numerous levels of abstraction defined by concept hierarchies. This arrangement provides
users with the ability to see data from a variety of angles. There are a variety of OLAP data cube
operations available to show these various perspectives, including interactive queries and a search
of the record at hand. Refer to Section OLAP Operations in a Cube
3. The MOLAP architecture was proposed with the idea that multidimensional cubes must be pre-
computed. The architecture of MOLAP system is shown in Figure 7. The system consists of:
 OLAP client that provides the front-end GUI for giving the queries and obtaining the reports.
 OLAP server is also known as Multidimensional Database Management System (MDBMS) server.
Refer to Section OLAP Models
4. The following points depict the need of OLAP in organisations:
 Tool for analysis of huge data i.e., billions of rows or petabytes of data was in demand. The
analysis must be fast and should be interactive.
 There was a need for multidimensional analysis as data was coming from various parts of an
enterprise or from data marts. Refer to Section Need for Online Analytical Processing
5. The following table summarises the difference between OLTP and OLAP:

Differences        OLTP                                       OLAP

Characteristics    Can handle a high volume of small          Handles large volumes of data.
                   online transactions.
Query              Simple queries, such as inserting,         Complex queries which require
                   deleting and updating data.                aggregations.
Database design    Normalised, with a lot of tables.          Typically, fewer tables are used, and
                                                              star or snowflake schemas can be used.

Refer to Section OLTP vs. OLAP

11.11 POST-UNIT READING MATERIAL

 https://minigranth.in/data-warehouse-tutorial/olap-operations
 https://hevodata.com/learn/olap-models/


11.12 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends the concept of OLAP and OLTP. Also, discuss the operations and models of
OLAP.

UNIT

12 ETL Process

Names of Sub-Units

Need of ETL Process, Data Extraction, Identification of Data Sources, Immediate Data Extraction,
Deferred Data Extraction, Data Transformation, Techniques of Data Loading, Loading the Fact Tables,
Loading the Dimension Tables, Challenges in ETL Functions

Overview

The unit begins by discussing the concept of ETL and the need of ETL process. Then, the unit explains
the data extraction. Further, the unit outlines the data transformation and tasks involved in data
transformation. Next, the unit covers the techniques of data loading. After this, the unit explains how
to load the fact tables and dimension tables. Towards the end, the unit sheds light upon the challenges
in ETL functions.

Learning Objectives

In this unit, you will learn to:


 Describe the concept of ETL and the need of the ETL process
 Explain data extraction and its types
 Define data transformation and the tasks involved in data transformation
 Explain the techniques of data loading
 Describe the loading of the fact tables and dimension tables

Learning Outcomes

At the end of this unit, you would:


 Assess the knowledge of ETL and the need of the ETL process
 Examine immediate data extraction and deferred data extraction
 Analyse data transformation and the tasks involved in data transformation
 Evaluate the loading of the fact tables and dimension tables
 Assess the knowledge about the challenges in ETL functions

Pre-Unit Preparatory Material

 https://www.immagic.com/eLibrary/ARCHIVES/GENERAL/WIKIPEDI/W121031E.pdf

12.1 INTRODUCTION
ETL is a process that takes data from various source systems, transforms the data (such as applying
computations, concatenations and so on) and ultimately loads the data into the Data Warehouse system.
ETL stands for Extract, Transform and Load.

It is easy to believe that constructing a Data warehouse entails merely pulling data from many sources
and loading it into a Data warehouse’s database. This is far from the case and a sophisticated ETL
procedure is required. The ETL process, which is technically demanding, necessitates active participation
from a variety of stakeholders, including developers, analysts, testers and senior executives.

To preserve its usefulness as a decision-making tool, the data warehouse system must evolve in tandem
with business developments. ETL is a periodic (daily, weekly, monthly) function of a data warehouse
system that must be agile, automated and properly documented.

In the first step of ETL, data is extracted from the data stores, such as text file, spreadsheets or incoming
data. The data different from the last execution of the ETL process is extracted. Here we use different
methods of extraction, such as logical extraction and physical extraction methods. Then, the data
cleaning procedure is executed. The next step is a transformation where we can include simple data
conversions and extremely complex data scrubbing techniques at the same time. Error correction
technique is applied in this step. The final step is loading where we load the data into the data warehouse.

12.2 NEED OF ETL PROCESS


The need for data processing arises from the fact that real-time data is often incomplete and inconsistent.
The databases store a large amount of data from different data sources dealing with different text
files and spreadsheets. To use these data, we use the process of Extract, Transform and Load. ETL is
a combination of complex processes. It is an integral part of a data warehouse because, once the process
is defined, it runs automatically at pre-defined intervals.


There are several reasons why a business should implement ETL:


 It assists businesses in analysing their company data to make key business choices.
 Transactional databases, for example, cannot answer complicated business queries that ETL can.
 A data warehouse serves as a central repository for data.
 ETL is a way of transporting data from several sources into a data warehouse.
 The Data Warehouse will automatically update when data sources change.
 Well-designed and documented ETL processes are practically necessary to the success of a Data
Warehouse project.
 It allows for the validation of data transformation, aggregation and computation rules.
 The ETL procedure enables the comparison of sample data between the source and target systems.
 The ETL procedure might execute complicated transformations and necessitate more storage space.
 ETL aids in the migration of data into a data warehouse, converting between numerous formats and
types to maintain a consistent system.
 ETL is a specified method for accessing and changing source data to insert it into the destination
database.
 ETL in a data warehouse provides the organisation with extensive historical context.
 It boosts productivity by codifying and reusing information without requiring technological knowledge.

12.3 DATA EXTRACTION


Data extraction is the process of collecting data from a database or SaaS platform so that it may be
duplicated to a destination built to allow online analytical processing (OLAP), such as a data warehouse.
Data extraction is the first phase of the data ingestion process known as ETL: extract, transform and load.
ETL is used to prepare data for analysis or business intelligence (BI).
The main function of the extraction element is to extract data from the source system. When the
extraction process takes place, the original data are retained by the source system and a copy of the
data is taken for processing. To maintain OLTP performance and efficiency, we shift historical data from
an operational OLTP system to a data warehouse. Legacy data are usually copied into the database in their
original form, so that the legacy systems are not forced to do extra processing of the data. To perform the needed transformation,
we keep the extracted data in the staging area of the data warehouse. Extraction of data is easy in
the case of relational databases as it is normally performed in the same source system. The logic of
extraction takes place in the data staging area. The data that are to be extracted is contained in the
source system. This source system may use ODBC, OLEDB or other APIs.
To extract data from legacy systems, you need to produce text files. However, these days, various new
systems also offer the facility of directly querying APIs and accommodating access through ODBC or
OLEDB. Custom applications are used for data extraction processes. These applications are developed
in programming or scripting languages. These languages may include Transact-SQL stored procedures
and Data Transformation Services (DTS) tasks. The two main methods of data extraction are:
 Logical extraction methods
 Physical extraction methods


12.3.1 Logical Extraction Methods


The methods used for extracting the data logically are known as logical extraction methods. The logical
extraction allows you to decide the following:
 Appearance of information to users
 Way of organisation and retrieval of information
 Representation of information using more than one logical view

There are two types of logical extraction methods:


 Full Extraction: In this type of extraction, the data are extracted completely at one time from
the source system. There is no need to keep extra information because this method itself reflects
complete data available in the data source.
 Incremental Extraction: This data extraction method depends on the time when the data are to
be extracted. In the process of incremental extraction, the data which have been changed after a
specified interval of time is extracted. The extraction may be defined by an event that had occurred
in history. This event may define the last time of extraction; thus, the data that have changed after
this event has occurred are identified. This data change is either provided by the source data itself,
such as a data column showing the last change or a separate table where any addition in the data
information keeps getting recorded. The incremental method entails additional logic of maintaining
a separate table.
Generally, many data warehouses do not use this extraction process; rather, the entire table is
extracted from the source system to the data warehouse and it then compares the data with the
already existing data and figures out the required changes that have taken place since the last
extraction. Although this approach might prove to be simpler for the source data, it places a huge
burden on the data warehouse processes, especially in cases where the data volume is too high.
Therefore, incremental data extraction is preferred.
Incremental data extraction can be done in the following two ways:
 Immediate data extraction
 Deferred data extraction
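
As a minimal sketch of incremental extraction based on a timestamp, assume a hypothetical source table
orders with a last_modified column and a control table etl_control that records when the previous
extraction ran; neither table is part of the text above.

-- Extract only the rows changed since the previous ETL run
SELECT o.*
FROM orders o
WHERE o.last_modified > (SELECT last_extract_time
                         FROM etl_control
                         WHERE table_name = 'ORDERS');

-- After a successful extraction, advance the watermark for the next run
UPDATE etl_control
SET last_extract_time = SYSTIMESTAMP
WHERE table_name = 'ORDERS';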

12.3.2 Physical Extraction Methods


Depending upon the logical method adopted for the extraction of data and the limitations and
capabilities of the source data, two physical extraction methods are applied to the extracted data. The
data can be extracted in either of the following two ways:
 Online from the source system
 An offline structure that is already in use or created by an extraction procedure

The two mechanisms of physical extraction are as follows:


 Online extraction: In this method of extraction, the data deal with the sources directly. There is one-
to-one connection between the extraction process and source to extract data. It can also connect to
an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs
or change tables). It is not necessary that an intermediate system differs from the source system


physically. In this method of extraction, you need to consider whether the distributed transactions are
using the original source objects or prepared source objects.
 Offline extraction: In this method of extraction, the data do not deal with the sources directly, but
they are produced outside the source system. The data are either in a pre-existing format (for
example, archive logs, transportable tablespaces or redo logs) or designed by some extraction
routine. A few of these external structures are:
 Flat files: In flat files, the data are in a defined, generic format. Some additional information is
required about the source object for further processing.
 Dump files: Dump files are created in the Oracle-specific format. The information about the
objects can be included or may not be included. It depends on the incorporated utility.
 Redo and archive logs: These are special additional dump files that contain information about the
changes made to the database.
 Transportable tablespaces: A powerful medium to extract the data and then move a huge amount
of data between Oracle databases. Transportable tablespaces are advisable as they offer added
advantages in terms of efficiency and robust processing over other extraction techniques.

12.4 IDENTIFICATION OF DATA SOURCES


The goal of this step is to determine which data sources will be used to populate the data warehouse.
The data sources may be discovered once the data warehouse topic areas and business queries have
been specified. The source of data is as follows:
 Inventory data sources: Model or illustrate operational data sources using the current data
element, table, file and database names and descriptions. Apply the same model to all data sources
(for example, model flat files and hierarchical and network databases using the same model used
for relational databases).
 Identify data elements and data stores: Determine the data repositories (e.g., database tables, files)
and the pieces needed to create business queries. To connect business inquiries to sources and
objects, use an association matrix.

To store data in a data warehouse, the first step is the identification of the source. It is one of the most
critical steps involved in the process of data extraction.
The steps involved in source identification are as follows:
 Every data item stored in the data warehouse needs to be associated with the input source.
 Identify the fact and various attributes involved with the dimension tables.
 For multiple data sources for the same field, use the proper preferred source.
 Use appropriate rules to map a single data item to be distributed to multiple fields.
 Find missing values, corrected values and default values.

12.5 IMMEDIATE DATA EXTRACTION


In this technique, the data extraction is done in real-time. Extraction occurs at the same time as the
transactions are taking place at the source database and files.


Figure 1 shows immediate data extraction process:

[Figure: changes in the source operational systems are captured in one of three ways and written as
extract files to the data staging area – Option 1: capture through the transaction logs of the DBMS;
Option 2: capture through database triggers (output files of trigger programs); Option 3: capture within
the source application itself.]
Figure 1: Immediate Data Extraction Process

Immediate data extraction is facilitated by the following three processes:


 Capture through transaction log: This option takes the help of transaction logs of the DBMS, which
are maintained for data recovery in case of failure. For every transaction like addition, updation or
deletion of a row from a database table, the DBMS immediately makes an entry in the transaction
log file. This transaction log is read by the data extraction technique and selects all the committed
transactions. Thus, there is no extra expense in the operational systems because logs are already
maintained for transaction processing.
 Capture through database triggers: This option applies to all the database applications. As you
know, triggers are special stored procedures that are stored in the database and are fired when
certain predefined events occur. Triggers are created for all the events in which the data are to be
captured. When a trigger is fired, it generates an output that is stored in a separate file. This file is
used to extract data for the data warehouse. For example, if you need to capture all the changes in
the employee table, write a trigger program to capture all updates and deletes in that table (a minimal
trigger sketch is given after this list).
 Capture through source applications: This is also known as application-assisted data capture.
Here, the source application itself assists in the data capture for the data warehouse. You have
to accordingly modify the relevant application programs that write to the source files and
databases. You rewrite the programs to include all adds, updates and deletes to the source files as
well as database tables. Other extract programs can use separate files containing changes to the
source data.
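
The trigger-based capture option can be sketched as follows. The table names employee and
employee_changes, and the columns used, are assumptions for illustration only; a real trigger would
capture whatever columns the warehouse needs.

CREATE OR REPLACE TRIGGER trg_employee_capture
AFTER INSERT OR UPDATE OR DELETE ON employee
FOR EACH ROW
BEGIN
  -- Record the key of the affected row, the kind of change and when it happened
  INSERT INTO employee_changes (emp_id, change_type, change_time)
  VALUES (NVL(:NEW.emp_id, :OLD.emp_id),
          CASE WHEN INSERTING THEN 'I'
               WHEN UPDATING  THEN 'U'
               ELSE 'D'
          END,
          SYSTIMESTAMP);
END;
/

The extraction job then reads employee_changes, instead of the transaction log or the application, to
build its extract file.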

12.6 DEFERRED DATA EXTRACTION


Compared to immediate data extraction, in this technique, data extraction does not capture the changes
in real-time.


Figure 2 shows the deferred data extraction process:

[Figure: deferred extraction writes extract files to the data staging area in one of two ways – Option 1:
capture based on date and time stamp, producing an extract file selected by time stamp; Option 2: capture
by comparing files, where file comparison programs compare yesterday's extract with today's extract
taken from the source operational systems.]

Figure 2: Deferred Data Extraction Process


Deferred data extraction takes place in the following two ways:
 Capture based on date and timestamp: In this case, whenever you create or update a source record,
it is marked with a stamp that contains the date and time of creation or update. On the basis of
this timestamp, records for data extraction are selected. Here, the data capture is different and
capturing is not done while each source record is created or updated, but it is done at a later stage.
This technique works efficiently for the system where the number of revised records is small.
However, deleting source records can cause problems. If a source record is deleted during the two
extraction operations, the information about the deleted record cannot be identified. This problem
is rectified by marking the source record for delete first, doing the extraction run and then going
ahead and physically deleting the record.
 Capture by comparing files: In this technique, two snapshots of the source data are compared
and therefore, it is known as a snapshot differential technique. In this technique, it is necessary
to keep prior copies of all the relevant source data as they may be required for comparing data in
the future. Though the technique is simple, the actual comparison of full rows in a large file can be
very time-consuming and may prove inefficient in the long run. This technique is useful for legacy
data sources. These legacy data sources do not have the logs of the tasks that have been currently
executed or timestamps on the source records.
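
The snapshot differential (file comparison) technique can be sketched with two staging tables holding
yesterday's and today's extracts; the table names snapshot_yesterday and snapshot_today are
assumptions made for this illustration.

-- Rows that are new or have changed since the previous snapshot
SELECT * FROM snapshot_today
MINUS
SELECT * FROM snapshot_yesterday;

-- Rows that have been deleted from the source since the previous snapshot
SELECT * FROM snapshot_yesterday
MINUS
SELECT * FROM snapshot_today;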

12.7 DATA TRANSFORMATION


Transformation is the process of applying a set of rules to the extracted data rather than moving the
data from the source system straight into the destination system. The extracted data is referred to as raw data.
The transformation process employs a set of standards to convert all the disparate data from numerous
source systems into usable data in the DW system. The goal of data transformation is to improve data
quality. All of the logical transformation rules may be found in the data mapping document.


If any source data does not comply with the transformation rules, it is rejected before being loaded
into the target DW system and is stored in a reject file or reject table. Transformation rules are not stated
for straight-load columns, whose data pass from source to destination without any changes. Data
transformations may accordingly be divided into two types: simple and complex. Column conversions,
data structure reformatting and so on are examples of data transformations.

12.7.1 Tasks Involved in Data Transformation


Some of the tasks to be completed during data transformation are listed as follows:
1. Selection: From the source systems, you may pick either the complete table data or a selected
collection of columns of data. Typically, data selection is accomplished during the Extraction process.
If the source system does not allow for the choice of a suitable group of data columns even during
extraction stage, the data should be extracted in its entirety and the choice handled during the
transformation stage.
2. Splitting/joining: Splitting or joining the selected data allows you to alter it. During the
transformation, you will be requested to further separate the specified source data. For instance, if
the entire address is kept in a single huge text field in the source system, the DW system may request
that the address be separated into individual fields, such as a city, state, zip code and so on. This
allows for easy indexing and analysis based on each component separately.
In the DW system, joining/merging two or more columns of data are commonly utilised during the
transformation step. This does not imply combining two fields into one.
3. Translation: The data collected from the source systems may be in different forms for each
data category; therefore, all extracted data should be translated into a standardised format during the
transformation step. A single standard format is easier to understand and to apply to business decisions.
4. Recapitulation: In some cases, DW will seek summary data from the source systems rather than low-
level detailed data, because low-level data is not well suited to business users’ analysis and querying.
For example, while the DW system may not require sales data for each checkout, daily sales by
product or daily sales by shop are useful. As a result, data summarisation may be conducted
during the transformation process based on business needs.
5. Expansion: When a DW column is created by merging one or more columns from different records,
data enrichment will re-arrange the fields in the DW system for a better view of the data.
6. Format changes: Format changes are particularly common during the transition phase. Each
column’s data type and length are updated.
For example, in one source system, a column may be numeric, yet in another source system, the
same column may be text. To standardise this, the data type for this column is changed to text
during the transformation process.
7. Field decoding: When you extract data from numerous source systems, the data may be decoded
differently in each system.
One source system, for example, may indicate customer status as AC, IN and SU. A different system
might indicate the same status as 1, 0 and -1.


During the data transformation process, you must decode such codes into correct values that
business users can comprehend. As a result, the preceding codes can be altered to Active, Inactive
and Suspended.
8. Values calculated and derived: DW can hold additional column data for computations by taking
into account the source system data. Before saving information in DW, you must do the calculations
based on the business logic.
9. Conversion of Date/Time: This is one of the most important data kinds to focus on. The date/time
format may change between source systems. For example, the data in one source could be dated
December 15, 1999. Another source could have the same date as of 12/15/1999. As a result, all date/
time values should be translated into a standard format throughout the data transformation.
10. De-duplication: If there are duplicate records in the source system, ensure that only one record is
imported into the DW system.
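
A few of the tasks above (field decoding, date/time standardisation and de-duplication) can be sketched
in SQL as follows; the staging table stg_customer and its columns are hypothetical names used only for
this example.

-- Field decoding and date standardisation
SELECT cust_id,
       CASE cust_status
            WHEN 'AC' THEN 'Active'
            WHEN 'IN' THEN 'Inactive'
            WHEN 'SU' THEN 'Suspended'
       END                                AS cust_status_desc,
       TO_CHAR(signup_date, 'YYYY-MM-DD') AS signup_date_std
FROM stg_customer;

-- De-duplication: keep only the most recent row per customer
SELECT cust_id, cust_status, signup_date
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY cust_id
                                ORDER BY signup_date DESC) AS rn
      FROM stg_customer s)
WHERE rn = 1;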
Figure 3 shows the flow of data transformation:

[Figure: transformation rules are applied to the staged data; if the data is valid, the transformation
tasks (splitting, joining, conversions, summarisation, format revisions, decoding, etc.) are performed and
the transformed data is loaded into staging; if the data is not valid, the error is reported.]

Figure 3: Data Transformation Flow


12.7.2 Ways of Transforming Data


The following are some ways to transform data:
 Using SQL
 Using PL/SQL statements
 Using table functions

Transforming Data Using SQL

As soon as the data are loaded into a database, the SQL operations are performed for data transformation.

The four basic techniques for implementing SQL data transformations are as follows:
 CREATE TABLE ... AS SELECT and INSERT /*+APPEND*/ AS SELECT: The CREATE TABLE ... AS SELECT
statement is a strong tool to change large sets of data. The INSERT /*+APPEND*/ ... AS SELECT
statement offers the same features as that of the existing database tables.
Generally, data transformation is done through data substitution. In data substitution, a few or all
the values of a single column are changed. Let us assume that our sales table has one of the column
names as channel_id.
In a data warehouse, data can be received from multiple points/sources. This column shows that
the mentioned sales transaction is generated either by the company's own sales force, which is a form of
direct sale, or by a distributor, which is a form of indirect sale. Therefore, the NULL values in this column
must be given a proper key value.
In the following example, we use SQL to insert records into the sales table by selecting the various
fields from the source table.
The structure of the source table sales_monthwise_direct is as follows:
DESC sales_monthwise_direct
Name Null? Type

Invoice_DATE DATE
Item_ID NUMBER
Cust_ID NUMBER
Promocode_ID NUMBER
Total_Amt NUMBER
Total_Quantity NUMBER
In the following code, the data are inserted from sales_monthwise_direct into the sales table. Here,
we truncate the invoice date values to midnight using the TRUNC SQL function and supply a fixed channel
ID value of 13:
INSERT INTO sales SELECT item_id, cust_id, TRUNC(Invoice_date), 13,
promocode_id, Total_quantity, Total_amount
FROM sales_monthwise_direct;
 Transforming Data Using UPDATE: Data substitution can be carried out by using an UPDATE
statement to adjust the sales.channel_id column. The UPDATE statement gives the right result.
However, if a large number of rows need to be adjusted as per the data substitution transformation


changes, then the CREATE TABLE ... AS SELECT (CTAS) statement proves more efficient than an
UPDATE statement.
 Transforming Data Using MERGE: The MERGE keyword is used to update or insert a row conditionally
into a table. Conditions are specified in the ON clause.
 Transforming Data Using Multitable INSERT: In data warehouse environments, it is quite common
to diversify the same source data into several target objects. The multitable INSERT statement offers
a solution for such transformations in a single SQL statement. On the basis of the business
transformation rules, the data may end up in several targets or in exactly one target, and the insertion
may be performed conditionally or unconditionally. The multitable INSERT statement offers clear
benefits over the plain INSERT ... SELECT statement where multiple tables are involved, because with
the latter you need to work with n independent INSERT ... SELECT statements, thereby processing the
same source data n times and increasing the transformation workload n times.
For better performance, the direct-load mechanism can be used with the existing statements, for
example, INSERT ... SELECT executed in parallel, in conjunction with the new statements (see the
sketch after this list).
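
The UPDATE, MERGE and multitable INSERT techniques can be sketched against the sales and
sales_monthwise_direct tables used earlier. The column names on the sales side (time_id, channel_id,
promo_id, quantity_sold, amount_sold) and the target tables big_sales and small_sales are assumptions
for illustration, not definitions taken from the text.

-- Data substitution with UPDATE: give NULL channel identifiers the fixed value 13
UPDATE sales
SET channel_id = 13
WHERE channel_id IS NULL;

-- Conditional update or insert with MERGE
MERGE INTO sales s
USING sales_monthwise_direct m
ON (s.item_id = m.item_id AND s.cust_id = m.cust_id
    AND s.time_id = TRUNC(m.invoice_date))
WHEN MATCHED THEN
  UPDATE SET s.quantity_sold = m.total_quantity,
             s.amount_sold   = m.total_amt
WHEN NOT MATCHED THEN
  INSERT (item_id, cust_id, time_id, channel_id,
          promo_id, quantity_sold, amount_sold)
  VALUES (m.item_id, m.cust_id, TRUNC(m.invoice_date), 13,
          m.promocode_id, m.total_quantity, m.total_amt);

-- Multitable INSERT: route each source row to one of several targets
INSERT FIRST
  WHEN total_amt >= 10000 THEN
    INTO big_sales   (item_id, cust_id, amount) VALUES (item_id, cust_id, total_amt)
  ELSE
    INTO small_sales (item_id, cust_id, amount) VALUES (item_id, cust_id, total_amt)
SELECT item_id, cust_id, total_amt
FROM sales_monthwise_direct;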

Transforming Data Using PL/SQL


Complex transformations in data warehousing are often resolved using a procedural language. We usually
use the PL/SQL language for processing such a transformation when a row-based approach is required,
whereas CREATE TABLE ... AS SELECT (CTAS) operates on the entire table and helps in achieving parallelism.
Let us suppose that we use a PL/SQL procedure that reads the data from a variety of sources using multiple
cursors and merges these data using business rules. As a result, we insert the transformed data into more
than one target table, although it is quite difficult to express such a sequence of operations in a single
SQL statement.

A transformation that is processed using a procedural language can be encapsulated using the ETL
process.

Transforming Data Using Table Functions


Table functions that are implemented in PL/SQL, C or Java are used for pipelined or parallel execution
of transformations. This can be achieved without using the intermediate staging tables that interrupt
the data flow over distinct transformation steps.
A table function is defined as a function that produces a set of rows as output. Moreover, table functions
can obtain a set of rows as input. Prior to Oracle9i, the PL/SQL functions could not:
 Accept cursors as input
 Be parallelised or pipelined

Table functions extend the functionality of the database by allowing:


 Various rows to be returned from a function
 Results of SQL subqueries to be passed directly to functions
 Functions to take cursors as input
 Functions to be parallelised
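
A minimal sketch of a pipelined table function is given below. The object type sales_row_t, the collection
type sales_tab_t, the staging table stg_sales and the target table sales_clean are all assumed names, and
the filtering rule is purely illustrative.

CREATE TYPE sales_row_t AS OBJECT (item_id NUMBER, amount NUMBER);
/
CREATE TYPE sales_tab_t AS TABLE OF sales_row_t;
/
CREATE OR REPLACE FUNCTION transform_sales(src SYS_REFCURSOR)
  RETURN sales_tab_t PIPELINED
IS
  v_item   NUMBER;
  v_amount NUMBER;
BEGIN
  LOOP
    FETCH src INTO v_item, v_amount;
    EXIT WHEN src%NOTFOUND;
    -- Illustrative business rule: pass through only rows with a positive amount
    IF v_amount > 0 THEN
      PIPE ROW (sales_row_t(v_item, v_amount));
    END IF;
  END LOOP;
  CLOSE src;
  RETURN;
END;
/
-- The function accepts a cursor as input and can feed an INSERT directly,
-- so no intermediate staging table is needed
INSERT /*+ APPEND */ INTO sales_clean
SELECT * FROM TABLE(transform_sales(
         CURSOR(SELECT item_id, amount FROM stg_sales)));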


12.8 TECHNIQUES OF DATA LOADING


Loading is relatively simpler than the other two processes. It is desirable to disable any constraints and
indexes before starting the load process and enable them back only after the load process completes.
The ETL tool is used to maintain referential integrity and ensure the consistency of data.

The mechanisms which are used for loading a data warehouse are as follows:
 Loading a data warehouse with SQL*Loader: In this type of loading, the raw data must be made
accessible to the database before any data transformation; for this, the data are first loaded into the
database. This technique is used when the data to be transported are available as flat files.
We use SQL*Loader for transferring data from flat files to an Oracle data warehouse. When this
data load happens, SQL*Loader is used to implement basic data transformations. While using
direct-path SQL*Loader, various data manipulation operations, such as converting data type and
handling simple NULL values, can be automatically resolved during the data load. Generally, direct
path loading is used by data warehouses to enhance performance.
The conventional-path loader provides a richer facility for data transformation than the direct-path
loader: SQL functions can be applied to any column as its values are being loaded, giving a rich
capability for transformations during the data loading operation. However, the direct-path loader is
faster than the conventional-path loader. That is why the conventional-path loader is considered
primarily only when the amount of data to be loaded and transformed is small.
For example, the file sh_sales.dat contains sales transactions aggregated at a daily level. It is an
external file and not all of its columns are loaded into the sales table; the same file is also used for
loading the second fact table of the sh sample schema.
 Loading a data warehouse with external tables: The external data sources are also handled using
external tables. External tables are those that do not reside in the databases. They can be in any
format. We use the external data in the form of a virtual table, which is an external table feature of
Oracle. You can query this virtual table and join it directly or in parallel without loading the external
data in the database. SQL, PL/SQL and Java are used to access the external data.
By using external tables, the loading phase and the transformation phase are pipelined, thus providing
uninterrupted data streaming. A separate staging step is no longer needed before the next stage,
because comparisons and transformations can be applied while the external data are being read.
For example, the functionality of conversion of a conventional load can be used for a direct-path
statement like INSERT AS SELECT in conjunction with the SELECT statement from an external table.
The major difference between external tables and regular tables is that the external tables are read-
only. Therefore, no DML operations, such as UPDATE or INSERT or DELETE, can be executed and no
indexes can be created on them. External tables are particularly useful in places where the complete
external source has to be joined with the existing database objects or where the data have to be
transformed in a complex manner. For example, you may use arbitrary SQL transformations together
with the direct-path insert method instead of several separate SQL*Loader runs.
The external file sh_sales.dat can be represented by an external table named sales_transactions_ext
(a sketch of such a definition is given after this list). Suppose the cost is to be analysed by product and
time using the cost fact table of the sh schema. The operational source data are the same as for the
sales fact table. However, the cost fact table has a coarser granularity than the sales fact table
(aggregation is applied across all the various distribution channels), so some of the dimensions are
suppressed. Because of this suppression, the data cannot be loaded into the cost fact table without
first aggregating the detailed information. The external table framework offers a solution to this
problem, since, unlike a plain SQL*Loader run, aggregation can be applied while the data are being
loaded.
 Loading data warehouse with OCI and direct-path APIs: Flat file system is not needed when the
computation and transformation processes are performed outside the database.
 Loading data warehouse with export/import: When we want to insert data in the same way as we
have received the data into the target system, the features of import and export are used.
Loading takes place in three phases in the data warehouse. The first is ‘Initial load’ in which we
populate all the data warehouse tables for the first time. The second one is the ‘Incremental Load’ in
which we apply ongoing changes periodically and the third one is ‘Full Refresh’ in which contents of
one or more relations are erased completely and then reloaded with fresh data. These three phases
are explained as follows:
 Initial load: In initial loading, you can load the whole data warehouse in a single run or split the
load into separate sub loads.
 Incremental load: Incremental loading performs loading periodically and keeps a track of the
changes using the source system after the initial load is performed.
 Full refresh: Data are periodically rewritten in the entire data warehouse. It means the entire
existing data will be erased or truncated before writing the new data.
The data in the data warehouse can be applied using the following four modes:
 Load: The data are loaded in empty tables; if the tables already exist, then the data are erased
and reloaded.
 Append: The incoming data are added to the existing data in the table. If an incoming record
duplicates an existing one, it may either be added as a duplicate or rejected, depending on how the
process is defined.
 Destructive merge: If the record is new, we add the record to the warehouse. The record is
updated in the case when there is a match in the primary key of the new record against the
existing record.
 Constructive merge: If the primary key of the new record matches that of an existing record, the
old record is left in place and the new record is added, marked as superseding the old record.
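
To illustrate the external-table approach described in the list above, the following sketch defines an
external table over the sh_sales.dat file and aggregates while loading. The directory object data_dir, the
column list, the field delimiter and the target table cost_fact are assumptions rather than the actual sh
schema definitions.

CREATE TABLE sales_transactions_ext (
  prod_id       NUMBER,
  cust_id       NUMBER,
  time_id       DATE,
  channel_id    NUMBER,
  quantity_sold NUMBER,
  amount_sold   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir              -- assumed directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('sh_sales.dat')
)
REJECT LIMIT UNLIMITED;

-- Aggregate while loading, which a plain SQL*Loader run cannot do
INSERT /*+ APPEND */ INTO cost_fact (prod_id, time_id, total_cost)
SELECT prod_id, time_id, SUM(amount_sold)
FROM sales_transactions_ext
GROUP BY prod_id, time_id;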

12.9 LOADING THE FACT TABLES


A fact table is a central table that contains the measures or facts of a business process. A fact table
contains two types of fields: facts or measures, and foreign keys referencing the dimension tables.


For example, in the sales fact table, a measure can be the sales quantity presented in Table 1:

Table 1: Sales Data of a Trading Organisation

Region     Product                          Sales in     Cumulative Sales   Sales in      Cumulative Sales
                                            Quarter-I    in the Year        Quarter-II    in the Year
Region 1   Product I (TV)                   20           20                 25            45
Region 1   Product II (Washing machine)     30           30                 35            65
Region 2   Product I (TV)                   70           70                 60            130
Region 2   Product II (Washing machine)     60           60                 50            110

A data mart can contain one fact table to address one subject. In such a case, when several data marts
are integrated to create a data warehouse, it is important that the facts in each table mean the same
thing. Such facts (i.e., measures or metrics) that have the same meaning in different data marts are
called conformed facts.

Granularity of data is a very important factor in the design of a fact table. In a fact table, we can represent
data at different atomic levels called grains. In the above example of sales data, if the transaction data
do not record the model number of the product sold to each customer, then the sales volume represents
the quantity of the product, not each model. So, some useful data are lost and we cannot get business
intelligence related to models of the products. In the fact table, what detail has to be stored (i.e., what
granularity is required) is decided based on the business intelligence reports to be generated.
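
A hedged sketch of loading such a fact table from staged transactions, looking up the surrogate keys of
the dimension tables, is given below; all table and column names (stg_sales, product_dim, time_dim,
region_dim, sales_fact) are assumptions used only for this illustration.

INSERT INTO sales_fact (product_key, time_key, region_key, sales_quantity)
SELECT p.product_key,
       t.time_key,
       r.region_key,
       SUM(s.quantity)                    -- aggregate to the chosen grain
FROM stg_sales   s
JOIN product_dim p ON p.product_code  = s.product_code
JOIN time_dim    t ON t.calendar_date = TRUNC(s.sale_date)
JOIN region_dim  r ON r.region_name   = s.region_name
GROUP BY p.product_key, t.time_key, r.region_key;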

12.10 LOADING THE DIMENSION TABLES


A dimension table gives the context of the measure/metric indicated in the fact table. Dimension table
answers one of the following questions: who (purchased the product), what (what model was purchased),
when (when the product was purchased, date/time), where (through which regional office the product
was purchased) and how the measure is obtained.

A dimension table has a primary key and a number of attributes. The primary key uniquely identifies
the row in the dimension table and is also used to link to the fact table. The attributes in this table
describe the dimension.

As an example, consider a data mart that analyses the sales data. In a multidimensional model, we
need to create a fact table and different dimension tables, such as product dimension, time dimension,
location dimension and customer dimension. Conceptually, this is very useful because we can analyse
sales data from different views or different dimensions. For example, we can analyse the sales patterns
of different brands/models, sales in different regions based on sales to different customers and sales at
different periods in a year. So, from the dimension tables, you can get the information, such as:
 Persons of what average monthly income purchased a particular model?
 When a particular model’s sales were the highest?

As another example, in a data warehouse for ATMs, the fact table contains the attributes: transaction
amount as the fact and also account number, transaction number and a number of foreign keys. The
dimension tables are transaction dimension, time dimension, location dimension and account dimension.


The time dimension is a dimension you will come across in many data marts and data warehouses. The
time dimension can contain the following attributes:
 Time_ID (an integer)
 CalendarMonth (small integer)
 CalendarQuarter (small integer)
 CalendarYear (small integer)
 FiscalMonth (small integer)
 FiscalQuarter (small integer)
 FiscalYear (small integer)

If the calendar year and the fiscal year are the same, then there is no need for separate attributes for
calendar year and fiscal year.

The concept of a surrogate key is important in a data warehouse. Surrogate means ‘deputy’ or ‘substitute’.
Surrogate key is a small integer (say, 4 byte integer) that can uniquely identify a record in a dimension
table. However, it has no meaning. Some data warehouse experts suggest that the production keys
used in the databases should not be used in dimension tables as primary keys. Instead, in their place,
surrogate keys have to be used, which are generated automatically.
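
One common way of generating such surrogate keys is a database sequence, as in the sketch below; the
table layout and the sequence-based approach are assumptions for illustration, not the only possible
design.

CREATE TABLE time_dim (
  time_id          NUMBER PRIMARY KEY,   -- surrogate key, carries no business meaning
  calendar_date    DATE,
  calendar_month   NUMBER(2),
  calendar_quarter NUMBER(1),
  calendar_year    NUMBER(4)
);

CREATE SEQUENCE time_dim_seq START WITH 1 INCREMENT BY 1;

-- The surrogate key is generated automatically when a row is loaded
INSERT INTO time_dim (time_id, calendar_date, calendar_month,
                      calendar_quarter, calendar_year)
VALUES (time_dim_seq.NEXTVAL, DATE '2024-01-15', 1, 1, 2024);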

12.11 CHALLENGES IN ETL FUNCTIONS


As we perform the ETL process on data in a data warehouse, a few challenges are faced in the ETL functions.
These challenges are as follows:
 Data in the data warehouses come from diverse data sources like data from non-OLTP systems, such
as text files, legacy systems and spreadsheets. Such data also require proper means of extraction,
transformation and loading.
 The process of copying data from one database to another within or external to the organisation in
its simplest form is known as ETL. It is one of the important and time-consuming processes in the
data warehouse environment.
 Structure of sources changes with time with the use of new technology; hence, mapping with the
data warehouse needs to be taken care of.
 The data type, file formats and naming conventions are different for different types of source systems.
 Data may be inconsistent.

12.12 CONCLUSION

 ETL is a process that takes data from various source systems, transforms the data and ultimately
loads the data into the Data Warehouse system.
 The ETL process, which is technically demanding, necessitates active participation from a variety of
stakeholders, including developers, analysts, testers and senior executives.
 ETL is a periodic function of a data warehouse system that must be agile, automated and properly
documented.


 Data extraction is the process of collecting data from a database or SaaS platform so that it may be
duplicated to a destination built to allow online analytical processing (OLAP), such as a data warehouse.
 ETL is used to prepare data for analysis or business intelligence (BI).
 The methods used for extracting the data logically are known as logical extraction methods.
 In the online method of extraction, the extraction process deals with the source system directly.
 In offline extraction, the data do not deal with the sources directly, but they are produced outside
the source system.
 Identification of data sources determines which data sources will be used to populate the data warehouse.
 In immediate data extraction, extraction occurs at the same time as the transactions are taking place
at the source database and files.
 In the deferred data extraction technique, data extraction does not capture the changes in real-time.
 Transformation is the process of applying a set of rules to the extracted data rather than moving the
data from the source system straight into the destination system.
 Loading is relatively simpler than the other two processes. It is desirable to disable any constraints and
indexes before starting the load process and enable them back only after the load process completes.
 A fact table is a central table that contains the measures or facts of a business process.
 A dimension table gives the context of measure/metric indicated in the fact table.

12.13 GLOSSARY

 ETL: A process that takes data from various source systems, transforms the data and ultimately
loads the data into the data warehouse system
 Data extraction: The process of collecting data from a database or SaaS platform so that it may be
duplicated to a destination built to allow online analytical processing
 Transformation: The process of applying a set of rules to the extracted data rather than moving the
data from the source system straight into the destination system
 Fact table: The central table that contains the measures or facts of a business process
 Dimension table: It gives the context of measure/metric indicated in the fact table

12.14 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Define ETL.
2. List down the need of ETL process.
3. Explain the concept of immediate data extraction.
4. State the concept of Data Loading and its techniques.

5. Describe the challenges in ETL functions.

12.15 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. ETL is a process that takes data from various source systems, transforms the data (such as applying
computations, concatenations and so on) and ultimately loads the data into the Data Warehouse
system. ETL stands for Extract, Transform and Load. Refer to Section Introduction
2. There are several reasons why a business should implement ETL:
 It assists businesses in analysing their company data to make key business choices.
 Transactional databases, for example, cannot answer complicated business queries that ETL
can.
 A data warehouse serves as a central repository for data.
Refer to Section Need of ETL Process
3. In immediate data extraction, the data extraction is done in real-time. Extraction occurs at the
same time as the transactions are taking place at the source database and files. Refer to Section
Immediate Data Extraction
4. Loading is relatively simpler than the other two processes. It is desirable to disable any constraints
and indexes before starting the load process and enable them back only after the load process
completes. The ETL tool is used to maintain referential integrity and ensure the consistency of data.
Refer to Section Techniques of Data Loading
5. Few challenges are faced in ETL functions. These challenges are as follows:
 Data in the data warehouses come from diverse data sources like data from non-OLTP systems,
such as text files, legacy systems and spreadsheets. Such data also require proper means of
extraction, transformation and loading.
Refer to Section Challenges in ETL Functions

12.16 POST-UNIT READING MATERIAL

 https://www.astera.com/type/blog/etl-process-and-steps/
 https://medium.datadriveninvestor.com/understanding-extract-transform-and-load-etl-and-its-
necessity-in-data-analytics-world-with-an-64346016153d

12.17 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends the concept of ETL. Also, discuss how to implement the ETL process in
data analysis.

UNIT

13 Business Intelligence (BI)

Names of Sub-Units

BI Architecture, BI in Today’s Perspective, BI Process, Applications of BI, Result Pattern and Ranking
Analysis, Balanced Scorecard, BI in Decision Modelling, Optimisation, and Ethics and BI

Overview
The unit begins by discussing the concept of business intelligence. Next, the unit discusses the BI
architecture and today’s perspective of BI. Further, the unit explains the applications of BI. The unit
also discusses the result pattern and ranking analysis. Then, the unit discusses the balanced scorecard
and decision model in BI. Towards the end, the unit discusses the optimisation and ethics in BI.

Learning Objectives

In this unit, you will learn to:


 Describe the concept of business intelligence (BI)
 Explain the perspective of BI and the BI Process
 Define result pattern and ranking analysis
 Describe the role of BI in decision modelling
 Explain ethics and BI

Learning Outcomes

At the end of this unit, you would:


 Examine the concept of business intelligence (BI)
 Evaluate the BI architecture
 Assess the knowledge about BI process and application of BI
 Analyse the concept of result pattern, ranking analysis and balanced scorecard
 Assess the different BI in Decision Modelling

Pre-Unit Preparatory Material

 https://www.jatit.org/volumes/research-papers/Vol9No1/9Vol9No1.pdf

13.1 INTRODUCTION
Business intelligence (BI) is the process of transforming data from an organisation into insights that
can be utilised to guide business decisions. BI analysts will access and analyse datasets using BI tools,
software or services, and then transform their findings into reports, summaries, dashboards, graphs,
charts or maps.

The introduction of sophisticated data visualisation and reporting technologies in recent years has
altered the discipline, allowing firms to leverage big data insights to find, develop and create new
business prospects.

Business intelligence (BI) is a collection of procedures, structures, and technology that transform raw
data into useful information that drives lucrative business activities. It is a collection of tools and
services designed to turn data into actionable insight and knowledge. Business intelligence (BI) has a
direct influence on an organisation’s strategic, tactical and operational business choices. BI allows for
fact-based decision-making based on previous data rather than assumptions and gut impressions. BI
tools analyse data and generate reports, summaries, dashboards, maps, graphs and charts to offer
users in-depth information on the nature of the business.

13.2 UNDERSTANDING CONCEPT OF BUSINESS INTELLIGENCE (BI)


Business intelligence ideas relate to the use of digital computing technologies like data warehousing,
analytics and visualisation to locate and analyse critical business-based data to develop new, actionable
corporate insights.

BI systems provide present (real-time), historical and predicted views of internally organised data
pertaining to all divisions inside a business, which greatly increases operational understanding and
decision-making.


Simply said, business intelligence is the act of identifying important trends or patterns in data to make
more efficient, accurate judgments about your company’s goals, objectives and plans.

Because pattern recognition is a critical component of business intelligence, artificial intelligence plays a critical role in the BI process. Pattern identification, when approached appropriately, is one of the fundamental characteristics that differentiate BI specialists from BI novices. AI technologies greatly aid in pattern identification by assisting users in discovering vital insights autonomously, making the process more natural, simplified and ultimately more accurate.

A variety of business intelligence principles are implemented through accessible, interactive tools and
dashboards, a centralised place that allows you to easily delve deep into your data.

Business intelligence (BI) is a technology-driven process for evaluating data and delivering actionable
information to leaders, managers and employees to help them make educated business choices.
Organisations collect data from internal and external IT systems, prepare it for analysis, run queries
against the data and create data visualisations, BI dashboards and reports to make the analytics results
available to business users for operational decision-making and strategic planning as part of the BI
process.

The ultimate purpose of BI projects is to drive better business choices, allowing firms to grow revenue,
enhance operational efficiency and gain a competitive advantage over their competitors. To do this, BI
combines analytics, data management and reporting technologies, as well as numerous approaches for
organising and analysing data.

13.3 BI ARCHITECTURE
Business intelligence is described as a collection of mathematical models and analysis procedures that
employ existing data to develop information and knowledge that can be used in complicated decision-
making processes. A business intelligence system gives information and knowledge gathered from data
to decision-makers.

Figure 1 shows the architecture of BI:

[Figure: operational systems (logistics, marketing) and external data flow through ETL tools into the data warehouse, which supports multidimensional cubes, exploratory data analysis, time series analysis, data mining, optimisation and performance evaluation]

Figure 1: BI Architecture


Business Intelligence system includes four major components:


 Data Source: Data is kept in a variety of primary and secondary sources of varying types. The source
comprises structured data from the operational system as well as unstructured documents such as
emails and data received from external providers.
 Data Warehouse and Data Marts: The phrase “data warehouse” refers to the entire collection of
interconnected operations that go into the building, deploying and using a data warehouse. A data
mart is a system that collects all of the data needed by a given firm function, such as marketing or
logistics.
 ETL: The ETL layer serves three primary purposes: data extraction, transformation and loading of data into the data warehouse (a minimal sketch follows this list).
 Extraction: Data is extracted from internal and external sources.
 Transformation: The cleaning and transformation phase’s purpose is to enhance the quality of
data acquired from various sources.
 Loading: Following extraction and transformation, data is loaded into the data warehouse’s
tables.
Data originating from many sources is stored in a database using ETL technologies. These databases
are commonly known as data warehouses and data marts.
 Business intelligence methodologies: Finally, data is collected and used to feed mathematical models
and analytical procedures designed to assist decision-makers. Several decision support applications
may be implemented in a business intelligence system, most of which will be detailed as follows:
 Analysis of multidimensional cubes
 Analysing exploratory data
 Analysing time series
 Data mining inductive learning models
 Optimisation models.
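
The following is a minimal, hypothetical ETL sketch in Python, not a depiction of any particular ETL tool: an in-memory SQLite database stands in for both the operational source and the warehouse, and the table and column names (customers, dim_customer and so on) are invented for illustration.

import sqlite3

# Assumed source: an in-memory stand-in for an operational database
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(" ambrish ", "singh", "in"), ("meera", " rao ", "in")])

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Extract: pull raw rows from the operational system
rows = source.execute("SELECT first_name, last_name, country FROM customers").fetchall()

# Transform: trim whitespace, standardise case and concatenate a full name
clean = [(f"{f.strip().title()} {l.strip().title()}", c.strip().upper()) for f, l, c in rows]

# Load: insert the conformed rows into a warehouse dimension table
warehouse.execute("CREATE TABLE dim_customer (full_name TEXT, country TEXT)")
warehouse.executemany("INSERT INTO dim_customer VALUES (?, ?)", clean)
warehouse.commit()
print(warehouse.execute("SELECT * FROM dim_customer").fetchall())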

13.4 BI IN TODAY’S PERSPECTIVE


You now understand what Business Intelligence is and what it is capable of, but why Business Intelligence is so crucial to modern-day enterprises may still be unclear. The following are the primary reasons to invest in a robust BI strategy and system:
 Acquire new customer insights: One of the key reasons businesses invest time, money and effort
in Business Intelligence is because it allows them to better watch and evaluate current client
purchasing trends. Once you’ve used BI to understand what your customers are purchasing and
why they’re buying it, you can use that knowledge to build goods and product upgrades that match
their expectations and demands and so boost your organisation’s bottom line.
 Increased visibility: Because a BI system improves the visibility of business operations, firms that use Business Intelligence have better control over their processes and standard operating procedures. Skimming through hundreds of pages of yearly reports to gauge performance is a thing of the past.


Business intelligence illuminates all parts of your business, making it easier to find opportunities for
development and allowing you to be proactive rather than reactive.
 Useful information: An excellent Business Intelligence system identifies crucial organisational
patterns and trends. A BI system also enables you to comprehend the ramifications of various
organisational processes and changes, allowing you to make educated decisions and take
appropriate action.
 Increased efficiency: BI Systems aid in the improvement of organisational efficiency, which in
turn boosts production and, perhaps, income. Business intelligence platforms enable organisations
to easily communicate critical information across departments, saving time on reporting, data
extraction and data analysis. Making information exchange easier and more efficient allows firms
to reduce duplicate roles and activities, allowing people to focus on their job rather than data
processing.
 Sales intelligence: Both sales and marketing teams want to maintain track of their customers and
the majority of them use the Customer Relationship Management (CRM) program to accomplish so.
CRMs are intended to handle all customer contacts. They include a plethora of data and information
that may be evaluated and used for strategic goals since they hold all consumer contacts and
interactions. BI systems assist businesses in a variety of ways, including discovering new consumers,
tracking and keeping existing ones, and offering post-sale services.
 Real-Time information: When executives and decision-makers must wait for reports to be generated
by several departments, the data is prone to human error and risks becoming obsolete before it
is even presented for evaluation. BI systems give customers real-time access to data through a
variety of tools like spreadsheets, graphic dashboards and scheduled emails. When using Business
Intelligence technologies, large volumes of data can be swiftly and precisely absorbed, evaluated
and delivered.
 Competitive edge: In addition to all of these wonderful advantages, Business Intelligence may assist
you in gaining insight into what your rivals are doing, helping your firm to make informed decisions
and prepare for future ventures.

13.5 BI PROCESS
BI comprises a broad category of applications and techniques that are used for collecting, storing,
analysing and providing easy access to data to help business users in making better business decisions.
Business intelligence structures are developed to use all of your fundamental business information to
assist management in effective decision-making.

Business Intelligence process consists of following tasks:


 Data gathering: Assemble the required information about the company. The key is to automate the procedure. Data collection used to take a long time and cost a lot of money, but now that we have contemporary computers, it is much quicker to collect data from a variety of sources.
 Analysis: Analyse the data you’ve gathered and then extrapolate information from it. The
information that has been gathered is subsequently converted into knowledge.


 Reporting: The final task is to use the newly gathered knowledge to improve the business and to
prepare a report that helps in making efficient decisions.

13.6 APPLICATIONS OF BI
BI serves in different areas, such as financial, statistical, sales, CRM, etc. The following are the different
areas where the BI is used:
 Application of BI in Financial analysis: The process of evaluating the efficiency and sustainability
of organisations, programs, budgets and other financial processes is known as financial analysis.
Financial analysis is commonly used to determine if a business is stable, solvent, liquid or lucrative
enough to merit monetary investment.
Financial analysis is used to assess economic trends, formulate financial policy, create long-term
business goals and select projects or firms for investment. This is accomplished by combining
financial figures and data. A financial analyst will extensively review the income statement, balance
sheet and cash flow statement of a corporation. Financial analysis may be carried out in both
corporate finance and investment finance contexts. One of the most popular methods of analysing
financial data is to compute ratios from the data in the financial statements and compare them to
those of other firms or the company’s own historical performance.
 Statistical analysis: The collecting and evaluation of data to discover patterns and trends are
known as statistical analysis. It’s a part of data analytics. Statistical analysis may be used for a
variety of purposes, including obtaining research interpretations, statistical modeling and creating
surveys and studies. It can also be beneficial for business intelligence firms that must deal with big
amounts of data.
Statistical analysis in the context of business intelligence (BI) entails collecting and examining every
data sample in a collection of objects from which samples may be derived. In statistics, a sample is
a representative selection derived from a larger population.
The purpose of statistical analysis is to discover patterns. A retail company, for example, can utilise
statistical analysis to discover trends in unstructured and semi-structured consumer data that can
be used to improve the customer experience and enhance sales.
Statistical analysis may be divided into five separate steps, which are as follows:
 Describe the type of data that will be studied.
 Investigate the data’s relationship to the underlying population.
 Develop a model to explain your knowledge of how the data connects to the larger population.
 Demonstrate (or refute) the model’s validity.
 Use predictive analytics to conduct scenarios that will assist in guiding future actions.
 Sales Analysis: Let’s start with a definition of sales analysis. A sales analysis is a comprehensive
study that provides a more in-depth insight into a company’s sales performance, client data and
income. In essence, you may measure and get insights on indicators such as present and historical
sales, upcoming patterns and a plethora of others that are important to your business.


A sales analysis does not have to consist of dry figures or dense paragraphs; it may be represented graphically using bar graphs and charts. Other KPIs to consider include Regional Sales, Average Purchase Value, Sales per Rep, Quote-to-Closure, Cannibalisation rate, Sell-through rate, Lead conversion rate, Product performance, Sales to date, Sales opportunities, Sales objectives and Sales growth.
 Customer relationship management (CRM): Customer Relationship Management (CRM) refers to
the system of building and nurturing a relationship with customers. It involves the services offered
by the company to its customers. Successful businesses depend on identifying the needs of their
customers and fulfilling them with superior products and services.
Organisations conduct market research to collect data about the demographics, needs and buying habits of customers. After analysing this data, organisations can identify the real requirements of the customers. This helps them in achieving a competitive advantage and ensuring customer satisfaction.
Most organisations believe that CRM is all about utilising customers’ data for better marketing.
Organisations often flood customers with advertisements about their products and services without
even knowing their actual requirements. This may upset existing as well as potential customers,
who may refrain from having any further transactions with the organisation. In such a scenario,
organisations require deeper insights and intelligence to improve customer satisfaction. This
knowledge should be applied to understanding the customer’s objective and matching it with the
business objectives, both of which are integral to the success of a business.
Figure 2 shows the uses of a CRM system in an organisation:

[Figure: customer details, personal and service details, consumer behaviour and feedback captured by customer service and marketing executives feed into marketing management, production management and financial management]

Figure 2: CRM System


As seen in Figure 2, with the help of the CRM system, customer service executives can understand
consumer behavior, get personal and service details of customers as well as feedback from them.
Marketing executives can use customer details for managing marketing. Similarly, these details can
be used for managing the production and finance of an organisation.
Most of the current generation of CRM software used for sales and customer services is primitive
and needs to be upgraded with new technologies. There is a requirement for deeper integration and
an intelligent customer service system that fulfils the expectation of the customer, and is not just


a system made by stitching the modules of CRM together. The following points briefly explain how
CRM is linked to BI:
 Guidance: BI often provides advice to the CRM department. Suppose a sales manager needs to
determine how to sell a specific product to a particular customer. In that case, he/she can use a
BI tool such as CRM to revise the CRM guidelines that provide assistance to sales representatives
in communicating with the customers.
 Documentation: Sales managers document CRM data and combine it with their BI data. For
example, a sales manager can collect data about different types of customers who purchased
certain products and services. In addition, he/she can analyse this data to understand the best
way of selling products or services to similar customers in the future.
 Provides software and practices of sales managers: CRM usually provides the software
and practices of sales managers to keep track of customer user accounts. Information on the
requirements, sales histories, feedback and contacts of customers might be fed into the CRM
accounts. BI also helps in the improvement of customer relationships with the help of tools such
as data mining, decision support systems, etc., which enable the analysis of correct information
on customer behavior. These systems also help managers to make informed business decisions.
 Inter-dependence: BI and CRM are dependent on one another as sales professionals often
require information about inter-departmental activities. For example, while working with a new
customer, a salesperson might discover that there is no past information to assist him/her in
determining the most appropriate selling point to use with the customer. In such a scenario,
the sales professional can use BI to understand how to approach the customer on the basis of
data related to other similar customers who are from the same industry, background, income
bracket or region.
 Launch of advertising and marketing campaigns: Another useful association between CRM
and BI facilitates the advertising and marketing department to launch successful campaigns.
Marketing professionals can use CRM data to understand which products or services might
appeal to a particular kind of customer. For example, after researching CRM data, a marketing
professional for a publishing house can use BI tools to know which publications are preferred
by individuals.
 Generate new intelligence: Many professionals comprehend another important association
between CRM and BI in the sense that CRM data can be used to generate new intelligence.
For example, a sales manager may recognize the fact that a particular product can be sold
successfully to people of a certain area. In such a situation, he or she will realize that this data is
valuable for BI. In the future, the sales manager might use this new BI insight to direct sales and
marketing more effectively.

13.7 RESULT PATTERN


A business result pattern is defined as a set of generic solutions that can be created and used in a
particular circumstance (a context) to solve one or more of the underlying difficulties. Result patterns
can be thought of as product prototypes.


Result patterns are used to define your business architecture:


 They represent typical answers to typical situations.
 Result patterns aid in the preservation of an architectural style across the organisation.
 They are a straightforward method of documenting events.

We give a few Result patterns that might be used as a starting point:


 The Result pattern of process assessment
 The Result pattern of process feedback
 The Result pattern of interaction between activities
 The Result pattern of business event-result history.

13.8 RANKING ANALYSIS


Ranking analysis is a straightforward job rating approach in which positions are evaluated based on
an informal appraisal of their overall relevance to the firm. This strategy is quick and cheap, but it
becomes harder to maintain as organisations get larger and more complicated. There are two phases
to rank analysis. First, rank ratings are assigned to observations, generally from smallest to greatest
(the largest values given the largest ranks). Then, using the rank scores, test statistics are computed.
The most popular example is the Kruskal-Wallis analysis (1952). The ranking technique is the most
fundamental type of job evaluation approach. The approach entails evaluating each work in relation
to all other occupations, generally based on an overarching criterion such as ‘job complexity.’ Each job
is compared to others as a whole, and this process is repeated until all jobs have been reviewed and
rated. The fundamental benefit of the ranking evaluation approach and the reason it was created in the
first place is the ability to swiftly determine who is at the top of the class in terms of accomplishing the
company’s core goals. When a researcher wants to create some sort of priority among a group of items,
whether they be policies, qualities, organisations, persons or any other topic or feature of interest, they
employ the ranking question-answer structure.
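
As a brief illustration of a rank-based test, the sketch below applies the Kruskal-Wallis analysis mentioned above using SciPy's stats.kruskal function; the three groups of job-complexity ratings are invented for illustration.

from scipy import stats

# Hypothetical complexity ratings for three groups of jobs
group_a = [7, 9, 6, 8]
group_b = [4, 5, 6, 3]
group_c = [8, 9, 10, 7]

# Kruskal-Wallis computes its test statistic from the ranks of the pooled observations
h_statistic, p_value = stats.kruskal(group_a, group_b, group_c)
print(h_statistic, p_value)  # a small p-value suggests the groups' ranks differ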

13.9 BALANCED SCORECARD


A balanced scorecard is a management planning framework that firms use to prioritise their goods, projects and operations, communicate their purposes or objectives and organise their daily operations.
Organisations will use the scorecard to study and analyse their programmes’ effectiveness to decide
how successful they were.

The balanced scorecard is a statement that evaluates managerial performance in a structured manner.
Key Performance Indicators (KPIs) are used to assess the executive team’s progress in implementing the
plan and achieving the set objectives. To judge the business’s growth rate and how it compares to its
competitors, performance is measured against established objectives.

Other employees further up in the organisational structure may use the balanced scorecard to
showcase their value to an organisation’s growth or their suitability for promotions and pay raises.


The fundamental elements of a balanced scorecard are an emphasis on a critical area relevant to the company and the use of both financial and non-financial data to set goals.

Figure 3 shows the four perspectives of balanced scorecard:

[Figure: the balanced scorecard's four perspectives (Financial, Customer, Business Processes, Organisational Capacity) drive strategy, business actions and the behaviour of direct reports, leading to business outcomes]

Figure 3: Perspective of Balanced Scorecard

The descriptions of these perspectives are as follows:


 Financial viewpoint: Sales, expense and profit data are collected and analysed to determine performance. Actual currency profits (such as Rupees), budget fluctuations, profit declarations and benefits to individuals are types of economic measures that could be used.
The financial goal of a corporation is to ensure that it earns a profit on its investments while also
managing the company’s major risks. Serving the needs of all stakeholders in the company, including
investors, clients and vendors can help achieve the goals.
Shareholders are a vital part of the company since they provide capital; they should be delighted
when the company succeeds financially. They strive to ensure that the company generates steady
revenue and that the organisation achieves goals such as profit maximisation.
 Internal business process viewpoint: The internal operations of a company define how everything functions. A balanced scorecard sets criteria and targets that can assist the firm in functioning more efficiently. In addition, the scorecard aids in evaluating the company's products or services and determining whether they meet client expectations. Answering the question, "What are we good at?" is an important aspect of this viewpoint.
The appropriate response can aid the organisation in developing a marketing strategy and pursuing
technologies that contribute to the development of new and better ways to meet client wants.
 Customer viewpoint: The customer viewpoint examines how well the organisation delivers value to the consumer and assesses consumer satisfaction with the company's goods and services. Customer satisfaction is a barometer of a business's performance. The way a business treats its consumers has a direct impact on its profitability.
The brand reputation in terms of the competition is taken into account by the balanced scorecard.
How do clients see your business in comparison to your competitors? It allows the company to


venture outside of its comfort zone and see itself through the eyes of the client rather than only from
the inside.
Increasing customer satisfaction, boosting the shopping experience for customers and altering the
costs of a company’s primary products are some of the techniques that an organisation can use to
enhance its brand value among customers.
 Organisational capacity viewpoint: Assessments and feedback from employees are gathered to identify skills that need upgrading and information gaps. Personnel must be trained on the most up-to-date techniques and equipment to achieve top performance.
Employees are provided opportunities to learn and grow. The assessment depends on the number of certificates received and the number of weeks of training completed.
Organisational performance is crucial in attaining positive outcomes through optimising goals and
objectives. Employees in the organisation’s departments are supposed to meet high standards in the
context of uniformity, corporate culture, strategic alignment and technical skills.
For the organisation to meet management's goals, an appropriate structure is needed. For example, the company should use cutting-edge technologies to streamline activities and keep processes flowing smoothly.

13.10 BI IN DECISION MODELLING


Decision Modeling (DM) is a subset of decision management that focuses on expressing business choices
using standardised notation for usage by business analysts and subject matter experts rather than
developers.

Business analysts may use decision modeling to create and share detailed documentation of how their business decisions function. A decision model is made up of two views:
 Decision requirements view: The Decision requirements view explains how complicated decisions
are broken down into simpler sub-decisions and what data and business knowledge are needed to
support them. It depicts the decision’s hierarchical structure.
Decision requirements diagrams are an effective tool for demonstrating the integration of business
rules, analytics, optimisation models, business know-how and source data inside an architecture.
They make all of this easier by offering a framework for merging them into a cohesive whole. Subject
matter specialists believe that developing this point of view is an effective approach to eliciting and
solidifying their understanding of critical decision-making issues.
 Decision logic view: The decision logic view defines decision logic. It frequently, but not always, uses Decision Tables to depict how decisions influence their consequences. This enables business analysts to articulate how decisions are made in their firms. Decision models can also express logic in various formats (for example, decision trees or text) or connect with other decision-making methods (e.g., analytics, cognitive models and optimisation engines).
Decision models are about more than simply requirements. Because of the accuracy of the decision
logic perspective, decision models may be immediately performed. As a result, they can be evaluated


against real-world business data and even put into production systems without requiring developers to transform them into code. This model-driven decision architecture technique lowers mistakes and allows for quick changes in decision-making; it provides safe agility for business decision-making and guarantees that your decision-making specification is always clear and consistent with the implementation.

13.10.1 Decision Making Under Uncertainty


A decision is the result of a process in which one option is chosen amongst two or more potential courses of action for the purpose of achieving a goal. A decision is an act of selection in which a manager forms an opinion about what must be done in a specific scenario. Decision making is the process through which an individual or organisation chooses a viewpoint or action from a number of choices.

Decision-making involves three characteristics of human behaviour.


 Cognition: Mind activities related to knowledge
 Conation: The mental action represented by terms like willing, desire and aversion.
 Affection: It is the mental state connected with emotion, feeling, mood and temperament.

There are numerous circumstances or contexts in which decisions are made: certainty, risk, uncertainty, and conflict or competition.

 Decision making under uncertainty: When a choice involves a situation about which the manager has no knowledge, either about the result or about the relative probability of any single outcome, he is said to be functioning under uncertain conditions. Because the manager lacks knowledge on which to base analysis, the most he can do is recognise that he has no way of forecasting the occurrences. In this situation, a variety of criteria have been presented as a potential decision-making basis.
These are listed below.
 Maximising the highest possible payoff: the maximax criterion (optimistic).
 Maximising the smallest possible payoff: the maximin criterion (pessimistic).
 Minimising the decision-maker's highest possible regret: the minimax criterion (regret).
 Maximax criteria: The most optimistic decision-maker will use this choice criterion when he
is optimistic about the outcome of events impacting the decision. If this concept is followed, the
management will select the option that provides the best pay-off.
 Maximin criteria: The most pessimistic decision-maker uses this criterion. The management feels
that the worst-case scenario will occur. This pessimism leads to the choosing of the alternative with
the least favorable payout.
 Minimax criteria: The minimisation of regret is achieved by using the minimax criterion. Managerial regret is defined as the most favorable payoff attainable for a given state of nature minus the payoff of the chosen alternative under that state. When management selects an option and a state of nature occurs that does not result in the most favorable payoff, regret occurs.


 Insufficient reason criterion: This criterion assumes that, without any experience, it is impossible or unnecessary to assign any probability to the states of the circumstance. In this instance, equal probabilities are allocated to the states because there is no basis for doing otherwise. This is known as the insufficient reason criterion or the Laplace criterion. A small worked sketch of these criteria follows.
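
The following minimal Python sketch works through the maximax, maximin and minimax-regret criteria described above; the payoff matrix (alternatives A1 to A3 against three states of nature) is entirely hypothetical.

# Hypothetical payoff matrix: rows are alternatives, columns are states of nature
payoffs = {
    "A1": [40, 100, 60],
    "A2": [70, 80, 20],
    "A3": [50, 50, 50],
}

maximax = max(payoffs, key=lambda a: max(payoffs[a]))      # optimistic criterion
maximin = max(payoffs, key=lambda a: min(payoffs[a]))      # pessimistic criterion

# Regret for each cell = best payoff attainable in that state minus this cell's payoff
best_per_state = [max(column) for column in zip(*payoffs.values())]
regret = {a: [b - p for b, p in zip(best_per_state, row)] for a, row in payoffs.items()}
minimax_regret = min(regret, key=lambda a: max(regret[a]))  # smallest worst-case regret

print(maximax, maximin, minimax_regret)   # -> A1 A3 A1 for this invented matrix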

13.11 OPTIMISATION
In Business Intelligence, business optimisation is the process of increasing an organisation’s efficiency,
productivity and performance. This is true for both internal activities and external goods. Big data
and connected technologies may be utilised to drive new ideas that promote business optimisation,
helping firms to achieve greater success. Improved data modeling, machine learning and artificial
intelligence are assisting firms in identifying best-fit digital solutions and novel techniques that promote
improvement.

The following are important aspects of business optimisation:


 Productivity, efficiency and performance are all measured.
 Identifying areas for development.
 Implementing new procedures and processes.
 Measuring and comparing outcomes
 Repetition of the cycle.

13.12 ETHICS AND BUSINESS INTELLIGENCE


Ethics can be defined as the challenge of determining right from wrong and good from the bad. BI has
raised many ethical questions regarding how to gather information and how to utilise it. We are
moving towards an era in which every BI program should actively manage ethics. More data, different
types of data and advanced analysis of data often raise concerns about data security, privacy, ownership
and anonymity.

These ethical problems arise due to the adoption of BI methodologies, data mining methods, and
decision support systems and should not be ignored. Although the progress in information technology
and knowledge has generated countless opportunities, it has also raised various risks related to data
security. These risks can be prevented and avoided by implementing strict and appropriate control rules
and mechanisms. Improper usage of data by public and private organisations, which do not respect the
right to privacy of individuals, should not be tolerated.

These rules also restrict organisations from taking exclusive and unilateral benefits at the expense of
consumers and workers. However, the possibility of crossing the boundary between the corrective and
intrusive usage of information always remains high, particularly in the field of relational marketing
and Web mining. For example, it is not ethical for organisations to use and circulate any type of
personal information of individuals gathered through websites, links or cookies, with other commercial
organisations without the consent of users. Apart from the respect for the right to privacy, the social
responsibility of enterprises is another ethical issue which has been under discussion for the past few years.


There may be a difference in opinion on whether a company should focus on the short-term maximization
of profits, acting solely in the interest of shareholders or on the social consequences of its decisions.
Resolving conflicts related to the use of data in BI requires acknowledgment, discussion and the hard
work of defining ethics-based policies and developing a culture of ethical conduct.

13.13 CONCLUSION

 Business intelligence (BI) is the process of transforming data from an organisation into insights that
can be utilised to guide business decisions.
 Business intelligence (BI) is a collection of procedures, structures and technology that transform
raw data into useful information that drives lucrative business activities.
 Business intelligence ideas relate to the use of digital computing technologies like data warehousing,
analytics and visualisation to locate and analyse critical business-based data to develop new,
actionable corporate insights.
 Business intelligence system gives information and knowledge gathered from data to decision-
makers.
 BI comprises a broad category of applications and techniques that are used for collecting, storing,
analysing and providing easy access to data to help business users in making better business
decisions.
 Customer Relationship Management (CRM) refers to the system of building and nurturing a
relationship with customers.
 A business result pattern is defined as a set of generic solutions that can be created and used in a
particular circumstance (a context) to solve one or more of the underlying difficulties.
 Ranking analysis is a straightforward job rating approach in which positions are evaluated based
on an informal appraisal of their overall relevance to the firm.
 A balanced scorecard is a management planning framework that firms use to prioritise their goods, projects and operations, communicate their purposes or objectives and organise their daily operations.
 Decision Modeling (DM) is a subset of decision management that focuses on expressing business
choices using standardised notation for usage by business analysts and subject matter experts
rather than developers.
 Ethics can be defined as the challenge of determining the right from wrong and good from the bad.

13.14 GLOSSARY

 Business intelligence (BI): The process of transforming data from an organisation into insights that
can be utilised to guide business decisions.
 Customer Relationship Management: The system of building and nurturing a relationship with
customers.


 Balanced scorecard: A management planning framework that firms use to prioritise their goods, projects and operations, communicate their purposes or objectives and organise their daily operations.
 Decision modeling: A subset of decision management that focuses on expressing business choices
using standardised notation for usage by business analysts and subject matter experts rather than
developers.
 Ethics: The challenge of determining the right from wrong and good from the bad.

13.15 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. What do you understand by business intelligence?
2. Explain the architecture of business intelligence.
3. Define the balanced scorecard.
4. Discuss ethics and business intelligence.
5. List down the applications of business intelligence.

13.16 HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. Business intelligence (BI) is the process of transforming data from an organisation into insights
that can be utilised to guide business decisions. BI analysts will access and analyse datasets using BI
tools, software or services and then transform their findings into reports, summaries, dashboards,
graphs, charts or maps. Refer to Section Introduction
2. Business intelligence is described as a collection of mathematical models and analysis procedures
that employ existing data to develop information and knowledge that can be used in complicated
decision-making processes. A business intelligence system gives information and knowledge
gathered from data to decision-makers. Refer to Section BI Architecture

3. A balanced scorecard is a management planning framework that firms use to prioritise their goods, projects and operations, communicate their purposes or objectives and organise their daily operations. Organisations will use the scorecard to study and analyse their programmes'
effectiveness to decide how successful they were. Refer to Section Balanced Scorecard
4. Ethics can be defined as the challenge of determining the right from wrong and good from the bad.
BI has raised many ethical questions regarding how to gather information and how to utilise that.
We are moving towards an era in which every BI program should actively manage ethics. Refer to
Section Ethics and Business Intelligence


5. BI serves in different areas, such as financial, statistical, sales, CRM, etc. The following are the
different areas where the BI is used:
 Application of BI in Financial analysis: The process of evaluating the efficiency and
sustainability of organisations, programs, budgets and other financial processes is known as
financial analysis. Financial analysis is commonly used to determine if a business is stable,
solvent, liquid or lucrative enough to merit monetary investment.
Refer to Section Applications of BI

13.17 POST-UNIT READING MATERIAL

 https://www.ibm.com/topics/business-intelligence
 http://ijariie.com/AdminUploadPdf/BUSINESS_INTELLIGENCE__CONCEPTS__COMPONENTS_
TOOLS TECHNIQUES BENEFITS_AND_CHALLENGES_ijariie5711.pdf

13.18 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and classmates the concept of business intelligence. Also, discuss the
architecture and process of BI.

UNIT

14 Databases on the Web and Semi Structured Data

Names of Sub-Units

Web Interfaces to the Web, Overview of XML: Structure of XML Data, Document Schema, Querying
XML Data, Storage of XML Data, XML Applications, Semi Structured Data Model, Implementation
Issues, Indexes for Text Data, Lab Exercise

Overview

The unit begins by discussing the concept of Web interface to the Web. Next, the unit discusses the
concept of XML. Further, the unit explains the structure of XML data, document schema, querying
XML data, storage of XML data, XML applications. The unit also explains the semi structured data
model. The unit then discusses the implementation issues. Towards the end, the unit discusses the
indexes for Text Data.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of Web interfaces to the Web
 Explain the concept of XML and XML document structure
 Describe the storage of XML data, XML applications
 Explain the semi structured data model
 Discuss the concept of indexes for Text Data

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of Web interfaces to the Web
 Assess the knowledge about the XML, XML document structure, XML schema
 Evaluate the storage of XML data, XML applications
 Analyse the concept of semi structured data model
 Explore the concept of indexes for text data

Pre-Unit Preparatory Material

 https://matfuvit.github.io/UVIT/predavanja/literatura/TutorialsPoint%20XML.pdf

14.1 INTRODUCTION
A Web database is a database application that can be administered and accessed through the Internet.
Website administrators may control this data gathering and display analytical findings depending on
the data in the Web database application. Web databases debuted in the 1990s and have proven to
be a valuable tool for organisations, allowing them to collect virtually unlimited amounts of data from
an infinite number of clients.

Semi-structured data is a hybrid of structured and unstructured data that combines the best of both worlds. It follows a flexible, partially defined schema, is consistent and exists to save space and provide clarity. Common semi-structured formats include CSV, XML and JSON. NoSQL databases are widely used to manage semi-structured data.
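
As a brief, hypothetical illustration of semi-structured data, the two JSON records below share a broad shape but differ in their optional fields, which is typical of documents handled by NoSQL systems; all field names and values are invented.

import json

records_json = """[
  {"id": 1, "name": "Ambrish", "email": "ambrish@example.com"},
  {"id": 2, "name": "Meera", "phones": ["+91-9800000001"], "address": {"city": "Bengaluru"}}
]"""

# Each record carries its own structure; a field missing from one record simply
# does not appear, rather than being an empty column in a rigid table.
for record in json.loads(records_json):
    print(record.get("name"), record.get("email", "no email on record"))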

14.2 WEB INTERFACES TO THE WEB


Your web user interface, often known as a web interface, is the most basic sort of web administration. This is owing to its point-and-select features, which allow you to get your firewall administration up and running quickly. Put simply, a web interface is an interface with configuration choices on the left side of your browser. Your menu is either DHTML (dynamic hypertext markup language)-based or Java-based. There is no change in functioning, although there are minor differences in appearance and feel.

To design a web interface, you must first realise that it is, by default, configured to function only over HTTP (hypertext transfer protocol). You may also set your web interface to function over HTTPS (secure hypertext transfer protocol). This provides you with the necessary means of safeguarding your online management traffic. With your web interface, you may utilise the majority of common web browsers, including Firefox, Internet Explorer and Chrome.

A web interface is a method that allows you to interact, through your web browser, with a program or content hosted on a remote server. The web server delivers the content of the web page,


allowing you to interact with it using your browser. Your browser fulfils the role of a client. Because of this distributed client-server structure, you can store your material on a remote server in addition to gaining quick access to the content. Because of these types of connections, a massive quantity of data is now available. Webmail, instant chat, online shopping, online document sharing and social networking are among the most commonly used web apps.

14.3 OVERVIEW OF XML


XML is an abbreviation for Extensible Markup Language. It is a text-based markup language derived from the Standard Generalised Markup Language (SGML).

Unlike HTML tags, which indicate how to show the data, XML tags identify the data and are used to store
and organise it rather than determining how to display it. XML will not replace HTML in the foreseeable
future, but it will provide new opportunities by incorporating many effective HTML elements.

XML has three key qualities that make it helpful in a wide range of systems and solutions:
 XML is extensible: XML allows you to design your own self-descriptive tags, or language, that is appropriate for your application.
 XML transports data but does not present it: XML allows you to store data regardless of how it will be presented.
 XML is a public standard: XML was created by the World Wide Web Consortium (W3C) and is available as an open standard.

14.3.1 Structure of an XML Document


XML defines certain rules for its syntax that specify how to create or structure an XML document. The
syntax used to create an XML document is called markup syntax. It is used to define the structure of
data in the document. While creating an XML document, you must remember the following points:
 XML documents must have starting and closing tags
 XML tags are case-sensitive
 XML elements must be properly nested
 XML documents must have one root element
 XML attribute’s values must be enclosed in double quotes

The structure of an XML document is given as follows:


<?xml version="1.0" encoding="ISO-8859-1"?>
<!--This is an Employee element-->
<Employee>
<FirstName>Ambrish</FirstName>
<MiddleName>Kumar</MiddleName>
<LastName>Singh</LastName>
<Age>25</Age>
<EMPID id="A24"> </EMPID>
</Employee>


Let’s understand the structure of preceding XML document in the following sections:
 XML declaration
 XML elements
 XML attributes
 XML tree
 XML comments

Let’s discuss these in detail.

Exploring XML Declaration

The XML declaration statement is used to indicate that the specified document is an XML document.
Although it is not required to have an XML declaration; however, it is considered good practice to include
it. The XML declaration statement is the first line in the document, which defines the XML version and
character encoding. The following code snippet shows how to define the XML declaration statement:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
The attributes of the XML declaration statement are described as follows:
 Version: Represents the version of XML. This attribute must be provided in the declaration.
 Encoding: Specifies the character encoding that the document uses. You can specify encoding
through meta-data (data about data) stored in a file. XML allows a document to specify its own
character-set by defining an encoding declaration inside the XML declaration. This is an optional
attribute in the XML declaration.
 Standalone: Contains two values, yes and no. The yes value specifies that the XML document has
an internal DTD. The no value specifies that the XML document is linked to an external DTD or any
external entity references.

Some important points to note about XML declaration are as follows:


 XML declaration starts with <?xml and ends with ?>
 XML declaration must include the version attribute; however, the encoding and standalone attributes are optional
 XML declaration must be at the beginning of the file
 The version, encoding and standalone attributes must appear in that order

Exploring XML Elements

An XML element is the basic building block of an XML document. Every XML document consists of XML
elements and the content embedded within these elements. A start tag and an end tag delimit an element
in an XML document. A start tag is delimited by the < and > characters and an end tag is delimited by
the </ and > characters.

An example of an XML element is shown in the following code snippet:


<Employees></Employees>


In the preceding code snippet, the name of the element is Employees.

A document must have a single root element, which is the top most parent element in an XML document.
Note that a root element does not have a parent element, but can have one or more child elements. The
following code snippet shows a root element:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Employees>
.
.
.
</Employees>
In the preceding code snippet, <Employees> is a root element. You can add any number of elements in
the <Employees> element. The root element is also known as the document element.

Some rules that must be considered while defining an XML element are as follows:
 Element names can start with letters or the underscore (_) character, but not numbers or other
punctuation characters.
 After the first character, you can use numbers and characters, such as hyphen (-) and period (.), in
the element name.
 Element names cannot contain spaces.
 Element names cannot contain the : (colon) character because it is a reserved character in the XML specification. You should avoid using it in your documents unless you are working with namespaces.
 Element names cannot start with the letters xml, whether in uppercase, lowercase or mixed case.
 There cannot be a space after the opening character (<); the name of the element must come
immediately after it. However, there can be space before the closing character (>).

In XML, elements are divided into two basic categories, empty and nested elements. Let’s learn about
these categories in the following sections.

An empty element does not contain any content or any other element within it. It can have attributes
that help in identifying an entity. An empty element can be written without an end tag. The following
code snippet shows how to define an empty element in an XML document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items>
<item itemid="2451" />
</items>
In the preceding code snippet, the item is an empty element.

Apart from content, an XML element can also contain other elements. The elements which contain other
elements are known as nested elements.

The following code snippet shows a root element containing a nested element having the content Ambrish:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Employees>


<FirstName>Ambrish</FirstName>
</Employees>
In the preceding code snippet, the <FirstName> element is nested in the <Employees> element. The
<FirstName> element contains the content Ambrish. In XML, the elements must be properly nested,
implying the parent element has to be opened before the child element and should be closed after the
child element.

Exploring XML Attributes

XML attributes provide additional information about elements. An attribute is a name-value pair
contained in the element’s start tag, wherein names are separated from values by the assignment
operator. Values of attributes are enclosed either in single quotation marks or double quotation marks.
The following code snippet shows the use of attributes:
<person sex="male" color="dark brown" >
Alan Turing
</person>
In the preceding code snippet, sex and color are the two attributes of the person element. Male is the
value of the sex attribute and dark brown is the value of the color attribute.

Exploring an XML Tree

An XML tree represents the elements of an XML document in a tree structure. The XML tree can contain
the following elements:
 Root element: Contains all other elements and the content of an XML document. In an XML document,
there can be only one root element.
 Parent element: Contains other inherited elements. A parent element can contain multiple child
elements.
 Child element: Refers to an element that is contained in a parent element. A child element cannot
have multiple parent elements. It can only have a single parent.
 Siblings: Refer to elements that share the same parent element.
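
The following short snippet, which reuses the hypothetical Employee example from earlier, illustrates these relationships: <Employees> is the root element, <Employee> is its child and a parent in turn, and <FirstName> and <Age> are children of <Employee> and siblings of each other.

<?xml version="1.0" encoding="UTF-8"?>
<Employees> <!--root element and parent of Employee-->
<Employee> <!--child of Employees and parent of the elements below-->
<FirstName>Ambrish</FirstName> <!--child of Employee, sibling of Age-->
<Age>25</Age> <!--child of Employee, sibling of FirstName-->
</Employee>
</Employees>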

Let’s now learn about XML comments.

Exploring XML Comments

In an XML document, comments are used to specify some information or remarks. A comment is not a part of the document's data; therefore, it is not parsed by the parser. As in HTML, a comment in XML begins with <!-- and ends with -->, as shown in the following code snippet:
<!--This is an Employee element-->
Comments can be written anywhere in the XML document, even before the root element. However, you
cannot write a comment inside a tag or another comment.


14.3.2 Document Schema


Document schemas are the metadata structure’s top level. They provide the Library Administrator
authority over the papers or files contributed to the library. Document schemas allow you to group
similar files together even if they are stored in different locations across the library. Document schemas
govern how files are uploaded to the library and what metadata is gathered about them.

When uploading a file to the system, the user will utilise the Document Schema field to pick the schema
that will be associated with the document. The schema chosen will decide what metadata is necessary
as well as the format in which the metadata will be stored.

Document schemas should be tailored to your specific business operations. They are either linked to a certain kind of document, such as Executive, Compliance or Record, or to individual document schemas, such as Well Reports, Minutes and Packet Attachments. When adding or checking a document into the Library, a Publisher works with these document schemas. See User Roles for additional information on Publisher security rights.

Providing detailed schema names and descriptions will improve the document management system’s
effectiveness. You cannot reuse a specific schema name after you have generated it, even if you have
erased the original schema from the system owing to File Hold’s retention characteristics. Providing
distinct names for each schema also avoids confusion for system administrators and end-users.

Document schemas control the following document attributes:


 General: Define schema naming, formatting and document numbering rules.
 Schema Access: Specify which groups have access to this schema.
 Metadata: Specify which metadata fields should be applied to a document.
 Workflow: Create a review and approval procedure for a document in this schema.
 Courier: Send papers for review and approval to external persons or internal File Hold users.
 Document Naming: Create naming standards for the documents.
 Auto-Filing: In the library, specify the destination folder.
 Event Schedule: Determine when the document should be converted to a record, archived or deleted.
 Database Lookup: Perform a database lookup for all metadata fields in the schema.

Following steps are used to examine document schemas:


1. Navigate to Administration Panel → Library Configuration → Document Schemas in the Web
Client.
Or
Navigate to Administration → Manage Schemas → Document Schemas in FDA.
The initial letter of the schema name may be used to narrow the list of schemas.
2. Clicking on a connected letter will bring up a list of schemas that begin with that letter.


14.3.3 Querying XML Data


XML has the potential to go beyond its roots as a document markup language to become the standard
for data transfer on the Internet. The exchange of electronic data is one of the most eagerly anticipated
applications of XML. Electronic data, unlike conventional Web publications, is primarily meant for
machine consumption rather than human consumption. For example, businesses may share information
about their products and services, which prospective buyers or customers could evaluate and assess
instantly; secret pathways could be used by business partners to share internal process information
among their computer networks; information from similar sources that release relevant data in XML
format, such as market quotations from financial websites and sports scores from news websites, might
be automatically integrated by search robots.

Third parties will have new opportunities to add value by integrating, converting, cleaning and aggregating XML data. It is easy to foresee that, once XML is widespread, many information providers will present their external view as a repository of XML data, regardless of their internal storage systems. The interchange of data between programs will then be in XML format. So, what is the function of a query language in this world? One may think of it as a local supplement to browsing, delivering a more expressive "find" command over one or more retrieved documents. Alternatively, it might act as a beefed-up version of XPointer, allowing for richer kinds of logical references to sections of documents. Neither of these modes of usage is really "database-like." From the standpoint of a database, an XML query language's main purpose is as a tool for structural and content-based queries that allow an application to extract precisely the information it requires from one or more XML data sources.

Requirements for a Query Language of XML

At this stage, we present features for an XML query language derived from its anticipated usage as a query language for the Web, along with an explanation of why each is required:
 Exact semantics: Formal semantics should be present in an XML query language. The formalisation must be adequate to facilitate reasoning about XML queries, such as determining result structure, equivalence and containment. Query equivalence is required for query optimisation, while query containment is important for semantic caching or for assessing whether or not a pushed stream of data may be utilised to answer a certain query.
 Rewritability and optimisations: XML data is frequently created automatically from other forms,
such as relational, object-oriented and special-purpose data. As a result, such XML data will be a
view over data in other models. An XML query over that view should be translatable into the
original data’s query language, rather than having to transform the native data to XML format and
then perform a query. Alternatively, when XML data is native and processed by a query processor,
XML queries, like SQL queries over relational data, must be optimisable.
 Operation of queries: An XML query language must allow the following operations: selection (choosing a document element based on content, structure or attributes), extraction (drawing out specific components of a document), reduction (removing selected sub-elements of an element), combination (merging two or more elements into one) and restructuring (creating a new collection of element instances to contain queried data). All of these operations should be possible within a single XML query; they should not require the use of another language or several XML queries. One reason is that an XML server may not understand the other language, forcing parts of the intended output to be returned to the sender for the remaining processing. Because some of these operations significantly reduce data volumes, it is highly desirable to perform them on the server side to reduce network demands.
 XML output: An XML query should return XML results. There are several advantages to such a closure property: a single query can be used to construct derived databases (views), query composition and decomposition are supported, and applications cannot tell whether they are looking at base data or a query result.
 Semantics of composition: Referential transparency is required for expressions in the XML query
language. That is, the meaning of an expression should be consistent regardless of where it occurs.
Furthermore, expressions with the same result type should be permitted to exist in the same contexts.
Wherever an XML term is anticipated, an expression yielding an XML term should be permitted.
SQL, however, does not meet this latter condition, much to the cost of those who must implement
and use it.
 No schema is necessary: When no schema (DTD) is known in advance, an XML query language
should be useable on XML data. Because XML data is inherently self-describing, an XML query
should be able to depend on such “just-in-time” schema knowledge in its evaluation. This means
that XML queries may be performed against an XML source even if the exact structure of its pages
is unknown.
 Make use of available schema: When DTDs for a data source are given, it should be feasible to
determine if an XML query is appropriately structured in relation to the DTDs and to compute a DTD
for the result. This functionality detects issues at build time rather than run time and provides a
more user-friendly interface for programs to edit query results.
 Maintain order and association: If necessary, XML queries should be able to retain the order and association of items in XML data. The order of items in an XML document might carry crucial information, which a query should not lose. Similarly, the arrangement of sub-elements within elements is frequently important; for instance, an XML query that extracts the items of a purchase order should preserve the order in which those items appear.
 Mutual embedding with XML: XML queries and XML should be mutually embeddable. That is, an XML query should be capable of containing arbitrary XML data and an XML document should be capable of including arbitrary queries. The latter feature enables XML documents to include both
stored and virtual data. The former enables an XML query to contain arbitrary constants and allows
for partial assessment of XML queries. Partial assessment is helpful in a distributed context when
data from one source is delivered to another and mixed with data from that source.
 Support for new datatypes: An XML query language should have an extension mechanism for
data type-specific conditions and operations. One example is the use of specialised processes to pick
various types of multimedia information.


 Metadata-friendly: The XML query language should be beneficial in metadata descriptions. A metadata transfer format for data warehousing transformations or business rules, for example, might include query components. It would be preferable if the XML query language could be utilised in such circumstances rather than establishing a new query language. Another major application of metadata might be in combination with XML to indicate data model constraints.
 Processing on the server: An XML query should be capable of being processed on the server. As
a result, an XML query should be self-contained and not reliant on resources in the context of
its production for assessment. While an XML query may include variables from a local context,
there should be a “bound” form of the XML query that can be run remotely without requiring
communication with its point of origin.
 Manipulation of programs: Programs should be able to create and manipulate XML queries. The
vast majority of queries will not be written directly by users or programmers. Instead, they will be
built via user interfaces or tools in application development environments.
 Representation in XML: An XML query language should be XML-representable. While an XML query
language may have more than one syntax, one of them should be as XML data. This characteristic
implies that extra procedures beyond what is necessary for XML itself are not required to store and
deliver XML queries.

14.3.4 Storage of XML Data


The storing of XML data is required by many applications. One method is to store XML data as documents
in a file system, while another is to create a special-purpose database for storing XML data. Converting
the XML data to a relational form and storing it in a database is also another option. The following are
the different ways of storing XML data:
 Nonrelational database
 Relational database
 SQL/XML

Nonrelational Database

In nonrelational data-storage systems, there are numerous options for preserving XML data:
 Store in flat files: Because XML is basically a file format, a flat file is a suitable storage technique.
Many of the disadvantages of using file systems as the foundation for database applications
apply to this technique. It lacks data isolation, atomicity, concurrent access and security, among
other things. However, due to the widespread availability of XML tools that interact with file data,
accessing and querying XML data stored in files is quite simple. As a result, for some applications,
this storage format may suffice.
 Create an XML database: Databases that employ XML as their primary information model are known as XML databases. Initially, XML databases used a C++-based object-oriented database to handle the DOM. This enables the utilisation of much of the object-oriented database infrastructure while maintaining a conventional XML interface. Declarative querying is enabled with the inclusion of XQuery or other XML query languages. Other systems have built the complete XML storage and querying architecture on top of a transactional storage manager.

Relational Database

Because relational databases are commonly utilised in existing applications, there is indeed a
substantial advantage to storing XML data in relational databases such that current apps can access
the information.

If the data were collected from a relational model in the first place and XML is just seen as a data
communication medium for relational data, transforming XML data into the relational structure is
usually simple.

However, there are several cases in which the XML data is not created from a relational model and it
may be difficult to convert the data to relational form for storage. The storing of XML data in relational
format is complicated by nested elements and elements that repeat.

Some options for storing XML data in relational database are as follows:
 Store as string
 Tree representation
 Map to relations
 Publishing and shredding XML data
 Native storage within a relational database
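
As an illustration of the "map to relations" (shredding) option, the following sketch uses the XMLTABLE function from the SQL/XML facility discussed in the next subsection, as implemented in PostgreSQL, Oracle and Db2. The table, column and element names are illustrative only, and the syntax shown follows PostgreSQL:

-- Each row of orders_xml holds one XML order document in the column doc.
-- XMLTABLE shreds every <item> element of a document into a relational row.
SELECT o.order_id,
       x.sku,
       x.qty
FROM   orders_xml AS o,
       XMLTABLE('/order/item' PASSING o.doc
                COLUMNS sku text    PATH '@sku',
                        qty integer PATH '@qty') AS x;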

SQL/XML

While XML is commonly utilised for information transfer, database systems are still widely utilised to hold structured data. It is often necessary to convert relational data to XML format. To satisfy this need, the SQL/XML standard defines an extension of SQL that allows for the production of nested XML output. A standard manner of mapping SQL types to XML Schema types, a common method of mapping relational schemas to XML schemas and extensions to the SQL query language are all included in the standard.
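
As a brief example of SQL/XML publishing, the following query builds nested XML output from an ordinary relational table using the standard XMLELEMENT, XMLATTRIBUTES, XMLFOREST and XMLAGG functions (available, for instance, in PostgreSQL, Oracle and Db2). The employees table and its columns are assumed purely for illustration:

-- Publish the relational employees table as a single <company> document
-- containing one <employee> element per row.
SELECT XMLELEMENT(NAME "company",
         XMLAGG(
           XMLELEMENT(NAME "employee",
             XMLATTRIBUTES(e.emp_id AS "id"),
             XMLFOREST(e.first_name, e.last_name, e.designation))))
FROM employees AS e;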

14.3.5 XML Applications


XML applications are programs that process and alter data using XML technologies such as XML,
XSLT, XQuery, XML Schema, XPath, Web services and so on. Stylus Studio already has numerous
straightforward tools for dealing with all of the above and now, with an XML pipeline, you can create a
whole XML application from beginning to end. For example, you may visually determine the sequence in which various XML processing steps should occur and you can even debug and deploy the complete
program to your production environment in minutes. In this lesson, we’ll look at how to use an XML
pipeline to create an example XML application.

We’ll create an order report in the following sample XML application. This will entail some XML processing, such as performing different XML operations (converting, parsing, validating, transforming and publishing XML) on a variety of data sources. The resulting order report XML application is shown in Figure 1:

Figure 1: Order Report XML Application

The following stages are involved in the development of the XML application:
1. Creating a book catalogue from a text file
2. Obtaining an order from an EDIFACT file
3. Extracting order information with XQuery
4. Using XSLT to generate an HTML order report
5. Creating an XSL-FO stylesheet with XQuery
6. Using XSL-FO to generate a PDF order report

14.4 SEMI STRUCTURED DATA MODEL


The semi-structured data model is intended to be a development of the relational data model, allowing data to be represented with a flexible structure. Some items may have missing attributes, others may have additional attributes and still others may have two or more repetitions of the same attribute. An attribute’s type is similarly flexible; it might be an atomic value or another record or collection. Furthermore, collections can be heterogeneous, containing elements with varying structures. The semi-structured data model is a self-describing data model that includes both data values and schema components. More generally, data may be described as information that has been translated into a form that is efficient for storage or processing. Data such as video, photos, sounds and text is represented as binary values, either 0 or 1, and patterns of these two digits are used to hold various forms of data. A bit is the smallest unit of data in a computer system and it represents a single value; a byte is made up of eight binary digits.


Data is information that has been turned into binary digital form in today’s computers and transmission
devices. The volume of data created has risen dramatically over the previous decade as the number of
computer users has expanded. As a result, a new phrase has been developed to describe the massive
amount of data that is being generated at an alarming rate. It’s referred to as big data. It is not just the
volume of data that has grown over time.

The variety of data being created is continuously growing in tandem with the volume. As a result, categorising the types of data collected becomes critical. A massive quantity of data is created in this age of the internet. This data can take the form of text, photographs, documents, PDF files, videos, log files and many more formats.

Let us now categorise this massive quantity of data into the main categories listed below:
 Structured Information: Structured data is distinguished from semi-structured data. It is
information that has been purposefully constructed to be easily searchable, quantifiable and well-
organised. It is often kept in relational databases (RDBMS) and expressed in structured query
language (SQL); a standard language established by IBM in the 1970s for interfacing with databases.
Structured data can be entered by humans or machines, but it must comply with a strict framework with predefined organisational properties. Consider a hotel database that can be searched using parameters such as guest name, phone number, room number and others, or Excel files containing data neatly arranged into rows and columns.
 Semi-structured Information: Semi-structured data is a sort of structured data that does not
adhere to the tabular structure associated with relational databases or other types of data tables
but includes tags or other markers to segregate semantic pieces and impose hierarchies of records
and fields within the data. As a result, it is also known as a self-descriptive structure.
Entities of the same class may have various features while being grouped close to each other in
semi-structured data and the order of the attributes is immaterial.
Semi-structured data has grown more common with the advent of the internet, as full-text documents and
databases are no longer the only forms of data. A medium for information sharing is required by
many applications and semi-structured data is frequent in object-oriented databases.
Emails, for example, are semi-structured by Sender, Recipient, Subject, Date and so on or are
automatically categorised into categories like Inbox, Spam, Promotions and so on using machine
learning.
 Semi-structured data is a cross between structured and unstructured data: images and videos, for example, may contain meta tags referring to the place, date or person who captured them, but the information contained within them is unstructured. Consider Facebook, which organises information by Users, Friends, Groups, Marketplace and so on, while the comments and content inside these groupings are disorganised. Because it has a somewhat higher level of organisation, semi-structured data is easier to analyse than unstructured data. It must, however, be broken down using machine learning technologies before it can be examined without human assistance. It also incorporates quantitative data which, like completely unstructured data, may give far more meaningful insights.
 Unstructured Information: Unstructured data, on the other hand, is often open text, photos, videos and other material with no preset arrangement or design. Consider internet reviews, documents and other qualitative data sources on opinions and sentiments. This data is more difficult to examine, but it can be organised to extract insights using machine learning techniques; however, it must first be structured for computers to analyse it.

14.5 IMPLEMENTATION ISSUES


Another possible answer to the semi-structured data problem is to build on existing technologies. After the W3C announced XML in 1998, academic work on semi-structured data was nearly suspended, since there was anticipation that XML would be the answer. XML is now generally recognised and adopted by a variety of groups for a variety of reasons, but it is evident that while XML addresses a significant number of schema-related difficulties, it does not solve the overall problem of semi-structured information. Here, XML refers to the complete W3C XML infrastructure, which includes a collection of uniform technologies and languages.

This infrastructure includes abstract data models (the XML Infoset and the data models underlying XQuery 1.0 and XSLT 2.0), schema languages such as XML Schema, and declarative processing languages such as XPath, XQuery and XSLT. As a standard syntax for information, XML can model the whole range of information, from completely structured data (e.g., bank account information) to natural language. Having a single model for the whole spectrum of information provides enormous advantages for modelling, storage, indexing and automatic processing. There is no need to move from system to system and make incompatible systems communicate with one another when the amount of structure in information is increased or decreased. Another significant benefit of XML is the ability to represent heterogeneous content. Having an abstract information model that extends beyond the entity-relationship model allows you to model a significant amount of data that was previously hard to model. The fact that XML schemas are independent of data is also critical for data and schema evolution; data can exist with or without schemas or with numerous schemas. Schemas may be added after the data has been created; data and schema generation can be freely interleaved. While XML-based solutions offer substantial benefits for handling semi-structured data, they are not a panacea in their current form or in isolation.

The majority of the material is still not in XML format and some of it never will be. The benefits of XML (e.g., complex schemas, mixed content, schema-independent data) inevitably add complexity and problems. Finally, XML-related technologies do not yet provide a full answer. While XSLT (Extensible Stylesheet Language Transformations) and XQuery are strong query and transformation languages, there is still no satisfactory means of expressing imperative logic over such schema-flexible data, nor a language to define complicated integrity constraints and assertions.

Such restrictions will be lifted eventually, but not immediately. A broad solution to the problem of semi
structured information will require concepts and techniques from a variety of domains, including
knowledge representation, XML and markup documents, information retrieval, the Semantic Web
and traditional data management approaches. There is no one-size-fits-all solution to this situation.
Search engine crawlers address one aspect of the semi-structured data problem, namely helping people locate information; basic textual search results can be improved by exploiting contextual, semantic and structural information.


Decoupling data from schemas has a significant influence on all areas of data processing, including
storage, indexing, querying and updating, transactional support and so on. To achieve performance for
such jobs, most existing approaches rely on static schema information. Such strategies must be revisited
to ensure their validity and performance, even in the absence of schema information or with continually
developing schema information (Daniela, 2005). Data has an uneven and incomplete structure in most
cases. Some sources contain an implicit data structure, making it difficult to determine the link between
data. Schema and data are often closely connected, which means they are not just related but also
dependent on one another. The same query may change both the schema and the data, with the schema
being updated on a regular basis. The line between schema and data is highly hazy or ambiguous.
This complicates the design of storage structures. Compared with structured data, storage costs are also higher.

14.6 INDEXES FOR TEXT DATA


Text indexing is a stage in the preparation of text retrieval. Texts are gathered, processed and saved
throughout the text indexing process to allow for rapid and accurate text retrieval. Text retrieval (also
known as document retrieval) is a subset of information retrieval in which information is largely stored
in the form of text. Text retrieval is described as matching a user query against a set of texts. Texts are
rated and displayed to the user as a result of text retrieval based on their relevance to the user query.
User inquiries might range from a few words to multi-sentence complete explanations of the user’s
information requirement.
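
The core data structure behind text indexing is the inverted index, which maps each term to the documents containing it. The following relational sketch is only one possible realisation (the table and column names are assumed for illustration); production systems such as PostgreSQL's full-text search or dedicated search engines use more sophisticated structures and ranking:

-- A postings table acts as the inverted index over a document collection.
CREATE TABLE documents (
    doc_id  INTEGER PRIMARY KEY,
    content TEXT
);

CREATE TABLE postings (
    term   TEXT    NOT NULL,
    doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
    PRIMARY KEY (term, doc_id)
);

-- Text retrieval: find documents that contain every term of the query
-- {'xml', 'database'}; the HAVING clause counts the matched query terms.
SELECT p.doc_id
FROM   postings AS p
WHERE  p.term IN ('xml', 'database')
GROUP  BY p.doc_id
HAVING COUNT(DISTINCT p.term) = 2;

Ranking the retrieved documents by relevance, for example using term frequencies, would require storing additional statistics alongside the postings.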

14.7 LAB EXERCISE


1. Design XML Schema for Company Database and implement the queries.

Ans. The following code shows the company.xml file:


<?xml version="1.0" encoding="ISO-8859-1"?>
<EmployeeData Dataset="889923"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="EmployeeData.xsd">
<CompanyName>Yash Gupta</CompanyName>

<Employee>
<EmployeeFirstName>Ritesh</EmployeeFirstName>
<EmployeeLasttName>Sharma</EmployeeLasttName>
<EmployeeAddress>Hudson Lane, Delhi</EmployeeAddress>
<EmployeeDesignation>Sales Manager</EmployeeDesignation>
</Employee>
<Employee>
<EmployeeFirstName>Devesh</EmployeeFirstName>
<EmployeeLasttName>Gaud</EmployeeLasttName>
<EmployeeAddress>Laxmi Nagar, Delhi</EmployeeAddress>
<EmployeeDesignation>HR Manager</EmployeeDesignation>
</Employee>
</EmployeeData>

15
JGI JAIN
DEEMED-TO-BE UNIVERSIT Y
Advanced Data Management Techniques

The following code shows the EmployeeData.xsd File


<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="EmployeeData">

<xs:complexType>
<xs:sequence>
<xs:element name="CompanyName" type="xs:string"/>
<xs:element name="Employee" maxOccurs="unbounded">
<xs:complexType>

<xs:sequence>
<xs:element name="EmployeeFirstName" type="xs:string"/>
<xs:element name="EmployeeLasttName" type="xs:string"
minOccurs="0"/>
<xs:element name="EmployeeAddress" type="xs: string "/>
<xs:element name="EmployeeDesignation" type="xs: string
"/>
</xs:sequence>

</xs:complexType>
</xs:element>
</xs:sequence>

<xs:attribute name="Dataset" type="xs:string" use="required"/>


</xs:complexType>
</xs:element>
</xs:schema>
The output of the company.xml file can be viewed by opening it in a Web browser or XML viewer, which displays the element tree defined above.


14.8 CONCLUSION

 A Web database is a database application that can be administered and accessed through the
Internet.
 Semi-structured data is a hybrid of structured and unstructured data that combines the best of
both worlds.
 Web interface is a method that allows you to interact with the program or material that you are
seeing through your web browser on a distant server.
 XML stands for Extensible Markup Language. It is a text-based markup language based on the Standard Generalised Markup Language (SGML).
 XML defines certain rules for its syntax that specify how to create or structure an XML document.
 The XML declaration statement is used to indicate that the specified document is an XML document.
 An XML element is the basic building block of an XML document. Every XML document consists of
XML elements and the content embedded within these elements.
 An attribute is a name-value pair contained in the element’s start tag, wherein names are separated
from values by the assignment operator.
 An XML tree represents the elements of an XML document in a tree structure.
 Document schemas are the metadata structure’s top level. They provide the Library Administrator
authority over the papers or files contributed to the library.
 XML has the potential to go beyond its roots as a document markup language to become the standard
for data transfer on the Internet.
 The storing of XML data is required by many applications. One method is to store XML data as
documents in a file system, while another is to create a special-purpose database for storing
XML data.
 XML applications are programs that process and alter data using XML technologies such as XML,
XSLT, XQuery, XML Schema, XPath, Web services and so on.
 The semi-structured data model is intended to be a development of the relational data model,
allowing data to be represented with a flexible structure.
 Text indexing is a stage in the preparation of text retrieval.

14.9 GLOSSARY

 Web database: A database application that can be administered and accessed through the Internet
 Semi-structured data: A hybrid of structured and unstructured data that combines the best of
both worlds
 Web interface: A method that allows you to interact with the program or material that you are
seeing through your web browser on a distant server

17
JGI JAIN
DEEMED-TO-BE UNIVERSIT Y
Advanced Data Management Techniques

 Extensible Markup Language (XML): A text-based markup language based on the Standard
Generalised Markup Language (SGML)
 Text indexing: A stage in the preparation of text retrieval

14.10 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. What do you understand by XML?
2. Define document schema.
3. What exactly is semi-structured?
4. What is Web interface?
5. Explain the concept of storage of XML data.

14.11 HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. XML stands for Extensible Markup Language. It is a text-based markup language based on the Standard Generalised Markup Language (SGML). Unlike HTML
tags, which indicate how to show the data, XML tags identify the data and are used to store and
organise it rather than determining how to display it. XML will not replace HTML in the foreseeable
future, but it will provide new opportunities by incorporating many effective HTML elements. Refer
to Section Overview of XML
2. Document schemas are the metadata structure’s top level. They provide the Library Administrator
authority over the papers or files contributed to the library. Document schemas allow you to group
similar files together even if they are stored in different locations across the library. Refer to Section
Overview of XML
3. The semi-structured data model is intended to be a development of the relational data model, allowing
data to be represented with a flexible structure. Some things may have missing characteristics,
while others may have additional attributes and still, others may have two or more repetitions of
the same attribute. Refer to Section Semi Structured Data Model
4. Web interface is a method that allows you to interact with the program or material that you are
seeing through your web browser on a distant server. Your web server downloads the content of
your web page, allowing you to interact with it using your browser. Refer to Section Web Interfaces
to the Web
5. The storing of XML data is required by many applications. One method is to store XML data as
documents in a file system, while another is to create a special-purpose database for storing XML
data. Converting the XML data to a relational form and storing it in a database is also another
option. Refer to Section Storage of XML Data


14.12 POST-UNIT READING MATERIAL

 https://www.google.co.in/books/edition/Learning_XML/Zilck1_0c5QC?hl=en&gbpv=
1&dq=XML+book&printsec=frontcover

14.13 TOPICS FOR DISCUSSION FORUMS

 Discuss with friends and colleagues the concept of XML. Also, discuss the semi structured data
model.

UNIT

15 Emerging Database Trends

Names of Sub-Units

Advanced Database Models, Spatial Database, Deductive Database, Temporal Database, Multimedia
Database and Sensor Database System

Overview

The unit begins by discussing the concept of emerging database trends. Next, the unit discusses the
advanced database model and spatial database. Further, the unit explains the concept of deductive
database and temporal database. The unit also explains the multimedia database. Towards the end,
the unit discusses the sensor database system.

Learning Objectives

In this unit, you will learn to:


 Discuss the concept of emerging database trends
 Explain the concept of advanced database model and spatial database
 Describe the deductive database and temporal database
 Explain the multimedia database
 Discuss the concept of sensor database system

Learning Outcomes

At the end of this unit, you would:


 Evaluate the concept of emerging database trends
 Assess the knowledge about the advanced database model and spatial database
 Evaluate the deductive database and temporal database
 Analyse the concept of multimedia database
 Explore the concept of sensor database system

Pre-Unit Preparatory Material

 http://cis.csuohio.edu/~sschung/IST331/Coronel_PPT_Ch05.pdf

15.1 INTRODUCTION
Computers have been around for a long time. Every day, new technologies and projects arise. We must keep an eye on upcoming trends to grasp present technology and have a better understanding of the advancements around us. Almost every day, new technologies are introduced. Some of these fail and fade away over time, while others thrive, survive and earn user attention. Emerging trends are cutting-edge technologies that gain popularity and establish a new trend among consumers. This unit introduces some emerging trends that will have a significant influence on the digital economy and engagement in digital communities.

15.2 ADVANCED DATABASE MODELS


A data model is similar to a plan for a structure created by an engineer. It is a method for presenting intricate software concepts in the form of a simple diagram. The diagram uses text and symbols to explain how the data will move. It is frequently referred to as the design guide for creating new software or redesigning current programmes.

Advanced modelling can be used by the application developer to include components that are not built into the programme by default. This can help in understanding the views, attributes and authorisation structure of the project.

The advantages of the advanced database models are as follows:


 With the use of data modelling tools, businesses can standardise data definitions, language,
concepts and formats.
 Better technologies are produced as a result of collaboration between the data management team
and the business stakeholders, which is required for data modelling.


 By giving database designers a specific framework to work from, data modelling speeds up their
work and reduces the likelihood that they will make mistakes that require modifications later on.
 Effective data modelling allows businesses to make effective usage of existing data, which could also
lead to higher productivity, business growth and a competitive edge over competitors.

15.2.1 Relational Model


The relational model, in which all data is specified in terms of tables and columns, is the most widely known and utilised data model today. RDM enables the definition of a database in SQL, the popular relational database language. A pure relational model establishes relationships by comparing column values in one table with common values in another. A common method for speeding up such comparisons is indexing; the B+ tree, AVL tree, R-tree and hash index are all implemented in RDM.
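
A small, illustrative SQL sketch of these ideas is shown below: two tables are related purely by matching column values, and an index on the joining column speeds up the comparison. The schema is invented for illustration:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  DATE    NOT NULL
);

-- A B-tree index on the joining column makes the value comparison fast.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- The relationship is expressed only by comparing common column values.
SELECT c.name, o.order_id, o.order_date
FROM   customers AS c
JOIN   orders    AS o ON o.customer_id = c.customer_id;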

15.2.2 Network Model


A network model exists beneath the relational model in an RDM database, where all data is specified
in terms of record types and fields. Fields can be indexed, and record types can have set relationships
specified as one-to-many, owner/member relationships. Set relationships take up space in the records
that are kept in data files. The owner record will include links to the member records. Pointers to the
owner record, as well as the next and prior members, will be included in member records. This enables
rapid traversal among set members and is essentially very similar to a doubly-linked list data structure.

To express relational equal-joins, RDM uses the set concept, as shown in Figure 1:

(Figure 1 depicts an owner record linked to a chain of member records.)

Figure 1: Relational Equal-Joins
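
To make the owner/member pointer layout concrete, the following SQL sketch mimics the set structure with explicit pointer fields. This is only an illustration of the doubly-linked arrangement described above, not RDM's actual on-disk record format:

-- The owner record points to the first member of its set.
CREATE TABLE owner_record (
    owner_id     INTEGER PRIMARY KEY,
    first_member INTEGER              -- head of the member chain
);

-- Each member points back to its owner and to its neighbours in the set,
-- mirroring a doubly-linked list.
CREATE TABLE member_record (
    member_id INTEGER PRIMARY KEY,
    owner_id  INTEGER REFERENCES owner_record(owner_id),
    prev_id   INTEGER,               -- previous member in the set
    next_id   INTEGER                -- next member in the set
);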

15.3 SPATIAL DATABASE


Spatial data is linked to geographic places such as cities, villages and so on. A spatial database is designed to store and retrieve data that represent spatially referenced objects, that is, items that have a geometric definition.


Many of the disadvantages of static data files are addressed by spatial databases. Spatial databases may
store a vast quantity of data in several tables using connecting techniques that ensure data integrity.
They can impose data input constraints to limit the acquisition of conflicting data. As they expand
in size, they can span numerous physical devices and keep duplicates of themselves for redundancy.
Spatial databases may also keep track of who is making modifications to a collection of spatial data, for permission control and auditing. It is critical to distinguish between
traditional databases and spatial databases. Standard databases that have been enhanced to handle
geographic data types and queries are known as spatial databases. Geometric feature geometry that
represents form and position is stored in spatial data types.

The geometry of spatial features is compressed and saved in a binary field with the attribute data
that describes the feature. Furthermore, the database application code is enhanced so that ordinary
searches using alphanumeric characters and logical operators may make use of position, proximity,
and topology. In a normal customer database, for example, a corporation may query for all clients
whose last name begins with a specific letter. In a spatial database, users may search for all consumers
within a certain radius of a business or locate clusters of customers. These kinds of spatial interactions
aren’t seen in most databases.
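
The radius search described above can be written as follows, assuming PostgreSQL with the PostGIS extension (introduced later in this section). ST_DWithin selects rows whose geometry lies within the given distance, and a spatial index keeps the search efficient; the table, coordinates and distance are purely illustrative:

-- Customers stored with a geographic point location.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    geom        geography(Point, 4326)
);

-- A spatial (GiST) index supports proximity queries efficiently.
CREATE INDEX idx_customers_geom ON customers USING GIST (geom);

-- All customers within 5 km of a store at the given longitude/latitude.
SELECT c.customer_id, c.name
FROM   customers AS c
WHERE  ST_DWithin(c.geom,
                  ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
                  5000);  -- distance in metres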

Spatial databases existed long before the Web, but they were frequently concealed behind firewalls
or authentication, making them inaccessible to the majority of users. Spatial databases may become
targets for hackers since they are often located on powerful servers that are always linked to the
Internet. Malicious users may attempt to get unauthorised access to the database notwithstanding
user authentication and read-only access by exploiting flaws in the database or operating system
programming code. That was the case when SQL Slammer utilised a buffer overflow attack to seize
control of over 75,000 SQL servers, bringing the Internet to a halt (Microsoft, 2003).

Database options from traditional software providers like Microsoft, Oracle and IBM may all be utilised
for computing and GIS applications. Database software is sometimes too expensive to implement in
terms of both the cost of the program and the work required to implement it. Open-source software
has immensely helped GIS software, and spatial databases are no exception. Open-source software
provides various free and powerful geographic database choices for academics testing applications
on a shoestring budget. PostgreSQL, together with PostGIS, is the most widely used open-source spatial
database. A more adaptable “NoSQL” database is a new database trend.

These databases, sometimes known as “document databases,” store data in a manner similar to a JSON
file. These databases lack the schemas that ensure data integrity in traditional databases. Schemaless databases are less strict and offer more flexibility in terms of the data they can retain. This flexibility enables the storing of disparate data but also requires that queries be constructed to deal with disparate data. MongoDB, the most popular schemaless open-source spatial database platform, provides a growing range of spatial features.

In general, geographical data may be classified into two types:


 Vector data: This type of information is represented by discrete points, lines and polygons.
 Raster data: This data is represented as a matrix of square cells.


The spatial data in the form of points, lines, polygons, etc. is used by many different databases as shown
in Figure 2:

(Figure 2 shows spatial data in the form of points, lines and polygons being used by databases such as Oracle, SQL Server, Sybase and DB2.)

Figure 2: Spatial Data

15.4 DEDUCTIVE DATABASE


A deductive database contains not just facts but also general principles, which may be used to derive
new facts that are not explicitly reflected in the database; data can therefore be stored implicitly.

Take a look at the following deductive database: “Bill’s father’s name is John.” “Derek’s father is Bill.” “IF x is y’s father AND y is z’s father, THEN x is z’s grandfather.”

In addition to the two facts, this database includes a rule that identifies a pair of entities’ grandfather
connections.

The database can verify the assertion “John is Derek’s grandfather” by substituting x = “John,” y = “Bill,” and z = “Derek” into the rule. Rules of this type have the effect of defining new relations that are not explicitly recorded in the database. These are referred to as implicit or virtual relations. Explicitly stored relations, such as the father relation described above, are called base relations. The set of virtual relations is referred to as the intensional database (IDB), while the set of base relations is referred to as the extensional database (EDB). First-order logic underpins deductive database systems. With first-order logic, we can represent both facts and rules about facts using the same vocabulary.

One language for expressing logical propositions is first-order predicate calculus. For example, the representation of the database above in first-order logic would be:

father(John, Bill)
father(Bill, Derek)
(∀x ∀y ∀z) (father(x, y) ∧ father(y, z) → grandfather(x, z))

where ∧, → and ∀ represent conjunction, implication and the quantifier “for all”, respectively. A deductive database is an advanced database that has been enhanced with an inference system. Deductive databases arose from the need to merge logic programming with relational databases to build systems that support a sophisticated formalism.

The immediate benefit is that it may lower the amount of space required to store data. In a deductive database system, rules are often specified using a declarative language, that is, a language in which we state what we want rather than how to achieve it. By interpreting these rules, an inference engine (or deduction mechanism) within the system can deduce new facts from the database. Deductive databases have not found widespread adoption outside academia, but some of their concepts are used in today’s relational databases to support the advanced features of more recent SQL standards.
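
One place where these deductive ideas surface in relational systems is recursive SQL. The sketch below stores the father facts as a base (extensional) relation and derives the implicit ancestor relation, which includes the deduced grandfather fact; the table and column names are illustrative:

CREATE TABLE father (
    parent TEXT NOT NULL,
    child  TEXT NOT NULL
);

INSERT INTO father VALUES ('John', 'Bill'), ('Bill', 'Derek');

-- The grandfather rule as a simple self-join over the base facts.
SELECT f1.parent AS grandfather, f2.child AS grandchild
FROM   father AS f1
JOIN   father AS f2 ON f2.parent = f1.child;

-- The more general ancestor relation, derived recursively (a virtual relation).
WITH RECURSIVE ancestor(anc, descendant) AS (
    SELECT parent, child FROM father
    UNION ALL
    SELECT a.anc, f.child
    FROM   ancestor AS a
    JOIN   father   AS f ON f.parent = a.descendant
)
SELECT * FROM ancestor;   -- includes the deduced pair (John, Derek)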

15.5 TEMPORAL DATABASE


A temporal database contains information about certain points in time. It provides temporal data types
and retains information about the past, present and future. Databases often only include information
about the present situation and not so much about earlier situations.

Information concerning changes in the real world throughout time is kept in temporal databases. A
database with built-in capabilities for managing time-related data is known as a temporal database. It
maintains data on all occurrences’ past, present and future times.

The temporal aspects more explicitly include:


 Valid time: Refers to the span of time during which a fact is true in the real world.
 Transaction time: Refers to when a fact was recorded in the database.
 Decision time: Refers to the moment when a decision regarding a fact was made.

The different types of temporal databases are as follows:


 Uni-temporal: A uni-temporal database contains a single time axis, which can be either the validity
range or the system time range.
 Bi-temporal: A bi-temporal database has two-timed axes:
 Validity period
 Transactional or decisional time
 Tri-temporal: A tri-temporal database contains three-time axes:
 Validity period
 Transaction period
 Decision period

This approach introduces additional complexity.

Temporal databases are in contrast to current databases (not to be confused with currently available databases), which store only the facts that are believed to be true at the present moment.

Temporal databases facilitate the management and access to temporal data by including one or more
of the following features:
 A time period data type, which includes the ability to describe time periods that have no end date
(infinity or forever).
 The ability to establish valid and transactional time period properties, as well as bi-temporal
relationships.
 Transaction time is kept by the system.


 Temporal primary keys, with non-overlapping period constraints.
 Time constraints, such as non-overlapping uniqueness and referential integrity.
 Automatic splitting and merging of time spans when updating and deleting temporal data.
 Temporal queries at the present moment, at time points in the past or future, or over entire time spans.
 Time period query predicates, frequently based on Allen’s interval relations.
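
A minimal valid-time sketch in plain SQL is given below. Each row carries its own validity period, a point-in-time query reconstructs the state at a chosen date, and the standard OVERLAPS predicate expresses one of Allen’s interval relations. The schema and dates are assumed for illustration:

CREATE TABLE employee_salary (
    emp_id     INTEGER NOT NULL,
    salary     NUMERIC NOT NULL,
    valid_from DATE    NOT NULL,
    valid_to   DATE                   -- NULL means "valid until further notice"
);

-- Point-in-time (valid-time) query: the state of the world on 1 June 2023.
SELECT emp_id, salary
FROM   employee_salary
WHERE  valid_from <= DATE '2023-06-01'
  AND (valid_to IS NULL OR valid_to > DATE '2023-06-01');

-- Rows whose validity period overlaps the second quarter of 2023.
SELECT emp_id, salary
FROM   employee_salary
WHERE  (valid_from, COALESCE(valid_to, DATE '9999-12-31'))
       OVERLAPS (DATE '2023-04-01', DATE '2023-07-01');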

15.6 MULTIMEDIA DATABASE


The term “multimedia database” refers to a set of associated multimedia content that may comprise text,
graphics, photos, animations, video, audio and other types of data from numerous sources. Multimedia
database management system is a structure that organises various forms of multimedia content that
can be provided, saved, and used in many ways.

The multimedia database is divided into three subcategories:


 Dimensional media
 Dynamic media
 Static media

The multimedia database contains multimedia data as well as related information. The details are as follows:
 Media data: Refers to multimedia data, which includes photographs, movies, audios, animation, etc.
 Media format data: The media format data includes structuring details for the media data, including
sampling rate, refresh rate and encoding technique, among others.
 Media keyword data: This includes keyword information relevant to the database’s media content.
The keyword information for an image may include the image’s date, duration, summary, etc.
 Media feature data: The media feature data provide a description of the multimedia data’s
characteristics. The colours and patterns in an image are examples of feature data for that image.

Multimedia databases still face a number of difficulties, including the following:


 Modelling: Documents are a specialised area that requires particular consideration, since work in this area can enhance databases rather than just information-retrieval techniques.
 Design: The conceptual, logical and physical design of multimedia databases has yet to be properly resolved, as efficiency and optimisation concerns at each level are significantly more complex; a multimedia database typically includes a range of file formats, such as GIF, PNG and MPEG, which are hard to convert from one format to another.
 Storage: Storing multimedia data on standard discs raises issues of representation, encoding, mapping to device hierarchies, archiving and buffering during input/output. A DBMS’s “BLOB” (Binary Large Object) feature allows untyped bitmaps to be saved and retrieved; a simple table layout along these lines is sketched after this list.


 Performance: Physical constraints dominate in applications involving video streaming or audio-video synchronisation. Although such approaches have not yet reached their full potential, the use of parallel processing may solve some issues. In addition, multimedia databases consume a great deal of bandwidth and processing power.
 Queries and retrieval: Retrieving multimedia content such as images, video and audio in response to a query raises a number of challenges that must be addressed, including effective query formulation, query execution and query optimisation.
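
The storage approach mentioned under the Storage point above can be sketched as a single table in which the raw media sits in a BLOB column while format, keyword and feature data occupy ordinary columns. The layout is illustrative only; the binary type is spelled BLOB in most systems and BYTEA in PostgreSQL:

CREATE TABLE media_assets (
    asset_id     INTEGER PRIMARY KEY,
    media_type   TEXT NOT NULL,      -- e.g. 'image', 'audio', 'video'
    encoding     TEXT,               -- media format data, e.g. 'PNG', 'MPEG-4'
    sample_rate  INTEGER,            -- media format data, where applicable
    keywords     TEXT,               -- media keyword data (date, summary, ...)
    feature_data TEXT,               -- media feature data (colours, patterns, ...)
    content      BLOB                -- the untyped bitmap itself
);

-- Keyword-based retrieval; content-based retrieval would match feature_data.
SELECT asset_id, media_type
FROM   media_assets
WHERE  keywords LIKE '%sunset%';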

The multimedia database is used in the following areas:


 Management of documents and records: Industries and organisations that maintain a wide range of documents and detailed records.
 Knowledge dissemination: Multimedia databases are a very effective instrument for knowledge dissemination, offering a variety of learning materials.
 Training and education: Multimedia learning resources, which are currently very prominent educational tools, can be used to create computer-aided educational materials; digital libraries are one example.
 Real-time control and monitoring: Information presented in multimedia formats can be monitored and controlled very effectively when active database technology is used.

15.7 SENSOR DATABASE SYSTEM


A sensor database contains both stored data and sensor data. Stored data comprise the set of sensors that participate in the sensor database, as well as sensor attributes (e.g., location) and physical environment parameters. Relations are the natural way to represent these stored facts. The question is how sensor data should be represented. First, sensor data are produced by signal processing functions. Second, the representation we choose for sensor data should make it easy to formulate sensor queries (data collection, correlation in time and aggregates over time windows).

It is important to note that time is an important factor. Signal processing routines may return output
repeatedly throughout time, with each output having a timestamp. Furthermore, monitoring queries
impose limits on sensor data time-stamps; for example, Query 3 in Example 1 implies that the aberrant
temperatures are identified either immediately or within a specific time range. Aggregates across time
periods, such as Query 4 and 5, make explicit use of time.

Given these constraints, we describe sensor data as a time series. Our representation of sensor time series is based on the sequence model of Seshadri et al. A sequence is described informally as a 3-tuple consisting of a set of records R, a countable totally ordered domain O (referred to as the ordering domain; the elements of the ordering domain are referred to as positions), and an ordering of R by O. An ordering of a collection of records R by an ordering domain O is defined as a relationship between O and R in which each record in R is associated with some position in O. Sequence operators are n-ary mappings on sequences that operate on a certain number of input sequences to produce a single output sequence. All sequence operators are computable.


Sequence operators include select, project, compose (natural join on the position) and aggregates over a set of positions. Due to space limitations, learners are advised to consult a formal description of sensor time-series data.

Sensor data is represented as a time series having the following properties:


 The set of records corresponds to the time-varying outputs of a signal processing algorithm.
 The ordering domain is a discrete time scale, i.e., a collection of time quantum, with a position
assigned to each time quantum. We employ natural numbers as the time-series ordering domain
in the remainder of the research. Each natural integer denotes the number of time units that have
elapsed between a specified origin and any (discrete) point in time. We assume that clocks are
synced, such that all sensors operate on the same time scale.
 All signal processing function outputs created during a time quantum are linked with the same
point p. It should be noted that if a sensor does not create events within the time period associated
with a position, the Null record is associated with that position.
 The base sequence is updated at the position corresponding to the time of production whenever a signal processing function produces an output. As a result, sensor time-series updates take place in increasing position order.
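
To illustrate the kind of windowed aggregate mentioned earlier, the following sketch stores each sensor output under its position on the discrete time scale and averages readings over consecutive 60-quantum windows. The table, column names and window size are assumptions made for the example, and integer division is assumed to behave as in PostgreSQL:

CREATE TABLE sensor_readings (
    sensor_id INTEGER NOT NULL,
    position  INTEGER NOT NULL,      -- time quantum on the discrete time scale
    value     NUMERIC,               -- output of the signal-processing function
    PRIMARY KEY (sensor_id, position)
);

-- Average reading per sensor over consecutive 60-quantum time windows.
SELECT sensor_id,
       position / 60 AS window_no,   -- integer division groups 60 positions
       AVG(value)    AS avg_value
FROM   sensor_readings
GROUP  BY sensor_id, position / 60
ORDER  BY sensor_id, window_no;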

15.8 CONCLUSION

 Emerging trends are cutting-edge technologies that gain popularity and establish a new trend
among consumers.
 A Data Model is similar to a plan for a structure created by an engineer.
 Advanced modelling can be used by the application developer to include components that are not
built into the programme by default.
 The relational model, in which all data is specified in terms of tables and columns, is the most widely
known and utilised data model today.
 A network model exists beneath the relational model in an RDM database, where all data is specified
in terms of record types and fields.
 Spatial data is linked to geographic places such as cities, villages and so on.
 A deductive database contains not just facts but also general principles, which may be used to derive
new facts that are not explicitly reflected in the database; data can therefore be stored implicitly.
 A temporal database contains information about certain points in time.
 The term “multimedia database” refers to a set of associated multimedia content that may comprise
text, graphics, photos, animations, video, audio and other types of data from numerous sources.
 A sensor database contains both stored and sensor data. Stored data comprise the set of sensors
that participate in the sensor database, as well as sensor attributes (e.g., location) or physical
environment parameters.


15.9 GLOSSARY

 Relational model: A model in which all data is specified in terms of tables and columns; it is the most widely known and utilised data model today
 Spatial database: It is designed to store and retrieve data that represent things. These are the items
that have a geometric definition
 Deductive database: It contains not just facts but also general principles, which may be used to
derive new facts that are not explicitly reflected in the database; data can therefore be stored
implicitly
 Transaction time: It refers to when a fact was recorded in the database.
 Sensor database: It contains both stored and sensor data.

15.10 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. What do you understand by advanced database models?
2. Define the concept of spatial database.
3. Explain the temporal database.
4. What is the multimedia database?
5. Define the concept of sensor database system.

15.11 HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. Advanced modelling can be used by the application developer to include components that are not
built into the programme by default. This can help in understanding the views, attributes and
authorisation structure of the project. Refer to Section Advanced Database Models
2. Spatial data is linked to geographic places such as cities, villages and so on. A spatial database is
designed to store and retrieve data that represent things. These are the items that have a geometric
definition. Refer to Section Spatial Database
3. A temporal database contains information about certain points in time. It provides temporal data
types and retains information about the past, present and future. Databases often only include
information about the present situation and not so much about earlier situations. Refer to Section
Temporal Database
4. The term “multimedia database” refers to a set of associated multimedia content that may comprise text, graphics, photos, animations, video, audio and other types of data from numerous sources. A multimedia database management system is a structure that organises various forms of multimedia content so that it can be provided, saved and used in many ways. Refer to Section Multimedia Database
5. A sensor database contains both stored and sensor data. Stored data comprise the set of sensors
that participate in the sensor database, as well as sensor attributes (e.g., location) or physical
environment parameters. Refer to Section Sensor Database System

15.12 POST-UNIT READING MATERIAL

 https://www.safe.com/blog/2021/11/7-spatial-databases-enterprise/
 https://people.cs.pitt.edu/~chang/231/y08/y08sem/semkul/00621580.pdf

15.13 TOPICS FOR DISCUSSION FORUMS

 Discuss with your friends and colleagues the advanced data models. Also, discuss the concept of
spatial, deductive, temporal and multimedia database.
