Advanced Data Management Techniques
UNIT
01 Introduction to Data Management Techniques
Names of Sub-Units
Introduction to Data Management, Types of Data, Overview of Data Models, Introduction to OLTP,
Dimensional Modeling Life Cycle, New Trends in Computing, Motivation and Need for Advanced Data
Management Techniques.
Overview
The unit begins by explaining the types of data. Further, the unit provides an overview of data models. The unit also familiarises you with the importance of OLTP and its limitations. Towards the end, you will be acquainted with the dimensional modelling life cycle and the motivation and need for advanced data management techniques.
Learning Objectives
Learning Outcomes
1.1 INTRODUCTION
Data management is the process of absorbing, storing, organising and preserving an organisation’s
data. Effective data management is a critical component of establishing IT systems that operate business
applications and offer analytical information to enable corporate executives, business managers and
other end-users to drive operational decision-making and strategic planning.
The data management process consists of a number of tasks that work together to ensure that the data
in business systems is correct, available and accessible. The majority of the needed work is done by IT
and data management teams, but business users often engage in various portions of the process to
ensure that the data fulfils their needs.
Data management has become increasingly important as firms face a growing number of regulatory
compliance obligations. Furthermore, businesses are gathering ever-increasing amounts of data and
a broader range of data types, both of which are hallmarks of the big data platforms that many have
adopted. Without proper data management, such settings may become cumbersome and difficult to
traverse.
Since data is so crucial in our lives, it is critical that it is correctly stored and processed without any
errors. When dealing with datasets, the type of data plays an essential role in determining which pre-
processing approach would work best for a certain set to obtain the best results. In other words, data
type is basically a trait associated with a piece of data that instructs a computer system on how to
interpret its value. By understanding data types, one can ensure that data is collected in the desired
format and the value of each property is as expected.
1.2 TYPES OF DATA
[Figure: Types of data: Nominal, Ordinal, Discrete and Continuous]
Another way of categorising data is the way it is arranged. Under this category, data is classified as
structured, semi-structured and unstructured. Let us understand them in detail.
1.2.1 Structured
Structured data is arranged into a prepared repository, usually a database, so that its pieces may be
addressed for more efficient processing and analysis. In a database, for example, each field is distinct,
and its data may be accessed alone or in conjunction with data from other fields, in a number of
combinations. The database’s strength is its capacity to make data comprehensive enough to offer
relevant information. SQL (Structured Query Language) is a database query language that allows a
database administrator to interface with the database.
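As a minimal illustration (the customers table and its columns here are hypothetical), individual fields of structured data can be queried directly with SQL:

-- Retrieve two specific fields, filtered and ordered; only the named columns are touched.
SELECT name, city
FROM customers
WHERE city = 'Bengaluru'
ORDER BY name;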
Structured data is distinguished from unstructured and semi-structured data. The three types of data
may be thought of as being on a scale with unstructured data being the least formatted and structured
data being the most formatted. As data becomes more organised, it becomes more receptive to
processing.
1.2.2 Semi-structured
Semi-structured data is arranged into a specific repository, such as a database, but includes associated
information, such as metadata, that makes it easier to process than raw data. In other words, semi-
structured data falls somewhere in between, which means it is neither fully structured nor unstructured.
It is not structured in a way that allows for advanced access and analysis; nonetheless, it may include
information connected with it, such as metadata tagging that allows components contained to be
addressed. A Word document, for example, is commonly assumed to be unstructured data. However, metadata tags
can be added in the form of keywords that reflect the document’s content; as a result, the document can
be easily found if those keywords are searched for. The data is now semi-structured. Nonetheless, the
document lacks the database’s deep arrangement and hence falls short of being completely organised
data.
1.2.3 Unstructured
Unstructured data comes in a wide range of formats, including Word documents, email messages, PowerPoint presentations, survey replies, contact
centre transcripts, and blog and social media postings. Images, audio, and video files are examples of
unstructured data. Machine data is another type of unstructured data that is rapidly developing in many
organisations. For example, log files from websites, servers, networks, and apps, particularly mobile
applications, provide data related to activity and performance. Furthermore, businesses are increasingly
capturing and analysing data from sensors on industrial equipment and other IoT-connected devices.
Data models include vital information for organisations because they define the relationships between
database tables, foreign keys, and the events involved. The three fundamental data model styles are as
follows:
Conceptual data model: A conceptual data model is a logical representation of database ideas
and their connections. The goal of developing a conceptual data model is to define entities, their
properties, and their connections. There is very little information about the actual database structure
at this level of data modelling. A conceptual data model is often created by business stakeholders
and data architects.
Logical data model: The logical data model is used to specify the structure of data pieces as well as
their connections. The elements of the conceptual data model are supplemented by the logical data
model. The benefit of adopting a logical data model is that it serves as a basis for the Physical model.
The modelling framework, on the other hand, remains general.
There are no primary or secondary keys declared at this level of data modelling. You must validate and
adjust the connection details that were defined earlier for relationships at this data modelling level.
Physical data model: A physical data model specifies how the data model is implemented in a
database. It provides database abstraction and aids in the creation of the schema. This is due to
the abundance of meta-data that a physical data model provides. By duplicating database column
keys, constraints, indexes, triggers, and other RDBMS characteristics, the physical data model aids
in visualising database structure.
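As a brief sketch of the kind of detail a physical data model captures (the table and column names below are hypothetical), the following DDL declares primary keys, a foreign key, a check constraint and an index:

-- Hypothetical schema fragment illustrating a physical data model.
CREATE TABLE department (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(100) NOT NULL,
    salary   DECIMAL(10, 2) CHECK (salary >= 0),
    dept_id  INT,
    FOREIGN KEY (dept_id) REFERENCES department (dept_id)
);

-- Index supporting frequent lookups by department.
CREATE INDEX idx_employee_dept ON employee (dept_id);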
OLAP software is used to do multidimensional analysis on huge amounts of data from a data warehouse,
data mart, or other unified, centralised data repository at fast speeds. Most corporate data have
several dimensions, that is, multiple categories into which the data is divided for display, monitoring
or analysis. Sales data, for example, may include various aspects linked to place (region, nation, state/
province, shop), time (year, month, week, day), product (clothing, men/women/children, brand, kind),
and other factors. However, data sets in a data warehouse are kept in tables, each of which may organise
data into just two of these dimensions at a time. OLAP takes data from various relational data sets and
reorganises it into a multidimensional structure for quick processing and in-depth analysis.
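As a rough illustration of analysing data along more than one dimension at a time (assuming a hypothetical sales table with region, product and amount columns, and MySQL's WITH ROLLUP extension), a single aggregation can produce per-product figures, per-region subtotals and a grand total:

-- Hypothetical fact table; WITH ROLLUP adds subtotal and grand-total rows.
SELECT region,
       product,
       SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH ROLLUP;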
Graph databases: Patterns and correlations may be defined in graph databases using mathematical
structures that reflect pairwise relationships between items. Graph databases outperform typical
relational databases when dealing with huge datasets. Graph databases are made up of nodes or
data items that are linked together. These connections or commonalities endure between nodes.
This can be useful for merchants in identifying consumer behaviour and purchasing trends. Graph
databases behave differently than relational databases.
Data protection: Data security is usually a top trend in data management, as organisations
emphasise data integrity and reduce risks of data breaches and loss. Secure data solutions, whether
on-premises or in hybrid multi-cloud settings, enable you to get improved visibility and insights to
analyse and address risks. They also implement real-time controls and conform to a plethora of
compliance mandates such as GDPR, PCI, HIPAA, and SOX. Security infrastructure based on SIEM,
SOAR, and SASE produces highly automated detection and response systems that provide security
professionals with the most recent breakthroughs in corporate defence.
Artificial Intelligence (AI): Artificial intelligence is sometimes referred to as limited or weak AI since
it is meant to accomplish a small or particular purpose. Google search, picture recognition software,
personal assistants like Siri and Alexa, and self-driving cars are all instances of limited AI. AI reliably
and without tiredness executes frequent, high-volume computerised tasks. AI searches for structure
and regularities in data to learn a skill, transforming the algorithm into a classifier or predictor. As
a result, just as the algorithm can teach itself how to play chess, it can also educate itself on what
product to promote next on the Internet. When presented with fresh data, the models evolve over
time.
Persistent Memory (PMEM): Businesses expect higher performance when apps based on machine
learning or artificial intelligence (AI) operate in real-time. Keeping more data near to the processor
and in a permanent state gives crucial benefits such as higher throughput and lower latency.
PMEM, which was created in collaboration between Intel and Micron, allows Intel-based servers to
extend their memory footprint up to 4 TB in capacity. This delivers a significant speed improvement
for in-memory databases or locally saved information. PMEM storage medium outperforms NVMe
devices in terms of speed.
Natural Language Processing: The interaction between human language and technology is
known as natural language processing (NLP). Voice-activated technology such as Alexa, Siri, and
Google Home are well-known instances of natural language processing. NLP acts as a front-end
to an Artificial Intelligence backend, converting speech queries into actionable results. As NLP
advances, so does the gap in voice search data. Businesses must get on board with voice searches,
which are estimated to account for 50% of all searches by 2020. NLP assists in the processing and
collection of voice-based data, and it may also function internally for individuals who require data
access via voice requests.
Augmented analytics: Augmented analytics augments how individuals explore and analyse data in
analytics and BI systems by utilising enabling technologies like machine learning and AI to aid with
data preparation, insight production and insight explanation. Because the analysis is automated
and can be configured to run continually, the heavy lifting of manually sorting through enormous
volumes of complicated data (due to a lack of skills or time restrictions) is considerably minimised.
Augmented data preparation combines data from numerous sources considerably more quickly.
Embedded analytics: Those who understand the realm of analytics are also aware of how time-
consuming and ineffective data translation can be. Embedded analytics is a digital workplace
feature that allows data analysis to take place within a user’s natural workflow rather than
switching to another application. Furthermore, embedded analytics is typically used to optimise
certain marketing campaigns, sales lead conversions, inventory demand planning and financial
budgeting. Embedded analytics may help an organisation to decrease the amount of time analytics
professionals spend transforming data into readable insights for everyone. That means they can
do what they do best: evaluate data to generate answers and plans. It even enables employees to
engage with data and create simple visuals for better decision-making.
These capabilities are considered advanced in relation to DBMSs. Many complex database management
systems are available as cloud services, although others remain available as software solutions, the most prominent of which is MongoDB. Navicat for MongoDB is a useful tool if you want to
understand more about it. It supports database objects including Collections, Views, Functions, Indexes,
GridFS, and MapReduce. An object designer is also available for creating, modifying and designing
database objects.
Advanced data management has long been at the heart of effective database and information systems.
It covers a wide range of data models and the fundamentals of organising, processing, storing and
querying data using these models.
Data management is the process of absorbing, storing, organising and preserving an organisation’s
data.
The data management process consists of a number of tasks that work together to ensure that the
data in business systems is correct, available and accessible.
When dealing with datasets, the type of data plays an essential role in determining which
preprocessing approach would work best for a certain set to obtain the best results
Structured data is arranged into a prepared repository, usually a database, so that its pieces may be
addressed for more efficient processing and analysis
Semi-structured data is arranged into a specific repository, such as a database, but includes
associated information, such as metadata, that makes it easier to process than raw data.
Unstructured data is not arranged in a predefined format or repository, which makes it harder to access and process than structured data
A data model is a representation of the logical interrelationships and data flow between various
data items in the information domain.
Data models include vital information for organisations because they define the relationships
between database tables, foreign keys, and the events involved.
Online Analytical Processing (OLAP) refers to a type of software that allows users to examine data
from various database systems at the same time.
Dimensional Modelling (DM) is a data structure approach designed specifically for data storage in
a data warehouse.
1.9 GLOSSARY
Database: A set of data that is organised to allow users to find and retrieve it quickly and easily
DBMS: A computerised database management system
OLAP: A powerful analysis tool used for forecasting, statistical computations and aggregations
Data management: It is the process of absorbing, storing, organising and preserving an
organisation’s data.
Structured data: It is arranged into a prepared repository, usually a database, so that its pieces may
be addressed for more efficient processing and analysis
Data model: A representation of the logical interrelationships and data flow between various data
items in the information domain.
Dimensional Modelling (DM): It is a data structure approach designed specifically for data storage
in a data warehouse.
https://pdfs.semanticscholar.org/d127/c1bf1fb31fe33f054f3bd2dc4f44c0615987.pdf
Using various sources, find the applications of OLAP in the real-world and discuss it with your
classmates.
UNIT
02 Advanced SQL
Names of Sub-Units
Complex Retrieval Queries using the Group By Clause, Recursive Queries, Nested Queries, Specifying
Constraints as Assertions, Event Condition Action (ECA) model (Triggers) in SQL, Creating and
working with Views in SQL, Database Programming, Embedded SQL, Dynamic SQL, SQLJ, and Stored
Procedures in SQL.
Overview
The unit begins by discussing the concept of complex retrieval queries with the help of the GROUP BY clause. Next, the unit discusses the concepts of recursive queries, nested queries, and specifying constraints as assertions. Further, the unit discusses the Event Condition Action (ECA) model in SQL. Next, the unit discusses the process of creating views in SQL and database programming. The unit also discusses embedded SQL, dynamic SQL, and SQLJ. Towards the end, the unit discusses stored procedures in SQL.
Learning Objectives
Learning Outcomes
http://tinman.cs.gsu.edu/~raj/books/Oracle9-chapter-6.pdf
2.1 INTRODUCTION
Databases are becoming more interconnected, and this has produced a demand for a standard language that can be utilized in a variety of computing contexts. SQL has proven to be that standard language
because it allows programmers or developers to learn a single set of instructions and use them to
generate, retrieve, edit, or transfer data regardless of whether they are working on a PC, workstation,
mini, or mainframe.
SQL is an abbreviation for Structured Query Language. SQL is used in a database management system
to build, delete, and edit databases and database objects, as well as to save, retrieve, and update data
in a database. SQL is a language that is used to create, access, and manipulate database management
systems. SQL is compatible with all current relational database management systems, including SQL
Server, Oracle, and MySQL.
Consider the result of selecting the category_id and year_released columns from the movies table without a GROUP BY clause:
category_id year_released
1 2011
2 2008
NULL 2008
NULL 2010
8 2007
6 2007
6 2007
8 2005
NULL 2012
7 1920
The given result has several duplicates. Let’s do the same query in SQL using group by:
SELECT `category_id`, `year_released` FROM `movies`
GROUP BY `category_id`, `year_released`;
When we run the above script in MySQL workbench against the myflixdb database, we get the results
displayed below:
category_id year_released
NULL 2008
NULL 2010
NULL 2012
1 2011
2 2008
6 2007
7 1920
8 2005
8 2007
In our previous example, the GROUP BY clause uses both the category id and the year released to identify
unique data.
If a row has the same category id but a different year of release, it is considered unique. If the category
id and year released are the same for more than one row, it is deemed a duplication, and only one row
is displayed.
Rows can also be grouped by each unique gender value posted, with the number of rows in each group tallied using the COUNT aggregate function; a sketch of such a query is shown below.
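A minimal sketch (assuming a members table with a gender column in the same myflixdb database):

-- Count the rows in each gender group.
SELECT `gender`, COUNT(*) AS total_members
FROM `members`
GROUP BY `gender`;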
The SQL GROUP BY Clause is used to group rows that have the same values. The SQL SELECT statement
is used in conjunction with the GROUP BY Clause. The GROUP BY clause’s SELECT statement can only
contain column names, aggregation functions, constants, and expressions. The SQL Having Clause is
used to limit the results of the GROUP BY clause. The MYSQL GROUP BY Clause is used to aggregate data
from several records and return records that are defined by one or more columns.
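A short sketch combining GROUP BY and HAVING on the movies table used earlier (the threshold is illustrative) counts the titles in each category and keeps only categories with more than one movie:

SELECT `category_id`, COUNT(*) AS movie_count
FROM `movies`
GROUP BY `category_id`
HAVING COUNT(*) > 1;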
The created table depicts an organisational structure with hierarchical employee-manager data. Figure 1 displays the employee table in hierarchical order:
[Figure 1: Employee hierarchy rooted at Employee # 801, whose Manager Employee # is NULL]
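The recursive part of the statement (a sketch, assuming Teradata-style WITH RECURSIVE syntax and an employee table with employee_number and manager_employee_number columns) seeds temp_table with the direct reports of manager 801 and then repeatedly joins back to employee to add indirect reports:

WITH RECURSIVE temp_table (employee_number) AS
( SELECT root.employee_number
  FROM employee root
  WHERE root.manager_employee_number = 801
  UNION ALL
  SELECT indirect.employee_number
  FROM temp_table direct, employee indirect
  WHERE direct.employee_number = indirect.manager_employee_number
)
SELECT * FROM temp_table ORDER BY employee_number;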
The final query, which is not part of the recursive WITH clause, retrieves employee data from temp_table:
SELECT * FROM temp_table ORDER BY employee_number;
The recursive query yielded the following results:
employee_number
1001
1002
1003
1004
1006
1008
1010
1011
1012
1014
1015
1016
1019
2.3.2 Using the RECURSIVE Clause in a CREATE VIEW Statement to Create a View
Using the RECURSIVE clause to create a view is analogous to using the WITH RECURSIVE clause to
precede a query.
Take a look at the employee table from Using the WITH RECURSIVE Clause. The following statement
generates the hierarchy 801 view, which uses a recursive query to retrieve the employee numbers of all
employees who report to the manager with employee number 801 directly or indirectly:
CREATE RECURSIVE VIEW hierarchy_801 (employee_number) AS
( SELECT root.employee_number
  FROM employee root
  WHERE root.manager_employee_number = 801
  UNION ALL
  SELECT indirect.employee_number
  FROM hierarchy_801 direct, employee indirect
  WHERE direct.employee_number = indirect.manager_employee_number
);
Use the SELECT command on the table to get the employee details:
SELECT * FROM hierarchy_801 ORDER BY employee_number;
The result of the given SQL query is as follows:
employee_number
1001
1002
1003
1004
1006
1008
1010
1011
1012
1014
1015
1016
1019
COURSE table:
C_ID  C_NAME
C1    DSA
C2    Programming
C3    DBMS

STUDENT_COURSE table:
S_ID  C_ID
S1    C1
S1    C3
S2    C1
S3    C2
S4    C2
S4    C3
IN: If we want to find out which S_IDs are enrolled in C_NAME 'DSA' or 'DBMS', we can use an independent nested query and the IN operator. We can locate the C_IDs for C_NAME 'DSA' or 'DBMS' in the COURSE table, and we can utilize these C_IDs to get the S_IDs from the STUDENT_COURSE table. The steps for utilizing these C_IDs to get the S_IDs are as follows:
STEP 1: Determine C_ID for C_NAME = “DSA” or “DBMS”
Select C_ID from COURSE where C_NAME = 'DSA' or C_NAME = 'DBMS'
STEP 2: Using the C_ID from step 1 to obtain the S_ID
Select S_ID from STUDENT_COURSE where C_ID IN (SELECT C_ID from
COURSE where C_NAME = 'DSA' or C_NAME='DBMS');
The inner query will yield a set with members C1 and C3, while the outer query will return S_IDs
whose C_ID matches any member of the set (C1 and C3 in this case). As a result, it will return S1, S2,
and S4.
If we wish to find the names of STUDENTS who have enrolled in ‘DSA’ or ‘DBMS,’ we may do it as
follows:
Select S_NAME from STUDENT where S_ID IN (Select S_ID from STUDENT_
COURSE where C_ID IN (SELECT C_ID from COURSE where C_NAME='DSA'
or C_NAME='DBMS'));
NOT IN: If we wish to locate the S_IDs of STUDENTS who have not enrolled in ‘DSA’ or ‘DBMS,’ we may
do it as follows:
SELECT S_ID FROM STUDENT WHERE S_ID NOT IN
(SELECT S_ID FROM STUDENT_COURSE WHERE C_ID IN
(SELECT C_ID FROM COURSE WHERE C_NAME='DSA' OR C_NAME='DBMS'));
The innermost query will return a set containing C1 and C3 elements. The second inner query will
yield S_IDs whose C_ID is equal to any member of the set (in this example, C1 and C3), which are S1,
S2, and S4. The outermost query will yield S_IDs that are not members of the set (S1, S2, and S4). As
a result, it will return S3.
Nested queries with correlations: In correlated nested queries, the result of the inner query depends on the row currently being processed in the outer query. For example, if we want to find out the S_NAME of STUDENTS who are enrolled in C_ID 'C1', we may use a correlated nested query such as:
SELECT S_NAME FROM STUDENT S WHERE EXISTS
(SELECT * FROM STUDENT_COURSE SC WHERE S.S_ID = SC.S_ID AND SC.C_ID = 'C1');
For each row of STUDENT S, it will look for rows from STUDENT_COURSE where S.S_ID = SC.S_ID and SC.C_ID = 'C1'. If there is at least one entry in STUDENT_COURSE SC with C_ID = 'C1' for an S_ID from STUDENT S, the inner query will return true, and the associated S_ID will be returned as output.
Embedded SQL allows us to use databases as and when we need them. Several things must be taken care
of once the application we design is put into production.
We may simply use the database with the aid of query embedding without writing any hefty code. We
can use embedded SQL to construct APIs that can simply fetch and feed data as needed.
We need certain tools in each high-level language to use embedded SQL. In other circumstances, we
have built-in libraries that give us fundamental building blocks.
In certain circumstances, we must import or use other packages to complete the job at hand.
In Java, for example, we require a connection class. We start by creating a connection with the connection class, and then we open the connection by providing the essential parameters needed to connect to the database.
2.11 SQLJ
SQLJ is a language standard for encoding and executing a static SQL statement as an embedded SQL
statement in Java.
[Figure: SQLJ program translation: an SQLJ source file is processed by the SQLJ translator into a Java source file, which is then compiled by the Java compiler]
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('CC01','YOGITA','SINGH','MANGALORE','F', 750000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('CC02','SANTHOSH','VERMA','MANGALORE','M', 500000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('SE01','VEENA','DEVI','DELHI','M', 800000);
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, ADDRESS, SEX, SALARY) VALUES
('IT01','NAGESH','JAIN','MEERUT','M', 700000);
The SQL command for creating a view (with check option) is as follows:
CREATE VIEW sales_staff AS
SELECT fname, ssn, dno
FROM employee
WHERE dno = 5
WITH CHECK OPTION CONSTRAINT sales_staff_cnst;

View created.
The SQL command for selecting from a view is as follows:
select * from sales_staff;
The SQL command for dropping a view is as follows:
DROP VIEW sales_staff;
2. Create a row level trigger for the customer table that would fire for INSERT or UPDATE or DELETE
operations performed on the CUSTOMER table. This trigger should display the salary difference
between the old values and new values.
Ans. The SQL command for creating a trigger is as follows:
CREATE OR REPLACE TRIGGER display_salary_changes
BEFORE DELETE OR INSERT OR UPDATE ON customers
FOR EACH ROW
WHEN (NEW.ID > 0)
DECLARE
sal_diff number;
BEGIN
sal_diff := :NEW.salary - :OLD.salary;
dbms_output.put_line('Old salary: ' || :OLD.salary);
dbms_output.put_line('New salary: ' || :NEW.salary);
dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/
The PL/SQL block for updating salaries and checking the salary difference is as follows:
DECLARE
total_rows number(2);
BEGIN
UPDATE customers
SET salary = salary + 5000;
IF sql%notfound THEN
dbms_output.put_line('no customers updated');
ELSIF sql%found THEN
total_rows := sql%rowcount;
dbms_output.put_line( total_rows || ' customers updated ');
END IF;
END;
/
The database is becoming more interconnected, and this has produced a demand for standard
language that can be utilized in a variety of computer contexts.
The GROUP BY clause is a SQL statement that is used to group rows with similar values. In the
SELECT query, the GROUP BY clause is utilized.
A recursive query is a method of querying data hierarchies such as an organizational structure, bill-
of-materials, or document hierarchy.
A query is written within another query in nested queries.
In SQL, users can use the DDL's CREATE ASSERTION statement to create more general constraints, such as attribute-level or table-level check constraints, that do not fall into any of the other categories.
A trigger is a type of stored procedure that executes in reaction to particular table operations such as insertion, deletion or updating of data.
The trigger is based on the ECA model, which stands for Event-Condition-Action.
A view is nothing more than a SQL statement with a name that is saved in the database.
The CREATE VIEW statement is used to create database views.
Database programming is the process of creating and managing a database for an application.
SQL stands for Structured Query Language. It is the language we use to conduct database operations
and transactions.
Dynamic SQL is a programming approach that allows you to write SQL queries as they are running.
SQLJ is a language standard for encoding and executing a static SQL statement as an embedded
SQL statement in Java.
Stored Procedures are programs that are written to conduct one or more DML operations on a
database.
2.15 GLOSSARY
SQL: A language that is used to create, access, and manipulate database management systems
Recursive query: A method of querying data hierarchies such as an organizational structure, bill-
of-materials, or document hierarchy
Nested query: A query is written within another query
Trigger: A type of stored procedure that executes in reaction to particular table operations such as insertion, deletion or updating of data
Database programming: The process of creating and managing a database for an application
Dynamic SQL: A programming approach that allows you to write SQL queries as they are running
Stored procedures: Programs that are written to conduct one or more DML operations on a database
https://docs.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver15
https://www.temok.com/blog/static-vs-dynamic-sql/
Discuss with your classmates about the concept of trigger, dynamic SQL, and embedded SQL.
UNIT
03 Advanced Transaction Processing & Recovery
Names of Sub-Units
Overview
This unit begins by discussing the concept of advanced transaction processing & recovery and a review
of ACID properties and serializability. Next, the unit describes the multiversion concurrency control
techniques and granularity of data items. Further, the unit explains the multiple granularity locking.
Towards the end, the unit covers the advanced database recovery techniques.
Learning Objectives
Learning Outcomes
https://www.uio.no/studier/emner/matnat/ifi/INF5100/h04/undervisningsmateriale/
HANDOUTS/L05-transaction-models.pdf
3.1 INTRODUCTION
A transaction is a unit of database operation. It can be a read operation, a write operation, or a read/write operation. It can be said that a transaction is a sequence of many actions grouped together and considered as an atomic unit of work. A query written in Structured Query Language (SQL) may involve more than one transaction that includes reading or writing data from various tables. Every day, database systems deal with such a large number of transactions that they are often referred to as transaction processing systems.
A transaction can involve four types of operations, which are described as follows:
READ (A): Reads data from the database tables that are already stored in the memory, where A
signifies the record that is to be read
WRITE (A): Writes the record A that may be the result of a transaction back to the main memory or
to the disk
COMMIT: Makes sure that all the changes made by the transactions are made permanent
ROLLBACK: Terminates the transaction in case of an unsuccessful transaction and undoes any
changes made in the database
Let’s take an example to clarify the idea about how transaction takes place in case of a library
management system when a user is about to borrow a book B from the library online. Let’s assume
that the account number for the customer is A. In the following example, the code is written in pseudo-
language:
BEGIN
    READ A
    IF A exists THEN
        Look for the availability of the book B
        IF B is available THEN
            Assign the book to user A
            Set the borrow_period
            WRITE A.borrow_period;
            WRITE B.borrow_period;
            COMMIT
        ELSE
            BEGIN
                DISPLAY "Book is not available."
                ROLLBACK
            END
        END IF
    ELSE
        DISPLAY "Account does not exist."
    END IF
END
In the preceding example, the aforementioned four operations are used to make you understand how
these operations take place in a transaction. This example shows how a user can borrow a book online by
checking the availability of the book. In this case, the user database as well as the database maintained
for the books has been updated. After the successful updating of the data, the database makes the changes permanent by executing the COMMIT operation. Otherwise, it rolls back the operation in case of an unsuccessful operation.
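A minimal SQL sketch of the same idea (the accounts table and the account identifiers are hypothetical) wraps two updates into a single atomic unit:

-- Hypothetical funds transfer; either both updates persist or neither does.
START TRANSACTION;

UPDATE accounts SET balance = balance - 250 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 250 WHERE account_id = 'B';

-- Make the changes permanent; issue ROLLBACK instead if any step fails.
COMMIT;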
3.2.1 Serializability
Serializability theory in DBMS deals with maintaining the correctness of concurrent execution. When two
transactions are executed concurrently, there is no way to determine whether any problem has occurred
or not. Serializability provides us a means to schedule the transactions so that these two transactions
never conflict with each other. To know more about the rule of serializability, it is required to know about
a serial schedule from where the name is derived and also about the equivalent schedule.
Serial Schedule
A serial schedule is a type of schedule in which transactions never interleave with each other, i.e., after the complete execution of one transaction, another transaction begins its execution. Table 1 presents an
example of a serial schedule:
Schedule 1 (T1 followed by T2):
T1: READ A; A := A – 250; WRITE A; READ B; B := B + 250; WRITE B
T2: READ A; READ B; SHOW A + B

Schedule 2 (T2 followed by T1):
T2: READ A; READ B; SHOW A + B
T1: READ A; A := A – 250; WRITE A; READ B; B := B + 250; WRITE B
Equivalent Schedule
Two schedules are equivalent if the execution of one schedule leaves the database in a state that is similar to the state acquired after the execution of the other schedule. Table 2 presents an example of an equivalent schedule:
Schedule 1 (serial):
T1: READ A; A := A – 250; WRITE A; READ B; B := B + 250; WRITE B
T2: READ A; A := A * 10; WRITE A; READ B; B := B / 5; WRITE B

Schedule 2 (interleaved, equivalent to Schedule 1):
T1: READ A; A := A – 250; WRITE A
T2: READ A; A := A * 10; WRITE A
T1: READ B; B := B + 250; WRITE B
T2: READ B; B := B / 5; WRITE B
Schedule S1 (interleaved):
T1: READ A
T1: WRITE A
T2: READ A
T2: WRITE A
T1: READ B
T1: WRITE B
T2: READ B
T2: WRITE B
The preceding schedule shows interleaved transactions. Each transaction is divided into small units of operations, such as t(1) and t(2) for transactions T1 and T2, respectively. The rule of conflict serializability states that two operations in a schedule conflict if the following conditions hold:
The two operations (READ or WRITE) in a schedule belong to different transactions
The two operations access the same data item
One of the two operations is a WRITE operation
From the preceding schedule, the following cases can be considered to detect conflicts:
Case 1: If t(1) is READ A and t(2) is also READ A, then they read the same data item A; however, none
of them interferes with other operations. Thus, they perform independently and this interleaved
operation does not create any conflict. Here, the order of execution does not matter.
Case 2: If t(1) is WRITE A and t(2) is READ A, then the order of execution does matter since the value
of A written by t(1) can be read by t(2) if t(1) is executed before t(2). If t(2) is executed before t(1), then
t(2) reads the value that is previously in the database. Thus, this execution can raise a conflict.
Case 3: If t(1) is WRITE A and t(2) is WRITE A, then the order of execution does matter since the write
operation, which is executed at last, should write its value to the database that retains for the next
operation. If t(1) executes at first, then the value written by t(2) should be the value of A after the
execution of t(2) and if t(2) executes first, then the value written by t(1) should be the final value of A.
Thus, it can be said that for the second case, conflict may arise between t(1) and t(2) since they operate
on the same data and one of them is a WRITE operation. The conflicting operations in schedule S1 are
presented in Table 4:
Schedule S1 (the conflicting operations are the pairs that access the same data item from different transactions, where at least one of the pair is a WRITE):
T1: READ A
T1: WRITE A
T2: READ A
T2: WRITE A
T1: READ B
T1: WRITE B
T2: READ B
T2: WRITE B
“A schedule is called conflict serializable if it is conflict equivalent to some serial schedule.”
Now the concern is what is conflict equivalent? Conflict equivalence between two schedules S1 and S2
can be defined as follows:
“Schedule S1 is said to be conflict equivalent with schedule S2 if they perform the same actions of the
transaction on the same database item and every pair of conflicting operations should be in the same
order.”
Table 5 presents an example of conflict equivalence, where the order of conflicting operations in schedule
S1 is in the same order as that in S2:
Schedule S1 (interleaved):
T1: READ A; T1: WRITE A; T2: READ A; T2: WRITE A; T1: READ B; T1: WRITE B; T2: READ B; T2: WRITE B

Schedule S2 (serial):
T1: READ A; T1: WRITE A; T1: READ B; T1: WRITE B; T2: READ A; T2: WRITE A; T2: READ B; T2: WRITE B
Schedule S1
T1 T2 T3
READ A
WRITE A
WRITE A
WRITE A
Schedule S2
T1 T2 T3
READ A
WRITE A
WRITE A
WRITE A
It can be noticed from the preceding two tables that in schedule S1, T1 and T3 write data that have
not been read. This type of writing is known as blind write and appears only in the view serializable
schedule, which is not conflict serializable, because view-serializable schedule handles blind write more
effectively.
The two schedules S1 and S2 are not conflict equivalent, but they are view equivalent. Thus, it is not
necessary that any view-serializable schedule is conflict serializable. However, every conflict-serializable
schedule is view serializable. It is easier to implement conflict serializability than view serializability.
According to the rule stated in rule 1, transaction T can be rolled back. This is because if transaction T
is attempting to write a version of A, which has been read by some other transaction at the same time,
then this causes conflict. Therefore, T needs to be rolled back.
In addition, from rule 2, it can be concluded that the READ operation following this protocol will always
be successful since it reads the latest version of the data item by comparing the WRITE_TIMESTAMP of
various versions of the same data item.
However, one major problem of this protocol is that it allows cascading rollback. To avoid this scenario,
each WRITE operation needs to be completed before its timestamp has been associated with a data item.
This reduces the risk of cascading rollback. The protocol also needs to update three parameters to update a version, which requires two disk accesses.
[Figure: Granularity hierarchy with the database DB at the root, disk blocks DB1, DB2, …, DBN below it, files (such as F1A, F1B) within each block, and individual records at the leaf level]
The data item size is often called data item granularity. The fine granularity is referred to as the leaf level
nodes that contain the records or data items. On the other hand, the coarse granularity is referred to as
the nodes that consist of the number of descendants. If a transaction considers fine granularity, then
the degree of concurrency is higher than when a transaction considers coarse granularity.
Now the concern is what should be the best size of a data item? The answer is that it depends on the size
of the transaction. If a transaction is small, then it is feasible to hold the locks on the single data items;
if the transaction is long, then it would be feasible to hold the lock on a particular block in the database
that contains the data items, which are required to be locked.
In the next subsection, the multiple granularity 2PL protocol will be devised that helps to maintain
serializability and concurrency in the schedule.
Next, let's see how locks on data items are acquired when granularity is implemented. If a transaction T1
wants to lock the entire database, then it will be practical to lock the entire database instead of locking
each and every data item in the database. If another transaction T2 wants to lock the data item that
resides in the database, then it is required to traverse the granularity tree from root to leaf to find any
conflicting operations for the same data item that T2 is requested for. If there exists any conflicting
operation on the same data item, then T2 is forced to wait.
The concept of granularity can be implemented in the 2PL protocol, which enhances the performance of the 2PL protocol. In this protocol, the transactions can acquire the locks at any level of the granularity tree
instead of locking each item under the node.
Intention locks indicate, at the nodes that lie at a higher level than the desired node, the type of lock held on the node that is locked explicitly in the granularity tree. For example, if a transaction desires to lock node F1A to update some of its descendants, then all the ancestor nodes in the path from the root to F1A should hold an intention lock of exclusive type. There are three types of intention locks, which are described as follows:
Intention shared lock or IS lock: Indicates that shared locks are acquired explicitly on one or more descendants at a lower level.
Intention exclusive lock or IX lock: Indicates that exclusive or shared locks are acquired explicitly on one or more descendants at a lower level.
Shared and intention exclusive lock or SIX lock: Indicates that a shared lock is held on the root of the sub-tree and exclusive locks are acquired explicitly on descendants at a lower level. For example, if a transaction requests a SIX lock on F1A, then the F1A node acquires a shared lock and the descendant nodes that it updates acquire exclusive locks.
To map the compatibility between different types of locking, compatibility matrix is used, which is shown
in Table 7:
IS IX S SIX X
IS TRUE TRUE TRUE TRUE FALSE
IX TRUE TRUE FALSE FALSE FALSE
S TRUE FALSE TRUE FALSE FALSE
SIX TRUE FALSE FALSE FALSE FALSE
X FALSE FALSE FALSE FALSE FALSE
This matrix is helpful for developing the multiple granularity 2PL protocol. This protocol consists of the
following rules:
Compatibility matrix should be considered while making any decision
Root of granularity tree must be locked first and it can be locked in any mode
A node Ni in the granularity tree can be locked by any transaction Ti in S or IS mode only if the
parent node of Ni has already been locked by Ti in either IX or IS mode
A node Ni in the granularity tree can be locked by any transaction Ti in X, SIX, or IX mode only if the
parent node of Ni has already been locked by Ti in either IX or SIX mode
A transaction Ti can lock a node if it has not previously unlocked any node (This implements 2PL
protocol)
A transaction Ti can unlock a node Ni if none of the child nodes of Ni is currently locked by Ti
From the above-stated rules, it can be seen that the first rule prevents conflicting operations from occurring, whereas rules 2, 3 and 4 state the mode in which other nodes of the sub-tree must be locked when a node acquires a lock. Rules 5 and 6 enforce the 2PL protocol. It can also be seen that locks in the granularity tree are acquired in a top-down manner, whereas they are released in a bottom-up manner.
Suppose transaction T1 needs to read all the data items of the disk block DB1. Thus, T1 needs to lock the database DB in IS mode and DB1 in S mode. Take another transaction T2 that wants to update the record R1A5. It needs to acquire IX locks on the database DB, the disk block DB1 and the file F1A. On the other hand, a third transaction T3 wants to update the entire file F3A. For this purpose, it needs to lock DB and DB1 in IX mode and
F3A in X mode. If these three transactions are considered, then it can be concluded that transactions
T2 and T3 can take place concurrently since they are operating on different data items in the same disk
block; however, transaction T1 can never be executed concurrently with T2 and T3. Thus, it can be said
that this protocol enhances the concurrency of the transactions and reduces the lock overhead. The lock manager in this case does not need to maintain so many locking and unlocking operations. This protocol is especially designed for the following purposes:
Short transactions that access only a few data items
Long transactions that access an entire file, for example, when generating reports
Advanced Data Management Techniques
documented in the log-on disc by force-writing before they are applied to the database on disc,
allowing for recovery. If a transaction fails after recording certain changes in the database on disc but
before reaching its commit point, the effect of the transaction’s activities on the database must be
reversed. In other words, the transaction must be rolled back. During recovery, both undo and redo may be necessary in the case of an immediate update. This method, known as the UNDO/REDO
algorithm, necessitates both procedures during recovery and is most commonly utilised in practice.
The UNDO/NO-REDO method is a variant of the algorithm in which all modifications must be saved
in the database on disc before a transaction commits, requiring just undo.
The UNDO and REDO operations must be idempotent, which means that doing an operation numerous times is identical to performing it just once. In fact, the entire recovery process should be
idempotent because if the system fails during the recovery process, the subsequent recovery attempt
may UNDO and REDO some write_item operations that were previously performed during the initial
recovery process. The outcome of recovering from a system crash during recovering should be the
same as recovering when there is no crash while recovering!
is kept. The second option, known as shadowing, writes an updated buffer to a new disc location to
preserve several copies of data items, however, this approach is rarely utilised in reality. In general, the
previous value of the data item prior to updating is referred to as the before image (BFIM) and the new
value after updating is referred to as the after image (AFIM). If shadowing is employed, both the BFIM
and the AFIM can be retained on disc; hence, keeping a log for recovery is not necessarily essential.
[Figure: Buffer cache]
Analysis Phase
In the analysis phase, the dirty page table is formed by analysing the pages from the buffer, and a set of active transactions is identified. When the system encounters a crash, the ARIES recovery manager starts the analysis phase by finding the last checkpoint log record. After that, it prepares the dirty page table. This phase mainly prepares the set of active transactions that need to be undone.
In the analysis phase, after locating the last checkpoint log record, the log is scanned in the forward direction, and the set of active transactions, the transaction table and the dirty page table are updated in the following manner:
If the recovery manager finds a transaction in the transaction table that is not in the active transaction set, it adds that transaction to the set.
If it finds an end log record, the corresponding transaction is deleted from the transaction table.
If it finds a log record that describes an update operation, the corresponding page is added to the dirty page table.
The transactions that are added to the active transaction set are all required to be undone.
Redo Phase
The redo phase is the second phase, in which all the operations that need to be executed again are re-applied. It executes those operations whose results are not yet reflected on the disk. This is done by finding the smallest LSN of all the dirty pages in the dirty page table, which defines the log position from which the redo operation will start. This position indicates that the changes made earlier are either in the main memory or have already been flushed to the disk. Thus, for each change recorded in the log, the redo phase determines whether or not the operation needs to be re-executed.
Undo Phase
In the undo phase, all the transactions that are listed in the active transaction set have to be undone. Thus,
the log should be scanned backward from the end and the recovery manager should undo the necessary
operations. Each time an operation is undone, a compensation log record has been written to the log. This
process continues until there is no transaction left in the active transaction set.
After the successful completion of this phase, a database can resume its normal operations.
3.5.5 Checkpoints
It may appear that the techniques discussed so far are sufficient to recover from a failure. However, each recovery requires scanning the log completely to find which transactions are committed and which are not. In addition, depending on the state of the transaction, different recovery techniques are applied. Consider Figure 3, where T1, T2, T3, T4 and T5 are executing concurrently. If a failure occurs at time t, then redo recovery needs to be applied to T1, T3 and T4, and undo recovery can be applied to T2 and T5. Now, it can be seen that T1, T3 and T4 are already committed and their values are already stored in the database. However, after the failure, all of these transactions would still have to be redone, which is not efficient in terms of the utilisation of resources. Hence, the idea of a checkpoint arises in this situation.
Figure 3 shows the set of interleaved transactions:
data items that are required by the currently executing transaction. This, at the same time, minimises
the load of log and it is not needed to redo operation in case of a failure for a committed transaction. In
a simple checkpoint, the following conditions should hold:
No new transactions are accepted. The system needs to focus on the currently executing transaction.
Wait is required until all the transactions can commit or abort on that time boundary and then
transfer the log to the stable storage.
Write a log record <CHECKPOINT> in the log so that during the recovery, the system identifies all the
uncommitted transactions, which occur after the checkpoint.
Update the database with all the data items that are updated by these transactions.
After one checkpoint has been finished, the system resumes accepting new transactions.
Therefore, it is needed to apply redo and undo operations only for the transactions that begin their execution after a checkpoint. This simplifies and speeds up the recovery process.
Let’s modify the previous example by applying checkpoint at time t1. Now, if the system fails at time
t, then in the course of recovery, the system should do the following operations:
T1 and T4 will not participate in the recovery process as they are already committed before the
checkpoint and their changes have already been written to the database.
T2 will participate in the recovery process as it is committed after the checkpoint and before the
failure. Thus, it needs to undergo redo recovery process.
For transaction T3, as it starts after the checkpoint and finishes before the failure, it also needs to
undergo redo recovery process. Figure 4 shows an example of a checkpoint:
[Figure 4: Transactions shown along a time axis, with a checkpoint at time t1 and a failure at time t]
Transaction is a unit of database operation. It can be a read operation, a write operation, or a read/
write operation.
Serializability theory in DBMS deals with maintaining the correctness of concurrent execution.
A serial schedule is a type of schedule in which transactions never interleave with each other.
Two schedules are equivalent if the execution of one schedule leaves the database in a state similar to the state produced by the execution of the other schedule.
View serializability has less stringent rules than the rules applied for conflict serializability.
Multi-version 2PL protocol is based on the 2PL protocol and multi-version timestamp-based protocol.
Recovery from transaction failures often entails restoring the database to the most recent consistent
state just prior to the moment of failure.
WAL is a log-based file system recovery mechanism that writes changes to permanent data to a log
file called the WAL log.
The advantage of ARIES recovery technique is that it reduces the recovery time as well as the
overhead of a log.
In analysis phase, the dirty page table has been formed by analysing the pages from the buffer and
also a set of active transactions has been identified.
Redo phase is the second phase where all the transactions that are needed to be executed again take
place.
In the undo phase, all the transactions that are listed in the active transaction set have to be undone.
3.8 GLOSSARY
Transaction: A unit of database operation. It can be a read operation, a write operation, or a read/
write operation
Serializability: The theory in DBMS deals with maintaining the correctness of concurrent execution
Serial schedule: A type of schedule in which transactions never interleave with each other
Equivalent schedule: Two schedules are equivalent if the execution of one schedule leaves the database in a state similar to that produced by the execution of the other
View serializability: It has less stringent rules than the rules applied for conflict serializability
Multi-version 2PL protocol: It is based on the 2PL protocol and multi-version timestamp-based
protocol
Recovery: The transaction failures often entail restoring the database to the most recent consistent
state just prior to the moment of failure
WAL: A log-based file system recovery mechanism that writes changes to permanent data to a log
file called the WAL log
ARIES: The advantage of the ARIES recovery technique is that it reduces the recovery time as well
as the overhead of a log
Analysis phase: The dirty page table has been formed by analysing the pages from the buffer and
also a set of active transactions has been identified
Redo phase: The second phase where all the transactions that are needed to be executed again take
place
Undo phase: The transactions that are listed in the active transaction set have to be undone
https://www.google.co.in/books/edition/Transaction_Processing/VFKbCgAAQBAJ?hl=
en&gbpv=1&dq=acid+properties+in+dbms&printsec=frontcover
https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf
Discuss with your friends and classmates the concept of transaction processing. Also, discuss the
importance of serializability in database management.
UNIT
04 Data Security
Names of Sub-Units
Introduction, Database Security Issues, Discretionary Access Control Based on Granting and Revoking
Privileges, Mandatory Access Control, Role-based Access Control for Multilevel Security, SQL Injection,
Introduction to Statistical Database Security, Introduction to Flow Control
Overview
This unit begins by discussing the concept of data security and database security issues. Next, the unit discusses discretionary access control based on granting and revoking privileges and mandatory access control. Further, the unit explains role-based access control for multilevel security and SQL injection. Towards the end, the unit discusses the introduction to statistical database security and flow control.
Learning Objectives
Learning Outcomes
http://www.sis.pitt.edu/jjoshi/IS2935/Fall04/p227-denning.pdf
4.1 INTRODUCTION
Data security refers to the techniques and standards we employ to safeguard digital information against
unauthorised or unintentional access, alterations and disclosure throughout its lifespan. Organisational
strategies, policies, procedures and technology that can secure data against cyberattacks, criminal
intrusions or unintended data spills are also examples of data security.
Physical security of hardware and network assets housing protected data, administrative controls and
policies, and logical defensive mechanisms for the software that accesses the data are all part of data
security practice.
Database authentication: Authentication verifies that a user's login credentials match those stored in the database, which indicates that the user is authorised to access your database.
If a legitimate user has been granted only certain data access privileges, he or she cannot access data outside those privileges, and unauthorised or malicious users cannot access the database at all. As a result, database authentication is critical for ensuring database security.
Database encryption: One of the most effective methods of database security is encryption, which protects your database from unauthorised access while information is being stored and transmitted over the Internet.
AES is a widely used encryption algorithm, while MD5 and SHA-1 are hashing algorithms; together, such algorithms are used to protect various sorts of sensitive data.
Within a database, an encryption method converts plain-text data into unreadable ciphertext. Therefore, even if hackers gain access to your database, they will not be able to use your information unless they can decrypt it.
Encrypting sensitive data before putting it in a database is strongly recommended because it provides an additional layer of protection against cyber-attacks (a minimal sketch appears at the end of this section).
Backup database: Backup is a type of database security measure that is used to recover data in the event of data loss, corruption, hacking or natural disasters. It copies or archives the database to secondary storage, often in real time.
If the primary and secondary servers are configured in the same location and the primary server is
destroyed, the secondary server may be destroyed as well. As a result, you will be unable to run your
application and your system will shut down until you are able to recover.
As a result, it is recommended that the backup server be configured in a physically distinct location
to maintain database security. If the primary server is unavailable, the database can be recovered
via the secondary server.
Physical security: Physical database security refers to the safeguarding of the database server
room from unwanted access. The database server should be housed in a secure, climate-controlled
environment.
Application security: To protect against web assaults like SQL injection, you must secure your
application and database. SQL injection is one of the most prevalent web attacks, in which a hacker
takes control of an application’s database in order to steal sensitive information or destroy it.
In this attack, the attacker uses web page input to inject malicious code into a SQL query. The vulnerability arises when a programme fails to properly sanitise the user input used to build SQL statements. As a result, an attacker can use malicious SQL queries to gain access to your database and use it for destructive purposes.
You may secure your application against SQL injection attacks by using the following preventative
methods (a minimal parameterised-query sketch appears at the end of this section):
Use prepared (parameterised) statements
Use a web application firewall
Keep systems and software updated and patched
Validate user input
Restrict database privileges
Use pre-defined (stored) procedures
Access control: To guarantee database security, you must restrict access to the database so that only authorised users can reach it and no unauthorised user is admitted.
Create user accounts for DBAs who will access the database, assign them a role and restrict what
they may do with it.
As a result, access control is a sort of database security that protects your database by preventing
unauthorised people from accessing it.
Web Application Firewall (WAF): A WAF is an application-layer cyber-security technology used as a database security measure. It protects applications by screening, monitoring and blocking dangerous HTTP traffic.
This database security measure restricts who has access to the application and prevents outsiders
from using the internet to access it. A web application firewall, which will safeguard your application
and database from malicious users, should be used to secure your programme.
Use strong passwords: This is a simple but crucial suggestion for ensuring database security. As a DBA or IT officer, you should use a strong password for database login and never disclose your password to others.
If you use a simple password such as your phone number, employee ID or date of birth, hackers will
try to get in using these passwords. As a result, you’ll lose control of your database.
As a result, develop a strong password for database login that includes a combination of letters,
numbers and special characters (a total of at least 10 characters) and change it on a frequent basis.
Database auditing: Auditing is a form of database security check that helps you detect and identify unauthorised access to the database management system.
Database auditing examines log files on a regular basis to determine who accessed the database,
when they did so, how long they stayed and what they did while there. You can immediately discover
if the database server has been accessed without authorisation.
Data encryption: This method is mostly used to protect sensitive values such as credit card numbers and one-time passwords (OTPs). Encryption algorithms are used to encode the data.
Unauthorised users will have a tough time decoding this encoded data, whereas authorised users are given decryption keys to decode it.
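As a rough illustration of the encryption advice given above, the following sketch encrypts a sensitive value before it is stored, assuming Python and the third-party cryptography package with its Fernet recipe; the key handling, variable names and card number shown are illustrative assumptions only, not part of this unit.

# A minimal sketch: encrypt a sensitive value before storing it in a database.
# Assumes the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

# In practice the key is generated once and kept in a secure key store,
# never stored alongside the encrypted data.
key = Fernet.generate_key()
cipher = Fernet(key)

card_number = "4111 1111 1111 1111"                # sensitive plain text
ciphertext = cipher.encrypt(card_number.encode())  # unreadable ciphertext to store

# Later, an authorised application holding the key can recover the value.
plaintext = cipher.decrypt(ciphertext).decode()
assert plaintext == card_number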
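The prevention list above recommends prepared (parameterised) statements. The sketch below, using Python's built-in sqlite3 module, contrasts a vulnerable string-built query with a parameterised one; the employee table, its columns and the sample data are hypothetical.

# A minimal sketch of parameterised queries versus string concatenation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.execute("INSERT INTO employee VALUES ('Alice', 50000), ('Bob', 60000)")

user_input = "Alice' OR '1'='1"   # a typical injection payload

# Vulnerable: the payload becomes part of the SQL text and returns every row.
unsafe_sql = "SELECT * FROM employee WHERE name = '" + user_input + "'"
print(conn.execute(unsafe_sql).fetchall())

# Safe: the driver passes the value as data, never as SQL, so nothing matches.
safe_sql = "SELECT * FROM employee WHERE name = ?"
print(conn.execute(safe_sql, (user_input,)).fetchall())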
The typical model for discretionary access control is an access matrix M, in which each position M(i, j) specifies the privileges that subject i holds on object j. To govern the granting and revoking of privileges, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was first created. The owner of a relation is given all privileges on that relation. Using the CREATE SCHEMA command in SQL2, the DBA may designate an owner for an entire schema by creating the schema and associating the appropriate authorisation identifier with it.
The owner can pass privileges on any of the owned relations to other users by granting privileges to their accounts. In SQL, the following types of privileges can be granted on each individual relation R (a minimal access-matrix sketch follows this list):
SELECT (retrieval or read) privilege on R: Gives the account retrieval capability. In SQL, this grants the account the ability to retrieve tuples from R using the SELECT statement.
Modification privileges on R: This gives the account the capability to modify the tuples of R. In SQL, this covers three privileges: UPDATE, DELETE and INSERT, corresponding to the three SQL commands for modifying a table R. Additionally, the INSERT and UPDATE privileges can be restricted to particular attributes of R.
REFERENCES privilege on R: This gives the account the capability to reference (refer to) the relation R when specifying integrity constraints, and it can also be restricted to particular attributes of R. Note that to define a view, the account must have the SELECT privilege on all relations involved in the view definition.
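As a rough sketch of the access-matrix model that underlies discretionary control, granting and revoking can be viewed as updates to the set of privileges M(i, j); the account names, relation name and Python representation below are illustrative assumptions, not the syntax of any particular DBMS.

# A minimal sketch of an access matrix M for discretionary access control.
# M[(subject, object)] is the set of privileges that subject i holds on object j.
M = {}

def grant(subject, obj, privilege):
    """Record that 'subject' holds 'privilege' on 'obj'."""
    M.setdefault((subject, obj), set()).add(privilege)

def revoke(subject, obj, privilege):
    """Remove 'privilege' on 'obj' from 'subject', if present."""
    M.get((subject, obj), set()).discard(privilege)

def check(subject, obj, privilege):
    """Return True if the requested access is authorised."""
    return privilege in M.get((subject, obj), set())

# The owner account A1 holds all privileges on EMPLOYEE and passes SELECT to A2.
for p in ("SELECT", "INSERT", "UPDATE", "DELETE", "REFERENCES"):
    grant("A1", "EMPLOYEE", p)
grant("A2", "EMPLOYEE", "SELECT")

print(check("A2", "EMPLOYEE", "SELECT"))   # True
revoke("A2", "EMPLOYEE", "SELECT")
print(check("A2", "EMPLOYEE", "SELECT"))   # False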
A4, A4 loses the privilege entirely. As a result, a DBMS that permits privilege propagation must keep track of how all privileges were granted so that revoking a privilege can be done effectively.
4.4.1 Difference between Discretionary Access Control and Mandatory Access Control
The difference between the DAC and MAC is shown in Table 1:
directives to establish organisational processes, guiding principles and courses of action that are deemed
expedient, wise or advantageous. Policies are usually accompanied by a description of the consequences
and actions that will be implemented if the policy is broken. The label security administrator interprets and converts these rules into a collection of label-oriented policies, which specify the security labels for data and the authorisations for users; these labels and authorisations regulate access to specified protected objects. Assume a user has SELECT permission on a table. When the user issues a SELECT statement on that table, label security automatically analyses each row returned by the query to verify whether the user has permission to read the data. For example, if the user's sensitivity level is set to 20, he
or she can access any rows with a security level of 20 or below. The level specifies how sensitive the
information in a row is; the more sensitive the row, the higher its security label value. Label Security
may also be used to run security checks on UPDATE, DELETE and INSERT statements.
to organisational roles rather than to individual users. Individual users are then assigned the roles that are appropriate for them. Roles can be created and destroyed using the CREATE ROLE and DESTROY ROLE commands. When necessary, the GRANT and REVOKE commands can be used to assign and revoke
rights from roles as well as individual users. A firm, for example, may include positions such as sales
account manager, buying agent, mailroom clerk, department manager and so on. Each position can be
allocated to several people. Security privileges that are common to a position are provided to the role
name and any individual assigned to this role will have those privileges automatically granted to them.
RBAC can be used in conjunction with standard discretionary and mandatory access controls to guarantee that only authorised users in their designated roles have access to certain data or
resources. Users establish sessions in which they can activate a subset of the roles to which they are
assigned. Each session can be assigned to several roles, but it only maps to one user or one subject. Many
DBMSs support the idea of roles, which allow rights to be granted to roles.
Another significant requirement in many commercial DBMSs is separation of duties. It is needed to prevent one user from performing work that requires the participation of two or more persons, hence preventing collusion. Mutual exclusion of roles is one approach for successfully
implementing separation of duties. If a user cannot employ both roles at the same time, they are said to
be mutually exclusive. Role mutual exclusion may be divided into two types: authorisation time exclusion
(static) and runtime exclusion (dynamic). Two roles that have been declared as mutually exclusive
cannot be part of a user’s authorisation at the same time in authorisation time exclusion. Both of these
roles can be permitted to a single user under runtime exclusion, but they cannot be engaged by the user
at the same time. Another type of mutual exclusion of roles is total and partial exclusion.
The role hierarchy in RBAC is a logical approach to structure roles to represent the lines of authority
and responsibility in the company. Junior positions at the bottom are typically linked to more senior
roles as one goes up the hierarchy. As partial orders, the hierarchic diagrams are reflexive, transitive
and anti-symmetric. In other words, if a person holds one position, he or she is automatically assigned
to roles below it in the hierarchy. Defining a role hierarchy entails deciding on the type of hierarchy and the
roles and then putting the hierarchy into action by assigning roles to other roles. The following method
may be used to construct a role hierarchy:
GRANT ROLE full_time TO employee_type1
GRANT ROLE intern TO employee_type2
The examples above show how to assign the roles of full-time and intern to two different sorts of workers.
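The role-to-privilege and user-to-role assignments described above, together with a simple authorisation-time mutual-exclusion check, can be sketched as follows; all role, privilege and user names are hypothetical.

# A minimal sketch of role-based access control with mutually exclusive roles.
role_privileges = {
    "full_time": {"read_payroll", "submit_timesheet"},
    "intern": {"submit_timesheet"},
    "auditor": {"read_payroll", "read_audit_log"},
}
user_roles = {"employee_type1": {"full_time"}, "employee_type2": {"intern"}}

# Authorisation-time (static) exclusion: these roles may never be held together.
mutually_exclusive = {frozenset({"auditor", "full_time"})}

def assign_role(user, role):
    roles = user_roles.setdefault(user, set())
    for pair in mutually_exclusive:
        if role in pair and (pair - {role}) & roles:
            raise ValueError(role + " conflicts with an existing role of " + user)
    roles.add(role)

def authorised(user, privilege):
    """A user is authorised if any of their roles carries the privilege."""
    return any(privilege in role_privileges.get(r, set())
               for r in user_roles.get(user, set()))

print(authorised("employee_type1", "read_payroll"))   # True
print(authorised("employee_type2", "read_payroll"))   # False
assign_role("employee_type2", "auditor")              # allowed: intern and auditor do not conflict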
Identity management is another security concern. The term “identity” refers to a person’s distinct name.
Because legal names are not always unique, a person’s identification must include enough additional
information to make the whole name unique. Identity Management refers to the process of authorizing
this identity and controlling the schema of these identities. Identity Management is concerned with how
businesses can efficiently verify individuals and regulate their access to sensitive information. It has
grown more obvious as a business need in many industries, affecting businesses of all sizes. Identity
Management administrators must continuously satisfy application owners while controlling costs
and enhancing IT efficiency. Another essential factor to consider in RBAC systems is the possibility of temporal constraints on roles, such as the time and duration of role activations and the timed triggering of a role by the activation of another role. Using an RBAC architecture to fulfil the major security needs of Web-based applications is a highly desirable aim. Roles can be assigned to workflow tasks so that a user who holds any of the roles associated with a task is permitted to execute it, and a user may be allowed to play a particular role only for a limited duration.
RBAC models have a number of appealing characteristics, including flexibility, policy neutrality,
improved support for security management and administration and other characteristics that make
them appealing candidates for developing secure Web-based applications; these characteristics are lacking in DAC and MAC models. RBAC models also incorporate the features found in standard DAC
and MAC policies. Furthermore, an RBAC model includes tools for dealing with security risks that
arise during the execution of activities and processes, as well as for establishing user-defined and
organisation-specific regulations. Another reason for RBAC models’ success is their ease of deployment
through the Internet.
In-band SQLi
The attacker uses the same communication channel to launch assaults and acquire information.
Because of its simplicity and efficiency, in-band SQLi is one of the most prevalent SQLi attacks. This
approach is divided into two sub-variations:
Error-based SQLi: The attacker takes steps that cause the database to generate error messages. The
data provided by these error messages could be used by the attacker to obtain knowledge about the
database’s structure.
Union-based SQLi: This technique makes use of the UNION SQL operator, which combines numerous
database select statements into a single HTTP response. This response may contain information
that the attacker can use.
HTTP response will change or remain unchanged depending on the outcome, so the attacker can determine whether the query evaluated to true or false.
Time-based: The attacker sends a SQL query that causes the database to wait (for a specified number of seconds) before responding. The attacker can tell whether the query evaluated to true or false based on how long the database takes to answer, because the HTTP response is generated either immediately or after the waiting period, depending on the result.
Out-of-band SQLi
This type of attack is only possible if certain functionalities on the database server used by the web
application are enabled. This type of attack is typically employed as a complement to in-band and
inferential SQLi attacks.
Out-of-band SQLi is used when an attacker can’t launch an attack and gather information over the
same channel or when a server is too slow or unstable to complete these activities. These methods rely
on the server’s ability to send DNS or HTTP requests to send data to an attacker.
Statistical queries are those that permit only aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE and STANDARD DEVIATION. Statistical queries are mostly used to obtain demographic statistics and to maintain employee databases in businesses and industries.
Flow control methods must ensure that only allowed flows are implemented, both explicit and implicit.
To maintain safe information transfers, a set of rules must be followed. Rules can be stated using flow
relations between classes and applied to data, describing the permissible flows within a system. (An information flow from A to B occurs when information associated with B is affected by information associated with A; the flow results from operations that transfer information from one object to another.) These relations can describe, for a given class, the set of classes through which information
(classified in that class) can flow or they can specify the exact relations that must be checked between
two classes in order for information to flow from one to the other. In general, flow control mechanisms
establish controls by giving a label to each item and identifying the object’s security class. The flow
relations specified in the model are then validated using labels.
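As a small sketch of label-based flow control, suppose each object carries a label from a totally ordered set of security classes (real systems may use a richer lattice of classes); a flow from object A to object B is then permitted only if B's class is at least as high as A's. The class names and object labels below are assumptions for illustration.

# A minimal sketch of checking permissible information flows between classes.
LEVEL = {"unclassified": 0, "confidential": 1, "secret": 2, "top_secret": 3}

object_label = {
    "bulletin_board": "unclassified",
    "payroll": "confidential",
    "mission_plan": "secret",
}

def flow_allowed(src_obj, dst_obj):
    """Information may flow from src_obj into dst_obj only if the destination's
    class is at least as high as the source's class."""
    return LEVEL[object_label[dst_obj]] >= LEVEL[object_label[src_obj]]

print(flow_allowed("payroll", "mission_plan"))        # True: confidential -> secret
print(flow_allowed("mission_plan", "bulletin_board")) # False: secret -> unclassified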
channels and storage channels. The difference between the two is that information in a timing channel is
transmitted by the timing of events or processes, whereas information in a storage channel is conveyed
by accessing system information or what would otherwise be inaccessible to the user.
As a simple example of a covert channel, consider a distributed database system with two nodes whose user security levels are secret (S) and unclassified (U). To commit a transaction, both nodes
must agree to commit. They can only do activities that are consistent with the *-property, which indicates
that the S site cannot write or send information to the U site during any transaction. However, if these
two sites conspire to establish a covert channel between them, the U site may commit a transaction
involving secret data unconditionally, but the S site may do so in some predefined agreed-upon way,
allowing certain information to be passed from the S site to the U site, violating the *-property. This is
possible if the transaction is repeated, but the S site’s activities implicitly send information to the U site.
Locking, as discussed in the earlier unit on transaction processing, prevents the concurrent writing of information into the same objects by users with different security levels, hence preventing storage-type covert channels. Operating
systems and distributed databases give control over the multiprogramming of processes, allowing
resource sharing without the chance of one program or process encroaching on another’s memory or
other system resources, eliminating timing-oriented hidden channels. Covert channels, in general, are
not a serious issue in well-implemented strong database solutions. Certain strategies, however, may be
devised by astute users to communicate information implicitly.
Some security experts feel that one approach to eliminate hidden channels is to prevent programmers
from gaining access to sensitive data that a program would handle after it has been launched. A bank
programmer, for example, does not need to know the identities or balances in depositors’ accounts.
Brokerage company programmers do not need to know what buy and sell orders their clients have.
Access to a form of actual data or some sample test data during program testing may be justified, but
not once the software has been adopted for regular usage.
Database security refers to a set of techniques intended to protect database management systems
against hostile cyber-attacks and unauthorised use.
Database authentication is a category of database security that verifies a user's login credentials against those stored in the database.
Row-level access control allows for the implementation of sophisticated access control rules by
examining the data row by row.
A Label Security policy is a policy that is defined by an administrator.
The Directory Services Markup Language (DSML) is an XML-based representation of directory
service information.
RBAC evolved fast as a validated technique for managing and enforcing security in large-scale
enterprise-wide systems.
SQL injection, often known as SQLI, is a typical attack vector in which malicious SQL code is used to
manipulate backend databases.
Statistical databases are databases that hold information about a large number of people and are
mostly used to generate statistics on diverse groups.
Flow control governs the distribution or flow of information among items that are accessible.
A covert channel enables the transmission of information that breaches security or policy.
4.10 GLOSSARY
Database security: A set of techniques intended to protect database management systems against
hostile cyber-attacks and unauthorised use
Database authentication: A category of database security that verifies a user's login credentials against those stored in the database
Row-level access control: The implementation of sophisticated access control rules by examining
the data row by row
Label security policy: A policy that is defined by an administrator
Directory Services Markup Language (DSML): An XML-based representation of directory service
information
RBAC: A validated technique for managing and enforcing security in large-scale enterprise-wide
systems
SQL injection: A typical attack vector in which malicious SQL code is used to manipulate backend
databases
Statistical database: A database that holds information about a large number of people and is mostly used to generate statistics on diverse groups
Flow control: The distribution or flow of information among items that are accessible
Covert channel: The transmission of information that breaches security or policy
3. Many commercial DBMSs now employ the idea of row-level access control, which allows for the
implementation of sophisticated access control rules by examining the data row by row. Refer to
Section Mandatory Access Control (MAC)
4. SQL injection, often known as SQLI, is a typical attack vector in which malicious SQL code is used to
manipulate backend databases and get access to data that was not intended to be revealed. Refer to
Section SQL Injection
5. Flow control governs the distribution or flow of information among items that are accessible. When
a program receives data from X and writes values into Y, a flow between objects X and Y happens.
Refer to Introduction to Flow Control
https://www.itopstimes.com/itsec/creating-a-multi-layered-security-architecture-for-your-
databases/
https://dzone.com/articles/10-common-database-security-issues
Discuss the concept of data security with your friends and classmates. Also, try to find some real
world examples of data security.
UNIT
05 Storage and Indexing
Names of Sub-Units
Operation on Files, Hashing Techniques, Types of Single-Level Ordered Indexes, Multi-Level Indexes,
Dynamic Multilevel Indexes Using B Trees and B+ Trees, Indexes on Multiple Keys
Overview
This unit begins by discussing the concept of storage and indexing. Next, the unit discusses
operations on files, hashing techniques, types of single-level ordered indexes and multi-level indexes.
Further, the unit explains dynamic multilevel indexes using B-trees and B+-trees. Towards the end,
the unit discusses indexes on multiple keys.
Learning Objectives
Learning Outcomes
http://www.csbio.unc.edu/mcmillan/Comp521F16/Lecture16.pdf
5.1 INTRODUCTION
Files and documents are digitally recorded and kept in a storage system for future use, which is what
data storage entails. If necessary, storage systems may rely on electromagnetic, optical, or other media
to maintain and recover data. Data storage makes it simple to backup information for safekeeping and
speedy recovery in the case of a computer catastrophe or cyberattack.
Data can be stored physically on hard drives, disc drives, USB drives, or digitally on the cloud. The crucial
thing is ensuring your files are backed up and easily accessible in the event that your systems fail beyond
repair. In terms of data storage, some of the most significant variables to consider are dependability, the
robustness of security features, and the cost to construct and maintain the infrastructure. Examining
various data storage options and applications will assist you in making the right decision for your
company’s needs.
An index (IDX) is a data structure established on columns in a database table to significantly speed up
data retrieval operations in database systems. An index is a subset of a database table that is sorted
by key values. In the absence of an index, query languages such as SQL may have to traverse the whole
table from top to bottom to choose relevant rows. Indexes on database tables must be established with
caution since they take space and are kept in the database. Because the index must be updated when the
underlying table is modified, an index might cause write operations to take longer.
Indexes are created using various data structures such as B+ trees, balanced trees, and hashes. With these data structures, queries can locate the relevant rows without scanning the full database table. Indexes on functions or expressions can also be built. A unique index prevents duplicate entries in the index and the associated
table, functioning as a constraint. The arrangement of the columns is also a crucial part of indexing.
The first indexed column is always prioritised; row IDs are obtained using the first indexed columns. In
most cases, indexes do not deliver a value but rather find records.
Consider the following EMPLOYEE file: Name, SSN, Salary, Job_Code, and Department. A basic selection
criterion might be an equality comparison on a field value, such as (Ssn = ‘123456789’) or (Department
= ‘Research’). Other sorts of comparison operators, such as > or ≥ can be used in more complicated
situations, as in (Salary ≥ 30000). In general, the selection condition is an arbitrary Boolean expression
on the file’s fields.
File search processes are often based on basic selection criteria. The DBMS (or the programmer) must
break down a complicated condition to obtain a basic condition that may be utilised to find records on a
disc. Each found record is then verified to see if it meets the whole selection requirement. For example, we
may separate the simple condition (Department = ‘Research’) from the complicated condition ((Salary
≥ 30000) and (Department = ‘Research’)); each record satisfying (Department = ‘Research’) is found and
evaluated to determine if it also meets (Salary ≥ 30000).
When numerous file records meet a search requirement, the first record in the physical sequence of file
records is discovered and marked as the current record. Subsequent search operations begin with this
record and find the next record in the file that meets the requirement.
The actual processes for identifying and accessing file records differ from one system to the next. We
give a selection of sample operations below. These instructions are often used by high-level programs,
such as DBMS software applications, to access records, hence we refer to program variables in the
following descriptions:
Open: This operation prepares the file for reading or writing. It allocates sufficient buffers (usually at least two) to hold disc file blocks and retrieves the file header. It sets the file pointer to the beginning of the file.
Reset: Sets the file pointer of an open file to the beginning of the file.
Find (or Locate): Searches for the first record that satisfies a search condition. Transfers the record's
block into the main memory buffer (if it is not already there). The file pointer points to the buffer
record, which becomes the current record. Different verbs are sometimes used to specify whether
the found record is to be retrieved or modified.
Read (or Get): Copies the current record from the buffer to a user-defined program variable. This
operation may also advance the current record pointer to the next record in the file, necessitating
the reading of the next file block from the disc.
Find Next: Finds the next record in the file that satisfies the search condition. Transfers the record's
block into the main memory buffer (if it is not already there). The record is placed in the buffer and
is now the current record. In legacy DBMSs based on the hierarchical and network models, many
variations of Find Next are available (for example, Find Next record within a current parent record,
Find Next record of a specified kind, or Find Next record when a complicated condition is satisfied).
Delete: Deletes the current record and (eventually) updates the disc file to reflect the deletion.
Modify: Modifies some field values of the current record and (eventually) updates the file on the disc to reflect the modification.
Insert: Inserts a new record into the file by locating the block where the record is to be inserted, transferring that block into a main memory buffer (if it is not already there), writing the record into the buffer, and (eventually) writing the buffer back to disc to reflect the insertion.
Close: Completes the file access by releasing the buffers and doing any other clean-up procedures
that are required. Except for Open and Close, the preceding activities are known as record-at-a-time
operations since each one applies to a single record. It is possible to combine the procedures Find,
Find Next, and Read into a single operation called Scan, which is described below.
Scan: If the file has just been opened or reset, Scan returns the first record; otherwise, it returns the next record. If a condition is specified, the returned record is the first or next record satisfying the condition (a minimal sketch of Scan appears after this list). In database systems, a file can also be subjected to additional set-at-a-time higher-level operations.
The following are some examples:
Find all: Finds all of the records in the file that match a search criterion.
FindN: Finds the first record that satisfies a search condition and then continues to locate the next n – 1 records satisfying the same condition. The blocks holding the n records are
transferred to the main memory buffer (if not already there).
FindOrdered: Retrieves all of the records in the file in a specified order.
Reorganize: Starts the reorganisation process. As we shall see, some file organisations require periodic reorganisation. An example is reorganising the file records by sorting them on a specified field.
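A purely in-memory sketch of the record-at-a-time behaviour described above, combining Find, Find Next and Read into a Scan operation, is given below; buffering and block handling are omitted, and the record layout is a hypothetical example.

# A minimal in-memory sketch of Scan over an unordered (heap) file of records.
records = [
    {"Name": "Smith", "Dept": "Research", "Salary": 40000},
    {"Name": "Wong", "Dept": "Admin", "Salary": 55000},
    {"Name": "Zelaya", "Dept": "Research", "Salary": 25000},
]

class HeapFile:
    def __init__(self, recs):
        self.recs = recs
        self.pos = -1                      # file pointer; -1 means "before the first record"

    def reset(self):
        self.pos = -1

    def scan(self, condition=None):
        """Return the first/next record satisfying 'condition', or None at end of file."""
        self.pos += 1
        while self.pos < len(self.recs):
            rec = self.recs[self.pos]
            if condition is None or condition(rec):
                return rec                 # this becomes the current record
            self.pos += 1
        return None

f = HeapFile(records)
rec = f.scan(lambda r: r["Dept"] == "Research")
while rec is not None:
    print(rec["Name"])                     # Smith, then Zelaya
    rec = f.scan(lambda r: r["Dept"] == "Research")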
Unordered record files (heap files): Entries are stored in the file in the order in which they are
inserted in this simplest and most basic style of organisation, therefore fresh records are added at
the end of the file. Such a file, which organises records in an unordered manner, is called a heap or pile file. This
arrangement is frequently used in conjunction with extra access pathways. It is also used to collect
and save data records for future reference.
Inserting a new record is a fairly efficient operation. The file’s last disc block is copied into a buffer,
the new record is appended, and the block is then overwritten to disc. The file header contains the
address of the final file block. However, looking for a record using any search criterion necessitates
a linear search over the file block, which is a time-consuming technique. If just one record matches
the search requirement, the software will typically read into memory and search half the file blocks
before finding the record. On average, for a file of b blocks, this involves searching (b/2) blocks. If no
records or multiple records fulfill the search criterion, the software must read and search the whole
file’s b blocks.
To remove a record, software must first locate its block, copy it into a buffer, delete the record from
the buffer, and then rewrite the block to the disc. This frees up space on the disc block. This method of
deleting a large number of records wastes storage space. Another method for record deletion is to store an additional byte or bit, known as a deletion marker, with each record. A record is deleted by setting the deletion marker to a certain value; a different value for the marker indicates that the record is still valid (it has not been deleted). When executing a search, search programs consider only valid records
in a block. Both of these deletion methods need periodic file restructuring in order to recoup the
wasted space of deleted entries. The file blocks are read sequentially throughout the reorganisation,
and records are packed by eliminating deleted entries. Following such a restructuring, the blocks
are once again filled to capacity. Another option is to reuse the space left by deleted records for new insertions, although this requires extra bookkeeping to keep track of empty locations (a minimal sketch of deletion markers and reorganisation appears after this list).
For an unordered file, we may use either spanned or unspanned organisation, and it can be used
with either fixed-length or variable-length entries. Because the new record may not fit in its original
place on the disc, modifying a variable-length record may necessitate removing the old record and
inserting a modified record.
Ordered records file (sorted files): We can physically arrange the records in a file on the disc by
the values of one of its fields, known as the ordering field. This results in a file that is ordered or
consecutive. If the ordering field is also a key field of the file, that is, a field that is guaranteed to have
a unique value in each record, the field is referred to as the ordering key for the file.
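The deletion-marker technique mentioned for heap files can be sketched as follows; this is an in-memory illustration only, whereas on disc the marker would be an extra byte or bit stored with each record inside its block, and the record values are hypothetical.

# A minimal sketch of deletion markers and periodic reorganisation (packing).
VALID, DELETED = 0, 1

heap = [["Akers", VALID], ["Allen", VALID], ["Archer", VALID], ["Atkins", VALID]]

def delete(file, name):
    """Logical deletion: set the deletion marker instead of removing the record."""
    for rec in file:
        if rec[0] == name and rec[1] == VALID:
            rec[1] = DELETED
            return

def search(file, name):
    """Searches consider only valid (not deleted) records."""
    return any(rec[0] == name and rec[1] == VALID for rec in file)

def reorganise(file):
    """Periodic reorganisation packs the file by dropping deleted records."""
    return [rec for rec in file if rec[1] == VALID]

delete(heap, "Allen")
print(search(heap, "Allen"))      # False: marked as deleted
print(len(heap))                  # 4: space not yet reclaimed
heap = reorganise(heap)
print(len(heap))                  # 3: space reclaimed after reorganisation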
Internal hashing describes hashing for internal files, where hashing is typically implemented as a hash table using an array of records. Assume the array index range extends from 0 to M – 1; we then have M slots whose addresses correspond to the array indexes. We select a hash function that maps the hash field value to an integer between 0 and M – 1. The function h(K) = K mod M, for example, returns the
remainder of an integer hash field value K after division by M; this value is then utilised for the record
address. Figure 1 depicts the hashing techniques:
Figure 1: Internal hashing data structures. (a) An array of M positions for use in internal hashing. (b) Collision resolution by chaining records in an overflow space, where a null pointer is represented by –1 and an overflow pointer gives the position of the next record in the linked list.
Noninteger hash field values must be converted to integers before the mod function can be applied. For character strings, the numeric (ASCII) codes associated with the characters can be used in the transformation, for example by multiplying those code values together, for a hash field whose data type is a 20-character string. We will assume that the function code(c) returns the numeric code of a character c and that we are given a hash field value K of type K: array [1..20] of char (in Pascal) or char K[20] (in C).
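The two hashing steps just described can be sketched as follows: h(K) = K mod M for an integer key, and, for a character-string key, folding the numeric character codes into an integer (here by multiplying them modulo M, as in the transformation described above). The table size M = 7 is an arbitrary assumption for illustration.

# A minimal sketch of internal hashing: h(K) = K mod M, plus a string variant
# that multiplies character codes (mod M) to convert the key to an integer first.
M = 7   # number of slots in the hash table (arbitrary choice for illustration)

def h_int(key: int) -> int:
    return key % M

def h_str(key: str) -> int:
    temp = 1
    for ch in key:
        temp = (temp * ord(ch)) % M     # ord() returns the character's numeric code
    return temp % M

print(h_int(123456789))      # slot for an integer hash field such as an Ssn
print(h_str("Smith, John"))  # slot for a character-string hash field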
Figure: Matching bucket numbers (0 to M – 1) to disc block addresses.
For example, if employee numbers at a firm are assigned as 1, 2, 3,... up to the entire number of workers,
we may use the identity hash function to keep order. Unfortunately, this only works if keys are created
in the correct sequence by some program.
Because a set number of buckets M are assigned, the hashing algorithm described thus far is known as
static hashing. This can be a significant disadvantage for dynamic files. Assume we have M buckets for
the address space and m is the maximum number of records that can fit in one bucket. Then only (m *
M) records will fit in the available area. If the number of records is significantly less than (m * M), we are
left with a lot of wasted space. On the other hand, if the number of records exceeds (m * M), there will be
multiple collisions and retrieval will be hindered due to the large lists of overflow data.
In either instance, we may need to adjust the number of blocks M allocated and then redistribute
the records using a new hashing function (based on the new value of M). These reorganisations can be time consuming for large files. Newer hashing-based dynamic file organisations allow the
number of buckets to alter dynamically with just localised rearrangement, as shown in Figure 3:
Figure 3: Handling overflow for buckets by chaining: main buckets whose records point, via record pointers, to overflow buckets, with NULL pointers marking the end of each chain.
One significant disadvantage of the static hashing approach described above is that the hash address space is fixed, which makes it difficult to expand or shrink the file dynamically. The first approach, extendible hashing, stores an access structure in addition to the file and is therefore comparable to indexing in certain ways. The primary difference is that the access structure is based on the values produced by the hash function applied to the search field, whereas in indexing the access structure is based on the values of the search field itself. The second approach, known as linear hashing, does not require any additional access structures. Another approach, known as dynamic hashing, employs an access structure based on binary tree data structures.
Extendible hashing: In extendible hashing, a type of directory—an array of 2^d bucket addresses—is maintained, where d is the global depth of the directory. The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the directory, and the address in that entry identifies the bucket in which the corresponding records are stored. However, each of the 2^d directory locations does not have to correspond to a distinct bucket: several directory locations with the same first d′ bits in their hash values may share a single bucket if all the records that hash to these locations fit in that bucket. A local depth d′, stored with each bucket, specifies the number of bits on which the bucket contents are based. The figure below shows a directory with a global depth of d = 3.
The value of d can be increased or decreased by one at a time, thus doubling or halving the number of entries in the directory array. Doubling is needed if a bucket whose local depth d′ is equal to the global depth d overflows. Halving occurs if d′ < d for all the buckets after some deletions. Most record retrievals require two block accesses: one to the directory and one to the bucket.
To illustrate bucket splitting, assume that a newly inserted record causes an overflow in the bucket whose hash values begin with 01 (the third bucket in the figure). The records will be distributed between two buckets: the first will contain all records whose hash values begin with 010, and the second all records whose hash values begin with 011. The two directory locations 010 and 011 now point to the two new distinct buckets; before the split, they pointed to the same bucket.
The local depth d′ of the two new buckets is 3, which is one greater than the local depth of the old bucket. If a bucket that overflows and is split happens to have a local depth d′ equal to the global depth d of the directory, the size of the directory must be doubled so that an additional bit can be used to distinguish the two new buckets. For example, if the bucket for records whose hash values begin with 111 overflows, the two new buckets need a directory with global depth d = 4, because the two buckets are now labelled 1110 and 1111 and their local depths are also 4. The directory size is therefore doubled, and each of the other original directory locations is also split into two locations, both of which hold the same pointer value as the original location (a minimal sketch of the scheme follows the figure below).
Figure: Structure of the extendible hashing scheme — a directory with global depth d = 3 whose entries point to data file buckets with local depths d′ of 2 or 3 (for example, separate buckets for records whose hash values start with 000 and 001, and a shared bucket, with d′ = 2, for records whose hash values start with 01).
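A simplified sketch of the directory doubling and bucket splitting described above is given below; it assumes small non-negative integer keys, an 8-bit hash value whose high-order bits index the directory, and a bucket capacity of two records, all of which are illustrative choices rather than requirements of the scheme.

# A simplified sketch of extendible hashing: a directory of 2**d bucket
# addresses indexed by the first (high-order) d bits of an 8-bit hash value.
BITS = 8                      # width of the hash value (assumption)
CAPACITY = 2                  # records per bucket (assumption)

def hash_value(key):
    return key & ((1 << BITS) - 1)      # keys assumed to be small non-negative integers

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        return hash_value(key) >> (BITS - self.global_depth)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.records) < CAPACITY:
            bucket.records.append(key)
            return
        # Overflow: double the directory if the full bucket's local depth
        # equals the global depth, then split the bucket and reinsert.
        if bucket.local_depth == self.global_depth:
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        pointing = [i for i, b in enumerate(self.directory) if b is bucket]
        for i in pointing[len(pointing) // 2:]:   # upper half of those entries
            self.directory[i] = new_bucket        # now points to the new bucket
        old_records, bucket.records = bucket.records + [key], []
        for k in old_records:
            self.insert(k)

    def search(self, key):
        return key in self.directory[self._dir_index(key)].records

eh = ExtendibleHash()
for k in range(20):
    eh.insert(k)
print(all(eh.search(k) for k in range(20)))   # True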
hi+1, via a delayed splitting of buckets. There is no directory; all that is needed to determine which buckets have been split is a variable n, which is initially set to 0 and is incremented by 1 each time a split occurs. To retrieve a record with hash key value K, first apply the function hi to K; if hi(K) < n, then apply the function hi+1 to K instead, because that bucket has already been split. Initially n = 0, indicating that the function hi applies to all buckets; n grows linearly as buckets are split. When n = M after being incremented, all of the original buckets have been split and the hash function hi+1 applies to all records in the file. At this point n is reset to 0 (zero), and any new collision that causes an overflow leads to the use of a new hashing function hi+2(K) = K mod 4M. In general, a sequence of hashing functions hi+j(K) = K mod (2^j M) is used, where j = 0, 1, 2, ...; a new hashing function hi+j+1 is needed whenever all the buckets 0, 1, ..., (2^j M) – 1 have been split and n is reset to 0.
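Only the address calculation of linear hashing is sketched below (bucket splitting and overflow chains are omitted); the values of M, the split pointer n and the doubling count j are assumptions chosen purely for illustration.

# A minimal sketch of the linear hashing address calculation.
M = 4          # initial number of buckets (assumption)
n = 2          # split pointer: buckets 0 .. n-1 have already been split (assumption)
j = 0          # how many times the whole file has been doubled so far (assumption)

def bucket_address(K):
    """Apply hi(K) = K mod (2**j * M); if that bucket has already been split,
    apply hi+1(K) = K mod (2**(j+1) * M) instead."""
    addr = K % (2 ** j * M)
    if addr < n:
        addr = K % (2 ** (j + 1) * M)
    return addr

for K in (3, 5, 9, 13):
    print(K, "->", bucket_address(K))   # 3 -> 3, 5 -> 5, 9 -> 1, 13 -> 5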
Figure: Some blocks of an ordered (sequential) file of EMPLOYEE records, with Name as the ordering key field (for example, one block holds the names Akers through Atkins, and the last block holds Wood and Wyatt).
The index file for a primary index occupies significantly less space than the data file, for two reasons. First, there are fewer index entries than there are records in the data file. Second, each index entry is typically smaller than a data record because it has only two fields; consequently, more index entries than data records can fit in one block. A binary search on the index file therefore requires fewer block accesses than a binary search on the data file, as shown in Table 1:
Table 1: Average Access Times for a File of b Blocks under Basic File Organisations
Type of Organisation | Access/Search Method | Average Blocks to Access a Specific Record
Heap (unordered) | Sequential scan (linear search) | b/2
Ordered | Sequential scan | b/2
Ordered | Binary search | log2 b
Note that a binary search of the ordered data file requires approximately log2 b block accesses. However, if the primary index file contains only bi blocks, a record can be located through the index with far fewer accesses. The primary index on the ordering key field of the file is shown in Figure 7:
Figure 7: Primary index on the ordering key field (Name) of an EMPLOYEE file with fields Name, Ssn, Birth_date, Job, Salary and Sex; each index entry <K(i), P(i)> holds the Name value of a block anchor record and a pointer to its block.
Locating a record with a given key value requires a binary search of the index followed by an access to the block containing the record: a total of log2 bi + 1 block accesses.
A record with the primary key value K is located in the block with the address P(i), where K(i) ≤ K < K (i +
1). Because of the physical ordering of the file entries based on the primary key field, the ith block in the
data file contains all such records. To retrieve a record, we perform a binary search on the index file to find the appropriate index entry i, and then retrieve the data file block whose address is P(i). The sketch below illustrates the reduction in block accesses that can be obtained when the primary index is used to search for a record.
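To make the comparison concrete, the following worked sketch uses assumed figures, b = 3,000 data blocks and bi = 45 index blocks, which are illustrative values rather than numbers given in this unit.

# A worked sketch: block accesses with and without a primary index.
import math

b = 3000    # number of data file blocks (assumed for illustration)
bi = 45     # number of primary index blocks (assumed for illustration)

binary_search_data = math.ceil(math.log2(b))        # about 12 block accesses
with_primary_index = math.ceil(math.log2(bi)) + 1   # 7 block accesses
linear_search_avg = b / 2                           # 1,500 block accesses on average

print(binary_search_data, with_primary_index, linear_search_avg)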
The addition and deletion of records is a key issue with the main index, as it is with any ordered file.
With the main index, the problem is exacerbated since inserting a record in its right place in the data
file necessitates not only relocating records to make room for the new record but also changing certain
index entries, because shifting records changes the anchor records of some blocks. This problem can
be mitigated by using an unordered overflow file. Another option is to utilise a linked list of overflow
entries for each data file block. This is comparable to how hashing is used to cope with overflow records.
To optimise retrieval performance, records within each block and its overflow linked list can be sorted.
Deleted records are handled using deletion marks.
A clustering index is similarly an ordered file with two fields in which, the first is of the same type
as the data file’s clustering field, and the second is a disc block pointer. Each different value of the
clustering field has one item in the clustering index, which contains the value as well as a link to the first
block in the data file that has a record with that value for its clustering field. Because the data records
are physically ordered, record addition and deletion continue to cause issues. To relieve the insertion
difficulty, it is standard practice to reserve a whole block (or a cluster of contiguous blocks) for each
value of the clustering field; all entries with that value are inserted in the block (or block cluster). This
makes insertion and deletion quite simple.
A secondary index is an access structure on a key (unique) field that has a distinct value for each record. Such a field is commonly referred to as a secondary key; in the relational model, this would correspond to any UNIQUE key attribute or to the primary key attribute of a table. In this case, each
record in the data file has one index entry, which contains the value of the field for the record as well as
a reference to the block in which the record is placed or to the record itself. Hence, such an index is dense.
Figure 8 depicts a clustering index, and a dense secondary index is shown in a later figure:
Figure 8: A clustering index on the Dept_number ordering non-key field of an EMPLOYEE file; each distinct clustering field value has one index entry <K(i), P(i)> pointing to the first data block containing records with that value.
Again, we refer to the two field values of index entry i as <K(i), P(i)>. The entries are ordered by the value of K(i), so we can perform a binary search. Because the records of the data file are not physically ordered by the values of the secondary key field, we cannot use block anchors. As a result, an index entry is created for every record in the data file.
Figure: A clustering index with a separate block cluster for each group of records that share the same value for the clustering field; index entries <K(i), P(i)> hold the distinct clustering field values with block pointers, and NULL pointers mark unused block positions.
Whereas a primary index has one entry per data block (the block anchor), a dense secondary index has one entry per record and therefore applies to the whole file. The figure below shows a secondary index in which the pointers P(i) in the index entries are block pointers rather than record pointers. Once the appropriate disc block has been transferred to a main memory buffer, a search for the desired record within the block can be carried out.
Figure: A dense secondary index (with block pointers) on a non-ordering key field of a file; every record has an index entry, and several index entries may point to the same data block.
Because of the increased number of items, a secondary index often requires more storage space and a
longer search time than a primary index. However, the improvement in search time for an arbitrary
record is significantly higher for a secondary index than for the main index, since if the secondary index
did not exist, we would have to perform a linear search on the data file. For the primary index, in contrast, a binary search on the main data file could be used to locate a record even if the index did not exist.
The value bfri is known as the fan-out of the multilevel index and is denoted by fo. Whereas a binary search divides the record search space into two halves at each step, the multilevel index divides it fo ways at each search step. Searching a multilevel index therefore requires approximately (logfo bi) block accesses, which is a substantially smaller number than that required by a binary search when the fan-out is greater than 2, and in most cases the fan-out is much larger than 2. A multilevel index treats the index file, which we will now refer to as the first (or base) level of the multilevel index, as an ordered file with a distinct value for each K(i).
As a result, by treating the first-level index file as a sorted data file, we may generate a main index for
the first level; this index to the first level is referred to as the multilevel index’s second level. Because the
second level is the main index, we may utilise block anchors to have one item in the second level for each
block in the first level. Because all index entries are the same size (one field value and one block address),
the blocking factor bfri for the second level (and for all subsequent levels) is the same as for the first level. If the first level has r1 entries and the blocking factor (which is also the fan-out) for the index is bfri = fo, then the first level needs ⌈r1/fo⌉ blocks, which is therefore the number of entries r2 required at the second level of the index.
This process can be repeated for the second level. The third level, which is a primary index for the second level, has one entry for each second-level block, so r3 = ⌈r2/fo⌉. We need a second level only if
the first level requires more than one block of disc storage, and we need a third level only if the second
level requires more than one block. We can repeat this method until all of the entries of some index level
t fit in a single block.
This block at level t is called the top index level. Because each level reduces the number of entries at the preceding level by a factor of fo, the index fan-out, we can determine t from the condition 1 ≤ (r1/(fo)^t). Hence, a multilevel index with r1 first-level entries will have approximately t levels, where t = ⌈logfo(r1)⌉. A single disc block is retrieved at each level of the index, so an index search accesses t disc blocks, where t is the number of index levels.
The multilevel scheme described here can be used on any type of index, whether primary, clustering or secondary, as long as the first-level index has distinct values for K(i) and fixed-length entries. A multilevel index can, for example, be constructed over a primary index.
Figure 11 illustrates how using a multilevel index reduces the number of block accesses needed to search for a record; a worked sketch of the calculation follows the figure:
Figure 11: A two-level primary index: first-level index blocks containing ordered key values, and a second (top) level whose entries point to the first-level blocks.
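Continuing in the same spirit, the number of index levels t = ⌈logfo(r1)⌉ can be checked with a short calculation; the fan-out fo = 68 and the r1 = 30,000 first-level entries below are assumed, illustrative values.

# A worked sketch of the number of levels in a multilevel index.
import math

fo = 68        # fan-out: index entries per block (assumed for illustration)
r1 = 30000     # number of first-level index entries (assumed for illustration)

levels = []
entries = r1
while entries > 1:                 # build levels until one block suffices
    entries = math.ceil(entries / fo)
    levels.append(entries)

t = len(levels)
print("blocks per index level:", levels)                    # [442, 7, 1]
print("t =", t, "; ceil(log_fo(r1)) =", math.ceil(math.log(r1, fo)))
print("block accesses for one search: t + 1 =", t + 1)      # t index blocks plus the data block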
The subtree of a node n consists of n together with the subtrees of all of n's child nodes. The root node in Figure 12 is A, and its child nodes are B, C and D. The leaf nodes are E, J, C, G, H and K. Because the leaf nodes are located at different levels of the tree, this tree is said to be unbalanced. Figure 12 depicts the kind of tree structure used for dynamic multilevel indexes such as B-trees and B+-trees:
Figure 12: A tree data structure with root node A; nodes B, C and D at level 1; nodes E, F, G, H and I at level 2; and nodes J and K at level 3.
We start with search trees and then go on to B-trees, which may be used as dynamic multilevel indexes
to help direct the search for entries in a data file. B-tree nodes are kept between 50 and 100 percent filled,
and data block pointers are stored in both internal and leaf nodes of the B-tree structure. The B+-tree is a variant of the B-tree in which pointers to a file's data blocks are maintained only in leaf nodes, resulting in
fewer levels and higher-capacity indexes. The B+-tree structure is the most often used indexing structure
in today’s DBMSs.
A search tree of order p is a tree such that each node contains at most p – 1 search values and p pointers in the order <P1, K1, P2, K2, ..., Pq–1, Kq–1, Pq>, where q ≤ p. Each Pi is a pointer to a child node (or a NULL pointer), and each Ki is a search value from some ordered set of values. All search values are assumed to be unique.
Two constraints must hold at all times on the search tree:
Within each node, K1 < K2 < ... < Kq−1
For all values X in the subtree pointed at by Pi, we have Ki−1 < X < Ki for 1 < i < q; X < Ki for i = 1; and
Ki−1 < X for i = q
Whenever we look for a value X, we use the calculations in condition 2 above to follow the relevant
pointer Pi. Figure 13 shows a search tree of order p = 3 with integer search values. It should be noted that some of the pointers Pi in a node may be NULL pointers.
To look for entries in a disc file, we can use a search tree as a technique. The values in the tree can be
the contents of one of the file’s fields, known as the search field (which is the same as the index field if a
multilevel index guides the search). Each key value in the tree corresponds to a reference to the record in
the data file that contains that value. The pointer might also be to the disc block containing that record.
By designating each tree node to a disc block, the search tree may be saved on a disc. When a new record
is added to the file, we must update the search tree by adding an entry to the tree that contains the new
record’s search field value as well as a reference to the new record. Figure 13 depicts a node on a search
tree with pointers:
Figure 13: A search tree of order p = 3 with integer search values.
These restrictions can be relaxed: if the index is on a non-key field, duplicate search values may exist, and the node structure and the navigation rules of the tree must be modified accordingly.
Algorithms are required for adding and removing search values into and from the search tree while
adhering to the previous two requirements. In general, these techniques do not ensure that a search tree
is balanced, that all of its leaf nodes are at the same level. Because it has leaf nodes at levels 1, 2, and 3,
the tree is unbalanced. The following are the objectives for balancing a search tree:
To ensure that nodes are distributed equally, such that the depth of the tree is reduced for the given
set of keys and the tree is not skewed with some nodes at extremely deep levels.
To uniformly speed up the search, such that the average time to discover any random key is nearly
the same.
While limiting the number of levels in the tree is one aim, another implicit goal is to ensure that the
index tree does not require excessive restructuring when entries are added to and deleted from the main
file. As a result, we want the nodes to be as full as possible, and we don’t want any nodes to be empty as a
result of too many removals. When records are deleted, certain nodes in the tree may become practically
empty, wasting storage space and increasing the number of levels. The B-tree solves both of these issues
by imposing extra limits on the search tree.
5.6.2 B-Trees
Additional restrictions in the B-tree ensure that the tree is always balanced and that the space spent
by deletion, if any, is never excessive. However, in order to retain these limits, the insertion and
deletion algorithms grow increasingly complicated. Nonetheless, most insertions and deletions are
straightforward procedures; they get hard only in rare cases, such as when we try an insertion into an
already full node or a deletion from a node that leaves it less than half full. In more technical terms, a
B-tree of order p can be described as follows when used as an access structure on a key field to search
for entries in a data file:
Each internal node in the B-tree is of the form:
<P1,<K1, Pr1>, P2 ,<K2, Pr2>,…<Kq-1, Prq-1 , ..., , Pq >
where q ≤ p. Each Pi is a tree pointer—a pointer to another node in the Btree. Each Pri is a data
pointer8—a pointer to the record whose search key field value is equal to Ki (or to the data file block
containing that record).
Within each node, K1 < K2 < ... < Kq−1.
For all search key field values X in the subtree pointed at by Pi (the ith subtree), we have: Ki−1 < X < Ki for 1 < i < q; X < Ki for i = 1; and Ki−1 < X for i = q.
Each node has at most p tree pointers. Figure 14 depicts (a) a node in a B-tree with q-1 search values
and (b) a B-tree of order p = 3:
Figure 14: (a) A node in a B-tree with q − 1 search values and (b) a B-tree of order p = 3 (the values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6)
Except for the root and leaf nodes, each node contains at least ⌈p/2⌉ tree pointers. Unless it is the
sole node in the tree, the root node contains at least two tree pointers.
A node that holds q tree pointers, q ≤ p, has q – 1 search key field values (and hence q – 1 data
pointers).
All of the leaf nodes are on the same level. The structure of leaf nodes is the same as that of internal
nodes, except that all of their tree pointers Pi are NULL.
The above Figure 14 depicts a B-tree of order p = 3. Because we expected that the tree would be utilised
as an access structure on a key field, all search values K in the B-tree are unique. If we utilise a B-tree
on a non-key field, we must modify the definition of the file pointers Pri to refer to a block—or a cluster
of blocks—that contains the file record pointers. This additional degree of indirection is analogous to that used for secondary indexes.
A B-tree starts with a single root node (which is also a leaf node) at level 0 (zero). When the root node is filled
with p – 1 search key values and we try to put another item into the tree, it divides into two nodes at level
1. Only the middle value is retained in the root node, while the remaining values are distributed evenly
across the other two nodes. When a non-root node becomes full and a new entry is entered into it, the
node is split into two nodes at the same level, and the middle entry is relocated to the parent node along
with two pointers to the newly split nodes. If the parent node is full, it is divided as well. Splitting can
spread all the way to the root node, resulting in the formation of a new level if the root is split.
If a node becomes less than half-filled due to the deletion of a value, it is merged with its nearby nodes,
and this can even spread all the way to the root. As a result, deletion has the potential to lower the
number of tree levels. Analyses and simulations have revealed that after several random insertions
and deletions on a B-tree, the nodes are around 69 percent filled when the number of values in the
tree stabilises. This also applies to B+ trees. If this occurs, node splitting and combining will occur only
seldom, making insertion and deletion relatively efficient. If the number of values increases, the tree
will expand without issue though node splitting may occur, making certain insertions take longer. Each
B-tree node can have a maximum of p tree pointers, p data pointers, and p search key field values.
In general, a B-tree node may carry extra information required by the tree-manipulating algorithms,
such as the number of entries q in the node and a reference to the parent node.
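As a small worked example (the field and block sizes here are assumed for illustration and are not taken from the text): suppose the search key field is V = 9 bytes long, the disc block size is B = 512 bytes, a record (data) pointer is Pr = 7 bytes and a block (tree) pointer is P = 6 bytes. A B-tree node of order p holds at most p tree pointers, p − 1 data pointers and p − 1 key values, so we need (p × P) + ((p − 1) × (Pr + V)) ≤ B, that is, (6 × p) + (16 × (p − 1)) ≤ 512, or 22p ≤ 528. The largest such p is 24; in practice a slightly smaller order, say p = 23, may be chosen to leave room for the extra node information mentioned above.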
5.6.3 B+-Trees
The majority of dynamic multilevel index implementations employ a variant of the B-tree data structure
known as a B +-tree. Every value of the search field appears once in the tree, along with a data pointer,
in a B-tree. Data pointers are only kept at the tree’s leaf nodes in a B+-tree; hence, the structure of leaf
nodes differs from the structure of inside nodes. If the search field is a key field, the leaf nodes include
an entry for each value of the search field, as well as a data reference to the record (or to the block that
contains this record). The pointer for a non-key search field leads to a block containing pointers to data
file records, adding an extra degree of indirection. The leaf nodes of the B+-tree are often connected to
allow ordered access to the entries via the search field. These leaf nodes are analogous to an index’s
initial (base) level. The B+-tree’s internal nodes correspond to the other levels of a multilevel index. To aid
the search, some search field values from the leaf nodes are repeated in the interior nodes of the B+-tree.
The interior nodes of a B+-tree of order p have the following structure:
Each internal node is of the form
<P1, K1, P2, K2, ..., Pq−1, Kq−1, Pq>
where q ≤ p and each Pi is a tree pointer
Within each internal node, K1 < K2 < ... < Kq−1
For all search field values X in the subtree pointed at by Pi , we have Ki−1 < X ≤ Ki for 1 < i < q; X ≤ Ki
for i = 1; and Ki−1 < X for i = q
Each internal node has at most p tree pointers.
Each internal node, except the root, has at least ⌈p/2⌉ tree pointers. The root node has at least two
tree pointers if it is an internal node.
An internal node with q pointers, q ≤ p, has q − 1 search field values.
The structure of the leaf nodes of a B+-tree of order p, shown in Figure 15, is as follows. Each leaf node is of the form:
<<K1, Pr1>, <K2, Pr2>, ..., <Kq−1, Prq−1>, Pnext>
where q ≤ p, each Pri is a data pointer, and Pnext points to the next leaf node of the B+-tree. Figure 15 depicts (a) an internal node of a B+-tree with q − 1 search values and (b) a leaf node of a B+-tree with q − 1 search values and q − 1 data pointers:
Figure 15: (a) Internal node of a B+-tree with q − 1 search values and (b) leaf node of a B+-tree with q − 1 search values and q − 1 data pointers
Within each leaf node, K1 ≤ K2 ≤ ... ≤ Kq−1, q ≤ p.
Each Pri is a data pointer that leads to the record with the search field value Ki or to a file block that
contains the record (or to a block of record pointers that point to records whose search field value is
Ki if the search field is not a key).
At least ⌈p/2⌉ values are present in each leaf node.
All of the leaf nodes are on the same level.
Except for the Pnext pointer, which is a tree pointer that refers to the next leaf node, pointers in internal nodes are tree pointers to blocks that are tree nodes, whereas pointers in leaf nodes are data pointers to data file records or blocks. It is feasible to traverse the leaf nodes as a linked list by beginning at the leftmost leaf node and following the Pnext pointers. This allows for more organised access to the data entries on
the indexing field. It is also possible to provide a Pprior pointer. Because a B+-tree on a non-key field requires an extra degree of indirection, the Pr pointers are block pointers to blocks that contain a set of record pointers to the actual records in the data file.
Because entries in a B+-tree’s internal nodes include search values and tree pointers but no data pointers,
a B+-tree’s internal node can hold more entries than a similar B-tree. As a result, given the same block
(node) size, the order p of the B+-tree is greater than that of the B-tree, as illustrated. This can result in
fewer B+-tree levels, reducing search time. Because the structures of internal and leaf nodes in a B+-tree differ, their orders may also differ. We will use p to denote the order of the internal nodes and pleaf to denote the order of the leaf nodes, which we define as the maximum number of data pointers in a leaf node.
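Continuing the assumed sizes from the earlier B-tree example (V = 9, B = 512, Pr = 7 and P = 6 bytes): an internal B+-tree node holds p tree pointers and p − 1 key values, so (p × 6) + ((p − 1) × 9) ≤ 512, or 15p ≤ 521, giving p = 34. A leaf node holds pleaf key/data-pointer pairs plus one Pnext pointer, so (pleaf × (9 + 7)) + 6 ≤ 512, giving pleaf = 31. The internal order 34 is noticeably larger than the order 23 obtained for the B-tree under the same assumptions, which is why a B+-tree for the same block size usually has fewer levels.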
Consider an EMPLOYEE file with the characteristics Dno (the department number), Age, Street, City,
Zip code, Salary, and Skill code, as well as the key SSN (Social Security number). Consider the following
question: List the employees in Department 4 whose age is 59. It’s worth noting that Dno and Age are both
non-key properties, which means that a search for any of these will return several results. Alternative
search techniques to examine include:
Assuming Dno has an index but Age does not, use the index to retrieve the records with Dno = 4 and
then choose the records that fulfill Age = 59 from among them.
Alternatively, if Age is indexed but Dno is not, use the index to find the records with Age = 59 and then
choose the records that fulfil Dno = 4 from among them.
If indexes on Dno and Age have been constructed, both indexes may be utilised; each delivers a set
of data or a set of pointers (to blocks or records). When these two sets of records or pointers are
intersected, the records or pointers that fulfil both requirements are returned.
All of these options finally provide the correct result. However, if the number of records that fit either criterion (Dno = 4 or Age = 59) is considerable, but only a few records satisfy the combined requirement, then none of the above techniques is efficient for this search request. There are several options for using the combination <Dno, Age> or <Age, Dno> as a search key made up of several attributes. In the sections that follow, we will go through these strategies in more detail. Keys made up of several attributes will be referred to as composite keys.
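For instance, a composite index on the pair of attributes could be declared as in the sketch below (the index name is invented for illustration; the syntax is standard SQL):
CREATE INDEX idx_emp_dno_age ON EMPLOYEE (Dno, Age);

SELECT SSN, Dno, Age
FROM EMPLOYEE
WHERE Dno = 4 AND Age = 59;   -- both conditions can be evaluated using the composite index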
In the lexicographic ordering of a composite key such as <Dno, Age>, all keys with a smaller value of the first attribute come first; among the keys with Dno = 4, the entries are then ordered by ascending Age, and so on. Lexicographic ordering works in the same way as character string ordering does.
5.9 GLOSSARY
Dynamic hashing: A forerunner of extendible hashing, in which the addresses of the buckets were either the n high-order bits or the n + 1 high-order bits of the hash value
Linear hashing: It allows a hash file to automatically increase and reduce its number of buckets
without the requirement for a directory
Primary index: An ordered file with fixed-length records and two fields that serve as an access
structure to quickly search for and retrieve data records in a data file
Clustering index: An ordered file with two fields, in which the first is of the same type as the data file's clustering field and the second is a disc block pointer
Secondary index: A secondary method of accessing a data file that already has a primary method
of access
Search tree: A form of a tree that guides the search for a record based on the value of one of the
record’s fields
B+ tree: A variant of the B-tree data structure that is employed by the majority of dynamic multilevel index implementations
4. If file entries are physically sorted on a non-key field with no distinct value for each record, that field
is referred to as the clustering field, and the data file is referred to as a clustered file. Refer to Section
Types of Single-Level Ordered Indexes
5. The majority of dynamic multilevel index implementations employ a variant of the B-tree data
structure known as a B +-tree. Refer to Section Dynamic Multi-Level Indexes Using B Trees and B+
Tree
http://rajaghoshtech2.blogspot.com/2010/03/advantages-and-disadvantages-of-hashing.html
http://www.differencebetween.info/difference-between-b-tree-and-b-plus-tree
Discuss with your friends and classmates about the concept of indexing. Also, discuss the importance
of hashing techniques.
UNIT
06 Object-Relational Databases
Names of Sub-Units
Introduction, Overview of Object Relational Database Concepts, Object Relational Features, Object
Database Extensions to SQL, Object Oriented Models
Overview
This unit begins by discussing about the concept of object relational database. Next, the unit discusses
the overview of object database concepts and object relational features. Further the unit explains the
object database extensions to SQL. Towards the end, the unit discusses the object oriented models.
Learning Objectives
Learning Outcomes
https://www.cs.uct.ac.za/mit_notes/database/pdfs/chp16.pdf
6.1 INTRODUCTION
A Database Management System (DBMS) that combines a relational database (RDBMS) and an object-
oriented database (OODBMS) is known as an object-relational database (ORD). ORD’s schemas and
query language support the core components of any object-oriented database model, such as objects,
classes, and inheritance.
ORD is referred to as the “middleman” between relational and object-oriented databases because it
combines traits and characteristics from both models. Since the data is kept in a typical database and
processed and accessed via queries defined in a query language like SQL, the basic method in ORD is
RDB. ORD, on the other hand, has an object-oriented feature in that it is regarded an object store, which
is typically used for software built in an object-oriented programming language. APIs are used to store
and access data as objects in this case.
Traditional RDBMS products focus on the effective organisation of data produced from a small number
of data sources. An ORDBMS, on the other hand, offers a capability that allows developers to create and
innovate their own data types and techniques for use in the DBMS. ORDBMS hopes to achieve this by
allowing developers to view the problem area at a higher level of abstraction.
An object normally consists of two parts: its state (value) and its behaviour (operations). It can have a
complicated data structure as well as operations that the programmer defines. Objects in an OOPL exist
just during programme execution; hence they are referred to as transitory objects. An OO database can
prolong the existence of objects by storing them permanently in a database, transforming them into
persistent objects that survive after programme termination and can be accessed and shared by other
applications. In other words, OO databases permanently retain persistent items in secondary storage
and allow them to be shared by various programmes and applications. This necessitates the inclusion of
other well-known database management system capabilities, such as indexing techniques to efficiently
locate the objects, concurrency control to allow object sharing across concurrent processes, and failure
recovery. An OO database system will frequently interact with one or more OO programming languages
to offer persistent and distributed object capabilities.
In OOPLs, an object’s internal structure is defined by instance variables, which hold the values that
make up the object’s internal state. In the relational model, an instance variable is analogous to the
idea of an attribute, except that instance variables might be wrapped within the object and so are not
always available to external users. Instance variables can also have data types that are arbitrarily
complex. Object-oriented systems allow you to specify operations or functions (behaviour) that can be
applied to specified types of objects. In fact, some OO models require that all activities that a user may
do on an object be specified. This necessitates the complete encapsulation of objects. For two reasons, most
OO data models have eased this rigorous approach. To begin, database users frequently need to know
the attribute names in order to define selection criteria on the attributes in order to get certain objects.
Second, because perfect encapsulation implies that every basic retrieval requires a specified process, ad
hoc searches are difficult to express on the fly.
An operation is divided into two stages to facilitate encapsulation. The first portion, known as the
operation’s signature or interface, defines the operation name and arguments (or parameters). The
second part, known as the method or body, specifies the operation’s implementation, which is typically
written in a general-purpose programming language. Operations are triggered by sending a message
to an object that includes the operation name and arguments. The method for that action is then
executed by the object. This encapsulation allows for the alteration of an object’s internal structure as
well as the implementation of its operations without disrupting the external programs that execute
these activities. As a result, encapsulation enables data and operational independence.
Type and class hierarchies, as well as inheritance, are important concepts in OO systems. This enables
the definition of new types or classes that inherit a large portion of their structure and/or operations
from previously specified types or classes. This makes it easy to iteratively construct a system’s data
types and reuse existing type definitions when generating new sorts of objects.
One issue in early OO database systems was describing connections between objects. The early OO data
model’s stress on total encapsulation led to the idea that relationships should not be overtly represented
but rather expressed by creating suitable functions that find related objects. However, this strategy
may not work well for complicated datasets with many relationships since it is necessary to identify and
display these links to users.
Another OO notion is operator overloading, which refers to the capacity of an operation to be applied
to several sorts of objects; in such a case, an operation name may refer to numerous separate
implementations, depending on the type of object to which it is applied. This characteristic is also known
as operator polymorphism. For example, the technique (implementation) of calculating the area of a
geometric object may change depending on whether the object is a triangle, circle, or rectangle. This
may need the usage of runtime late binding of the operation name to the proper method after the type
of object to which the operation is applied is known.
Some early OO data models required that everything be represented as an object, from a simple value to
a sophisticated object; hence, every basic value, such as an integer, string, or Boolean value, had an OID.
This allows two identical basic values to have distinct OIDs, which is advantageous in some situations.
For example, the integer number 50 can sometimes be used to represent weight in kilos, and other times
it might represent a person’s age. Then, two basic objects with unique OIDs may be generated, but both
would represent the integer value 50. Although valuable as a theoretical model, this is not particularly
practical because it generates an excessive number of OIDs. As a result, most OO database systems
support the representation of both objects and literals (or values). An immutable OID is required for every object, whereas a simple literal value has no OID and simply stands for itself. As a result, a literal value is typically stored within an object and cannot be referenced from other objects. In many systems, complex structured literal values can be created if needed without having a corresponding OID.
One of the primary aims of the object-relational data model is to bridge the gap between relational databases and the object-oriented approaches used in many programming languages such as C++, C#, Java, and others.
Many aspects of software development, including database systems, are being influenced by object-oriented technology. Support for XML and a few object-oriented ideas was added to SQL in the 1999 and 2003 standards. The data model that these specifications use is known as the object-relational data model. This model extends the traditional concepts of the relational data paradigm with certain ideas from the object data model, including the features that enable the construction of classes, objects and class hierarchies.
The object-relational data model is not a complete object data model because the top-level entries in the database are confined to being tables. Ordinary tables can have attribute values that are objects, while typed tables contain only objects. Oracle servers implement certain object-relational characteristics and can therefore be utilised when object-oriented application projects with a database component are designed.
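As a brief, hedged illustration of these object-relational features, the sketch below uses SQL:1999-style syntax as supported by PostgreSQL (the type, table and column names are invented for illustration; Oracle, for instance, would use CREATE TYPE ... AS OBJECT instead):
CREATE TYPE address_t AS (
    street  VARCHAR(40),
    city    VARCHAR(30),
    zipcode VARCHAR(10)
);

CREATE TYPE customer_t AS (
    cust_name VARCHAR(40),
    addr      address_t
);

-- An ordinary table with a column whose values are of a user-defined type:
CREATE TABLE customer_accounts (
    id   INTEGER PRIMARY KEY,
    info customer_t
);

-- A typed table: every row is an object of type customer_t:
CREATE TABLE customers OF customer_t;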
Data and their interactions are contained in a single structure in an Object Oriented Data Model, which
is referred to as an object. Real-world problems are represented as objects with various qualities in this
model. Objects are connected to one another through various relationships. It is essentially a blend of Object Oriented
Programming and Relational Database Model.
Hierarchy: The hierarchy depicts the order in which objects in a system are arranged. It also
describes the link between the many components of a system. A class of the hierarchy is made up
of several characteristics and functions. A hierarchy class is made up of a base (parent) class and
derived classes (subclass). The properties of a parent class are passed down to a derived class. A
class can be made of inter-related sub-classes, which in turn can have sub-classes until the smallest
level of components is achieved.
Modularity: Modularity refers to the division of a program into components or modules in order to
lessen the complexity of the problem. Modularity occurs in large and strong programs with several
classes. The modules aid in the management of complexity. Modularity focuses on implementation and is closely linked to encapsulation: it may be thought of as a way of breaking an encapsulated abstraction down into simple components. This occurs after a system has been partitioned into modules.
ODL is an object-oriented data definition language standardised by the ODMG (Object Data Management Group). It is an extension of IDL (Interface Description Language), which is a component of CORBA (Common Object Request Broker Architecture), the standard for networked OO computing. Object identity (OID) is a departure from the E/R model; attributes, relationships and methods are examples of ODL properties. The EYEDB Object Definition Language (ODL) is a specification language for defining object type specifications based on the ODMG ODL (but not compliant with it).
ODL is not meant to be a full-fledged programming language. It is a language for defining object
specifications. Traditionally, database management systems give tools for data definition (through a
Data Definition Language (DDL)). The Data Definition Language (DDL) allows users to define data types
and interfaces, while the Data Manipulation Language (DML) allows users to create, remove, read, and
update instances of the data types. ODL is a kind of DDL for objects. It defines type features, such as
attributes and operations. ODL merely specifies the signatures of operations; it does not address the definitions of the methods that implement those operations. The operations specified in ODL can be used in OQL.
The goal of ODL is to specify object types that may be implemented in a range of computer languages.
As a result, ODL is not bound by the syntax of any single programming language.
Several characteristics that distinguish EYEDB ODL from ODMG ODL are as follows:
ODMG ODL defines class attributes, relationships, method signatures and keys. Nested classes, typedef constructions, constant definitions and exception hints are all supported.
The EYEDB ODL specification specifies class attributes, relationships, method signatures, attribute
restrictions (not null, unique, collection cardinality), index specifications and trigger declarations.
Nested classes, typedef constructions, constant declarations and exception hints are not supported.
In EYEDB ODL, an instance of any type can be either a literal or an object. In ODMG ODL, this property is tied to the type: interfaces and classes are objects, while all fundamental types and user-defined structs are literals. In EYEDB ODL, an instance of any type, including the fundamental types, can be an object.
Finally, using EYEDB ODL you can specify whether a method is run by the server or the client, as well as whether it is a class or an instance method.
Assuming this discussion applies to the Relational Database Management System (RDBMS) paradigm, there are three common design stages, as follows:
Design of an abstract database: This approach necessitates the creation of a data model of all of
the information in a system. The resultant entity-relationship diagram (ERDM) will reflect the real-
world domain that the database will describe, and it will be evaluated and examined for redundancy
to see whether it fits user/business needs.
Design of a logical database: The design of the database from the ERDM is part of this procedure.
This is frequently accomplished through a process of data normalization (typically to the third
normal form) and entity modeling to construct the tables and relationships that will comprise the
database’s design (or schema). This step does not examine which implementation of the RDBMS
paradigm will be used, such as MySQL, SQL Server, DB2, and so on, or which hardware platform the
database would sit on.
Design of the physical database: The database’s physical design describes the physical setup of
the database in the chosen RDBMS implementation. This stage entails defining the basic relations,
file organizations, indexes of primary or foreign keys in tables, entity, and referential integrity
requirements, and security measures, among other things.
A database’s conceptual design refers to the first step of the design process, which encompasses
design interactions, experience, and strategies.
Conceptual design involves more than merely expressing small portions of the application in terms of ER (Entity Relationship) diagrams. For a big corporation, the design will necessitate the efforts of more than one designer and span data and application code utilised by a variety of user groups. Using a high-level, semantic data model such as ER diagrams for conceptual design in such an environment has the added benefit that the high-level design can be diagrammatically represented and easily understood by the many individuals who must contribute input to the design process.
The approach used to structure the creation of the overall design and to guarantee that the design
takes into consideration all user needs and is consistent is an essential part of the design process.
A possible method would be to create a distinct conceptual schema for each user group and then merge these conceptual schemas. To integrate diverse conceptual schemas, we must build correspondences
between entities, relationships, and characteristics and resolve a variety of conflicts (e.g., naming
conflicts, domain mismatches, and differences in measurement units).
The ODMG query language OQL was purposefully intended to have a syntax similar to SQL in order to
make it simple for SQL users to learn OQL. Let us begin with a query that finds pairings of movies and
theatres in which the movie is displayed and the theatre is showing more than one movie:
SELECT mname: M.movieName, tname: T.theaterName
FROM Movies M, M.shownAt T
WHERE T.numshowing() > 1
The SELECT clause specifies how we might name the fields in the result; the two result fields are referred
to as mname and tname. The FROM clause is the component of this query that differs from SQL. The
variable M is bound to each movie in the collection Movies in turn. For a given movie M, we bind the variable T to each theatre in the collection M.shownAt in turn. As a result, we can easily express a nested query by using the path expression M.shownAt. The query below demonstrates the grouping
construct in OQL:
SELECT T.ticketPrice,
avgNum: AVG(SELECT P.T.numshowing() FROM partition P)
FROM Theaters T
GROUP BY T.ticketPrice
We build a collection of theatres with that ticket price for each ticket price. This set of theatres is referred
to as the partition for that ticket price and is denoted by the OQL term partition. In the SELECT clause,
we compute the average number of movies shown at theatres in the partition for each ticket price. OQL
has an intriguing version of the grouping operator that SQL does not have:
SELECT low, high,
avgNum: AVG(SELECT P.T.numshowing() FROM partition P)
FROM Theaters T
GROUP BY low: T.ticketPrice <5, high: T.ticketPrice >=5
The GROUP BY clause now produces only two divisions, low and high. Each theatre object T is assigned to one of these divisions depending on the price of its ticket. In the SELECT clause, low and high are
boolean variables, one of which is always true in each given output tuple; partition is instantiated to the
relevant partition of theatre objects. We obtain two result tuples in our case. One of them has low equal
to true and avgNum equal to the average number of movies played in low-cost theatres. The second
tuple has high equal to true and avgNum equal to the average number of movies shown at high-priced
cinemas. The following query demonstrates OQL capability for collections other than set and multiset:
(SELECT T.theaterName
FROM Theaters T
ORDER BY T.ticketPrice DESC) [0:4]
The ORDER BY clause produces a list of theatre names sorted by ticket price. A list's items can be referred to by position, beginning at position 0. As a result, the expression [0:4] returns a list of the names of the five theatres with the highest ticket prices.
Figure: An example ER model for a student database, with entity sets such as Student, Department, Events and Sessional, an Attempts relationship, and attributes including Student Name, Surname, Enrollment No, DOB, Credit Score, Year, Semester, Branch, University, Internal Marks and External Marks.
The many-to-many Attempts relationship links a student to a department; a department can exist without a student, and a student can enrol without attempting any events, so participation is not total.
There are properties to capture the Year and Semester, as well as the Marks, when a student attempts
a course.
A database management system (DBMS) that combines a relational database (RDBMS) and an
object-oriented database (OODBMS) is known as an object-relational database (ORD).
A data model based on object-oriented programming that associates methods (procedures) with
objects that can benefit from class hierarchies is known as an object data model.
Data and their interactions are contained in a single structure in an Object Oriented Data Model,
which is referred to as an object.
Encapsulation is performed by concealing information or data in order to minimize complexity and
promote reusability.
The ODL is independent of any programming language and is designed to support the semantic
constructs of the ODMG object model.
The Data Definition Language (DDL) allows users to define data types and interfaces.
Data Manipulation Language (DML) allows users to create, remove, read, and update instances of
the data types.
The query language proposed for the ODMG object model is the object query language OQL.
6.8 GLOSSARY
1. A database management system (DBMS) that combines a relational database (RDBMS) and an
object-oriented database (OODBMS) is known as an object-relational database (ORD). Refer to
Section Introduction
2. A data model based on object-oriented programming that associates methods (procedures) with
objects that can benefit from class hierarchies is known as an object data model. Refer to Section
Object Oriented Models
3. The ODL is independent of any programming language and is designed to support the semantic
constructs of the ODMG object model. Refer to Section Object Oriented Mod
4. It is intended to interact closely with programming languages that have an ODMG binding, such
as C++, Smalltalk, and Java. As a result, an OQL query embedded in one of these programming
languages can return objects that match that language’s type system. Refer to Section Object
Oriented Models
5. Row types, collection types, user-defined types (UDTs), typed tables and reference types are all object-relational extensions to SQL. Refer to Section Object Database Extensions to SQL
https://asolanki.co.in/wp-content/uploads/2019/02/Fundamentals_of_Database_Systems_6th_
Edition-1.pdf
https://vivadifferences.com/12-differences-between-object-oriented-database-and-object-
relational-database-plus-similarities/
Discuss with your friends and classmates the concept of Object Relational Database (ORD). Also,
discuss about the different types of object data models.
UNIT
07 Query Processing and Optimization
Names of Sub-Units
Overview
This unit begins by discussing about the concept of query processing and optimization. Next, the unit
discusses the measures of query cost, the selection operation and an overview of query optimization. Further, the unit explains the transformation of relational expressions and estimating statistics of expression
results. Towards the end, the unit discusses the choice of evaluation plans.
Learning Objectives
Learning Outcomes
https://clas.iusb.edu/computer-science-informatics/research/reports/TR-20080105-1.pdf
7.1 INTRODUCTION
Query processing and optimization are essential components of every DBMS. To be used efficiently,
query results must be accessible in the timescale required by the submitting user, which might be a
human, a robotic assembly machine, or even another unique and independent DBMS.
The following are the steps required in query processing and optimization:
The query execution plan, also known as the query evaluation plan, is a set of primitive actions that
can be used to assess a query.
The query processing engine accepts a query evaluation plan, runs it, and outputs the desired results. Different execution plans for a given query can have varying costs, depending mainly on the number of disc accesses required. It is the system's responsibility to create a query evaluation plan that minimises query evaluation cost; this process is called query optimization.
A query expressed in a high-level query language is first scanned, parsed and validated. In the text of the query, the scanner detects SQL keywords, attribute names and relation names.
The parser examines the syntax to see if the query is written in accordance with the query language’s
syntactic guidelines.
Finally, the query is validated by ensuring that all attribute and relation names are valid and semantically meaningful names in the schema of the database being queried.
The query is then represented internally as a tree or a graph, referred to as the query tree or query
graph.
A query tree is the internal representation of an SQL query when it is translated into relational algebra; if relational calculus (TRC or DRC) is used instead, the internal representation is a query graph. A query typically has many different execution strategies, and query optimization is the process of selecting the most cost-effective one.
Figure 1: Steps in query processing — a query in a high-level language is handled by the parser and translator, which produce an internal representation of the query; the query optimizer, consulting the DBMS catalog, generates an execution plan, which is then carried out by the query evaluation engine.
The cost of a query in a database management system can be calculated by establishing a framework that can generate multiple plans for a query. This is usually accomplished by analysing every possible plan in terms of its estimated cost. When calculating the net estimated cost of any plan, the cost of each operation within the plan should be determined and aggregated to obtain the net estimated cost of the query evaluation plan.
Selection Operation
In most cases, the file scan performs the selection operation. The search algorithms that are utilised to
locate and retrieve data are known as file scans. In query processing, it is the lowest-level operator.
A file scan in an RDBMS (relational database management system) reads a relation only if the entire relation is stored in one file. When performing a selection operation on a relation whose tuples are contained in a single file, the following algorithms are used:
Linear search: In a linear search, the system checks each record to see if it meets the specified
criteria. An initial seek is required to access the first block of a file. If the blocks in the file are not
stored in the same order, more seeks are required. However, linear search is the slowest search
method, yet it can be employed in a variety of situations. The nature of the selection, the presence of
indices, and the file sequence are irrelevant to this procedure. Other algorithms, on the other hand,
aren’t appropriate in all situations.
Index scans are search algorithms that use an index; such index structures are called access paths, because the data in the file can be located and accessed through them. The following algorithms use an index in query processing (a short SQL sketch of the corresponding index definitions follows this list):
Primary index, equality on a key: To make the selection, we use the index to choose a single record
that meets the equality criteria. The equality check is done on the key attribute with the primary key.
Primary index, equality on non-key: The difference between equality on a key and equality on a non-key is that we may retrieve several records in the latter case. When the selection criterion specifies an equality comparison on a non-key field, several entries can be retrieved using the primary index.
Secondary index, equality on key or non-key: The secondary index can be used in a selection that
defines an equality criterion. We can obtain a single record when equality is on key or several records
when equality is on a non-key using the secondary index strategy. The time cost of retrieving a single record is equal to that of the primary index. In the case of multiple records, the records may be stored on various blocks. This means that each fetched record requires one I/O operation, which includes
a seek and a block transfer.
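The following is a minimal SQL sketch of the access paths described above (the Emp table and index names are invented for illustration; most DBMSs build the primary index automatically for the primary key):
CREATE TABLE Emp (
    EmpId   INTEGER PRIMARY KEY,  -- primary key gives a primary index: equality on key returns one record
    EmpName VARCHAR(40),
    DeptId  INTEGER               -- non-key field
);

-- Secondary index on a non-key field; equality on DeptId may return many records
CREATE INDEX idx_emp_deptid ON Emp (DeptId);

SELECT * FROM Emp WHERE EmpId = 10;   -- primary index, equality on key
SELECT * FROM Emp WHERE DeptId = 4;   -- secondary index, equality on non-key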
7.2.1 Sorting
Sorting is the method of storing records in one or more columns in ascending or descending order. It’s
important because some queries will require us to return sorted records, as well as some processes, such
as joins, will be more efficient with sorted records. The primary key column is used to sort all of the data
by default. We can also specify that the data be sorted based on other columns if necessary. There are
primarily two sorts of sorting procedures employed.
The sorting method is used in query processing to efficiently perform various relational operations, such as joins, which may require sorted input. We can create an index on the sort key and then use that index to read the relation in sorted order; using an index, however, orders the relation only logically rather than physically. As a result, sorting is applied in the following cases:
Case 1: Relations that are small or medium in size and fit in the primary memory.
Case 2: Relations whose size is larger than the available memory.
In Case 1, the small or medium-sized relations do not exceed the primary memory, so we can hold them entirely in memory and may use standard sorting methods such as quicksort, merge sort, and so on.
The conventional algorithms do not operate effectively in Case 2. For relations whose size exceeds the memory limit, we employ the external sort-merge technique. Sorting relations that do not fit in memory because their size is greater than the memory size is known as external sorting, and the external sort-merge is the most appropriate option for it.
Figure 2 shows, with the help of an example, how the external sort-merge algorithm works and how the cost of external sorting can be analysed:
Figure 2: External sort-merge — the initial relation is read in memory-sized pieces that are sorted to create runs; the runs are then merged in successive merge passes (merge pass-1, merge pass-2) to produce the sorted output.
Assume we’re executing an external sort-merge on a relation R. Suppose that each block can only carry
one tuple and that the memory can only hold three blocks. As a result, it will use two blocks as input and
one block as output while performing Stage 2, i.e. the merge stage.
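As a small worked example (assuming, for illustration, that the relation in Figure 2 occupies br = 12 blocks): with M = 3 memory blocks, the run-creation stage produces ⌈12/3⌉ = 4 sorted runs of 3 blocks each. Each merge pass can merge M − 1 = 2 runs at a time, so ⌈log2 4⌉ = 2 merge passes are needed. Counting one read and one write of the relation for run creation and for each merge pass, but not the final write of the result, the total number of block transfers is approximately br × (2⌈logM−1(br/M)⌉ + 1) = 12 × (2 × 2 + 1) = 60.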
Figure 3: Merge join — relations pr and ps, both sorted on the join attribute a1, are scanned concurrently and tuples with matching join-attribute values are combined.
Hash-join: Useful for equi-joins and natural joins. To split the tuples of both relations, a hash function h is used. h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
r0, r1, ..., rn denote partitions of r tuples. Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
s0, s1, ..., sn denote partitions of s tuples. Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
Note: In the textbook, ri is denoted as Hri, si as Hsi and n as nh.
Figure 4 depicts hash join:
Figure 4: Hash join — r and s are each split by the hash function h into partitions 0 to 4, and corresponding partitions of r and s are then joined.
The join r ⋈ s is then computed as the union of the joins of the corresponding partitions: (r1 ⋈ s1) ∪ (r2 ⋈ s2) ∪ ... ∪ (rn ⋈ sn). Nested loops or block nested-loops join can be used to compute each partition join ri ⋈ si.
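As a rough worked example (the block counts are assumed for illustration): if relation r occupies br = 1,000 blocks and s occupies bs = 500 blocks and neither fits in memory, the hash join reads and writes both relations once while partitioning and then reads each pair of partitions once while probing, so it needs roughly 3(br + bs) = 3 × (1,000 + 500) = 4,500 block transfers, ignoring partially filled blocks and any recursive partitioning.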
Aggregation may be implemented in the same way that duplicate removal is. To gather tuples in the
same group together, sorting or hashing may be employed, and then the aggregate methods can be
applied to each group.
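For instance, a grouped aggregate such as the following (using the Employee table created later in this unit) can be evaluated by first sorting or hashing the tuples on Designation and then applying AVG to each group:
SELECT Designation, AVG(Salary) AS AvgSalary
FROM Employee
GROUP BY Designation;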
As an optimization, partial aggregate values can be computed to combine tuples in the same group during run generation and intermediate merges. Set operations (∪, ∩ and −) can employ either a variant of merge join or a variant of hash join after sorting. The outer join can be calculated as a join followed by the addition of null-padded non-participating tuples, by modifying the join algorithms.
The parsed query is then submitted to the query optimizer, which generates various execution plans for the query, analyses them and chooses the option with the lowest estimated cost. The catalogue manager assists the optimizer in determining the optimal plan for executing queries and in calculating the cost of each query plan.
Query optimization is a technique for quickly accessing a database. It’s the art of getting the information
you want in a predictable, dependable, and timely manner. Query optimization is defined as the
process of changing a query into an equivalent form that can be assessed more quickly. The goal of
query optimization is to find an implementation plan that cuts down on the amount of time it takes to
process a query. We must complete two key tasks in order to attain this optimization target. The first is
to determine the optimal plan, and the second is to reduce the amount of time spent running the query
plan.
Cost-Based Optimization
The optimizer assigns a cost in numerical form to each step of a feasible plan for a given query and environment, and then combines these values to get a cost estimate for the plan or possible strategy. After evaluating the costs of all feasible plans, the optimizer aims to find the plan with the lowest cost estimate. The optimizer is therefore frequently referred to as a cost-based optimizer.
Heuristic-Based Optimization
Heuristic-based optimization is also known as rule-based optimization. This is based on the equivalence
rule for relational expressions; as a result, the number of possible query combinations is reduced. As a
result, the query’s cost decreases.
A query tree's internal nodes represent particular relational algebra operations to be performed on the relation(s) supplied to them from the child node(s) directly below, while the relation(s) themselves are represented by the tree's leaves. The tree is evaluated from the bottom up, giving a distinct evaluation strategy. We have seen that a query's query tree may be built in a variety of ways that are all equivalent; in many cases, at least one of these equivalent trees will yield a faster, "optimized" execution plan.
Semantic-based query optimization is another method of query optimization. Many times, the data
inside and between relations contain “rules” and patterns based on “real-world” events that the DBMS
is unaware of. As an example, vehicles like the DeLorean were not made after 1990, so a query like
“Retrieve all vehicles with make equal to DeLorean and year > 2000” will produce zero records. Injecting
these types of semantic rules into a DBMS can thus further enhance a query’s execution time.
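For illustration (the table and column names are assumed, not taken from the text), the query mentioned above could be written as shown below; a semantic rule stating that no vehicle of this make was produced after 1990 allows the optimizer to return an empty result without reading any data:
SELECT *
FROM Vehicle
WHERE Make = 'DeLorean'
  AND ManufactureYear > 2000;   -- semantic rule (no DeLoreans after 1990) implies an empty result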
The number of entries in the database, the number of blocks assigned to each table, the number of
records in each block, the size of the record, duplicate records, the height of the B+ tree, constraints, and
indexes, and other parameters all have an impact on query speed. Let’s look at how to choose a better-
performing query. When a SQL query is submitted to the database, it can be assessed in several ways.
Consider the following example:
SELECT * FROM Employee;
SELECT Id, Name, Designation
FROM Employee
WHERE Id = 2;
The query above pulls the Id, Name and Designation from the Employee table for the record with Id = 2.
However, when it is passed to the DBMS, the DBMS breaks the query into tokens and examines how they might be reassembled to improve performance. The query optimizer is in charge of this. Changing the sequence of the tokens in the query should have no effect on the outcome: the result should be the same in any case, and the order of the records in the result is not fixed. Such a rewritten query is known as an equivalent query. A series of rules must be followed to rearrange the tokens in the query; these are known as equivalence rules.
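As a small illustration using the query above: one standard equivalence rule states that a selection can be pushed below a projection as long as the projected attributes include those used in the selection, so the following two relational algebra expressions are equivalent and the optimizer may evaluate whichever is cheaper:
π Id, Name, Designation (σ Id = 2 (Employee))  ≡  σ Id = 2 (π Id, Name, Designation (Employee))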
The number of distinct values of a column A in relation r, dA(r). This is useful when a query employs an aggregate or projection operation: it gives an estimate of the number of distinct values produced when the column is chosen in a projection, and when an aggregation operation such as SUM, MAX, MIN or COUNT is performed with grouping, it can be used to determine the number of record groups.
The number of index levels, x. This indicates whether a single-level index, such as a primary key or secondary key index, or a multi-level index, such as a B+-tree index, is utilised. The number of index levels tells how many block accesses are necessary to reach the data.
A column's selection cardinality, sA. This is the average number of records in the table that have the same value of column A. It is computed as nr / dA(r), where nr is the total number of records in the relation. Assume the EMP table contains 500 rows and DEPT_ID has five different values. Then, in the EMP table, the selection cardinality of DEPT_ID is 500/5 = 100. That indicates that each department has an average of 100 employees. This is useful in calculating the
average number of records that would meet the selection criteria.
There are several more elements to consider, such as index type, data file type, sorting order, sorting
type, and so on.
So far, we’ve seen how queries are parsed and navigated, how they’re evaluated using various ways,
and what the costs are when different methods are utilized. The key part of analyzing a query now is
determining which evaluation plan to use so that it can be traversed effectively. It compiles all statistics,
costs, access/evaluation pathways, relationship trees, and so on. It then analyses them and selects the
most appropriate assessment path.
It computes the costs based on the various methods we have seen thus far. It searches the data dictionary for the operator, join type, indexes, number of records, selectivity of records, distinct values, and so on. Once all of this information has been gathered, it selects the optimal evaluation plan.
The following SQL program provides the queries for which a query optimizer constructs relational algebraic expressions and generates an execution plan, choosing an efficient execution strategy for processing the given queries:
BEGIN TRANSACTION;
CREATE TABLE Employee(Id INTEGER, Name VARCHAR, Designation VARCHAR,
Salary INTEGER);
INSERT INTO Employee VALUES(1, 'Babita', 'Publisher', 80000);
INSERT INTO Employee VALUES(2, 'Sumit', 'Sales Manager', 50000);
INSERT INTO Employee VALUES(3, 'Rahul', 'Account Manager', 90000);
INSERT INTO Employee VALUES(4, 'Rakesh', 'Book Editor', 70000);
INSERT INTO Employee VALUES(5, 'Neeraj', 'Content Writer', 60000);
COMMIT;
SELECT * FROM Employee;
SELECT Name
FROM Employee
WHERE Id
IN (SELECT Id
    FROM Employee
    WHERE Id = 4);
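Most database products let you inspect the execution plan the optimizer has chosen for such a query; the exact statement varies by product, as in the hedged examples below:
EXPLAIN SELECT Name FROM Employee WHERE Id IN (SELECT Id FROM Employee WHERE Id = 4);           -- e.g., MySQL or PostgreSQL
EXPLAIN PLAN FOR SELECT Name FROM Employee WHERE Id IN (SELECT Id FROM Employee WHERE Id = 4);  -- e.g., Oracle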
Query cost is a value that the optimizer calculates to estimate how much time your query will take (relative to the total batch time).
A file scan in an RDBMS (relational database management system) reads a relation only if the entire relation is stored in one file.
In a linear search, the system checks each record to see if it meets the specified criteria.
A join is a SQL action that creates a relationship between two or more database tables by establishing
a connection between them based on matching columns.
A merge join sorts both relations according to their join attributes (if they are not already sorted on the join attributes).
To split tuples of both relations, a hash function h is used.
The term "complicated join" refers to a SQL operation that performs an inner join as well as acquiring additional data from one or more tables.
The process of selecting an effective execution plan for processing a query is known as query
optimization.
7.9 GLOSSARY
Query cost: A value that the optimizer calculates to estimate how much time your query will take (relative to the total batch time)
File scans: Reads a relation only if the entire relation is stored in one file.
Linear search: The system checks each record to see if it meets the specified criteria.
Join operation: A SQL operation that creates a relationship between two or more database tables by
establishing a connection between them based on matching columns.
Merge join: An operation which sorts both relations according to their join attributes (if not already sorted on the join attributes)
Hash join: A join method that uses a hash function to split the tuples of both relations into partitions, which are then joined.
Complicated join: A SQL operation that performs an inner join as well as acquiring additional data from one or more tables.
Query optimization: The process of selecting an effective execution plan for processing a query.
Section Introduction
2. Query cost is a value that the optimizer calculates to estimate how much time your query will take (relative to the total batch time). Refer to Section Measures of Query Cost Selection Operation
3. The process of selecting an effective execution plan for processing a query is known as query
optimization. Refer to Section Query Optimization Overview
4. The optimizer assigns a cost in numerical form to each step of a feasible plan for a given query and environment, and then combines these values to get a cost estimate for the plan or possible strategy. Refer to Section Query Optimization Overview
5. The sorting method is used in query processing to efficiently perform various relational operations
such as joins and so on. Refer to Section Measures of Query Cost Selection Operation
https://www.serverwatch.com/guides/sql-query-optimization/
https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL204
Discuss with your friends and classmates about the concept of query processing. Also, discuss the
importance of query optimization.
UNIT
08 Distributed Database
Names of Sub-Units
Overview
This unit begins by discussing about the concept of distributed database. Next, the unit discusses
the distributed database principles and transparencies in distributed database. Further the unit
explains the transaction control in distributed database. Towards the end, the unit discusses the query
processing in distributed database.
Learning Objectives
Learning Outcomes
https://cs.uwaterloo.ca/~tozsu/publications/distdb/distdb.pdf
8.1 INTRODUCTION
A distributed database is a collection of numerous conceptually interconnected databases that are
scattered across a network. A distributed database system is made up of loosely connected sites that
have no physical components in common. Each site’s database systems are independent of one another.
Transactions may get access to data at one or more locations. In a distributed database system, data is
kept across several locations, and each site is generally maintained by a DBMS that runs independently
of the others. DDBMS is made up of a single huge logical database that is broken into many parts.
DDBMS applications are frequently classified into two categories:
Local applications
Global applications
Users should be allowed to ask queries without identifying the location of the referenced relations,
copies, or fragments of the relations.
Users should be able to write transactions that access and change data at several locations in the same
way that they would write transactions over exclusively local data. The effect of a transaction between
sites, in particular, should remain atomic.
The distributed database management system (DDBMS) is a concept in which data, and possibly certain control components, are distributed among numerous computers linked by a network. The client process
that accesses such a DDBMS may be located on the same computer as the database or on a separate
machine. This client-server approach has grown in popularity in large enterprises for both internal
business functions and Internet services such as e-commerce websites and online multiplayer games
(where millions of players around the world simultaneously connect to servers hosted by the provider
via an internet connection). These applications have created massive volumes of data that must be
stored someplace, accessed quickly, and modified in a variety of ways. Distributed databases provide
capabilities that centralized databases do not. They not only aid in the resolution of today’s data storage
issues, but they may also alter how programs operate on the internet and elsewhere.
Data are kept at one or more dedicated network locations that operate as servers and oversee the
retrieval, updating, and other management of that data under the client-server model. A file server is one sort of server that maintains files saved on behalf of its clients. A database server (DBMS) is another type: it maintains enormous volumes of connected information such as customer records, product descriptions, and so on. All of this information is physically kept in one or more databases, often hosted on large dedicated computers. There are also clusters of specialized servers dedicated to certain tasks,
such as online search engines and e-commerce apps. To provide improved availability and faster access
to its information, database servers can be duplicated, for example, by keeping copies at geographically
distant sites.
Distributed database systems differ from centralized database management systems (DBMS) in that
they seek to preserve data consistency while updating sections of their data simultaneously. The
requirement to maintain a consistent view is the most significant element that distinguishes distributed
databases from other types of databases. This limitation imposes stringent constraints on how changes
must be handled and creates some challenging issues in concurrency control, transaction management,
and query processing. Distributed DBMSs should also offer availability by continuing to run despite
the failure or outage of one or more network locations; typically, this implies that the system should
continue to function if any of its nodes or links fail. Consistency can be achieved by replicating data
across different computers (“sharding”) or by using procedures such as two-phase commit.
Some of the fundamental principles of a distributed database system are as follows:
A distributed system should appear to the user exactly like a non-distributed system.
Local autonomy
No reliance on a central site
Location independence (transparency)
Fragmentation independence (vertical, horizontal, or both)
Replication independence
Hardware independence
Operating system independence
Network independence
DBMS independence
8.2.1 Architectures
The distributed database system can use applications from both local and remote databases. The
architecture of a system affects its structure, and distributed database management systems can be
implemented in a homogeneous or heterogeneous manner.
The architecture helps in understanding the system's components as well as defining their functions. It also specifies how the various components interact with one another.
A homogeneous distributed database system (HDDBS) is a network of two or more Oracle Databases that run on one or more computers. The diagram shows a distributed system with three databases: headquarters, manufacturing, and sales. An application can access or change data in multiple databases at the same time in a distributed environment. A single query from a Manufacturing client on the local database mfg, for example, can return data from both the local database's products table and the remote hq database's department table.
A client application has no knowledge of the database’s location or platform. You can create synonyms
for distant items in a distributed system so that users are able to access them using the same syntax
as local objects. If you’re connected to database mfg but need to access data from database hq, for
example, you can run this query by creating a synonym for the distant dept table on mfg:
SELECT * FROM dept;
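The synonym itself is defined over a database link. A minimal sketch is given below; the link name hq and the scott/tiger schema and credentials are illustrative assumptions, not part of the original example:

CREATE DATABASE LINK hq
  CONNECT TO scott IDENTIFIED BY tiger   -- illustrative credentials
  USING 'hq';                            -- net service name of the remote database

CREATE SYNONYM dept FOR scott.dept@hq;   -- local name for the remote table

With the synonym in place, SELECT * FROM dept; issued on mfg transparently reads the table stored at hq.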
In this approach, a distributed system seems to have native data access. Users of mfg are not required
to be aware that the data they access is stored in remote databases. Figure 1 depicts a homogeneous distributed database system:
[Figure 1: Homogeneous distributed database system comprising the hq, mfg and sales databases]
Oracle distributed database systems can contain Oracle Databases of various versions. A distributed database system can use any supported release of Oracle Database. However, programs that interact with the distributed database must be aware of the capabilities offered at each node in the system. A distributed database application cannot expect an Oracle7 database to recognize SQL extensions that are available only in later Oracle Database releases.
Different schemas, operating systems, DDBMS, and data models are used in a heterogeneous distributed
database.
A particular site in a heterogeneous distributed database may be completely uninformed of other sites,
resulting in limited cooperation in processing user requests. Translations are required to establish
communication across sites due to the constraint.
In a heterogeneous distributed database system, at least one database is not an Oracle database. The
application sees the heterogeneous distributed database system as a single, local Oracle Database. The
dispersion and heterogeneity of the data are hidden by the local Oracle Database server.
The Oracle Database server connects to the non-Oracle Database system through Oracle Heterogeneous
Services and an agent. If you use an Oracle Transparent Gateway to access a non-Oracle Database data
store, the agent is a system-specific program. If you include a Sybase database in an Oracle Database
distributed system, for example, you must get a Sybase-specific transparent gateway so that the Oracle
Database in the system can interact with it. Alternatively, if the non-Oracle Database system supports the ODBC or OLE DB protocols, you can utilize generic connectivity to access non-Oracle Database data stores.
[Figure: Multiple clients connected to a server over TCP connections]
8.2.2 Design
A distributed database system’s design is a difficult task. As a result, a careful analysis of the strategies
and objectives is required. The following are some of the general strategies and objectives found in most
DBS designs:
Data replication
Data allocation
Data fragmentation
Data Replication
The storing of data copies at several network locations is known as data replication. Fragment copies can be kept at many locations, increasing data availability and improving response times. A mutual consistency rule applies to replicated data: to maintain data consistency throughout all replications, this rule demands that all copies of the data fragments be identical.
Although data replication improves availability and response times, the replicas can be difficult to maintain. If data is replicated across many sites, for example, the DDBMS must select which copy to access. For a query operation, the nearest copy is all that is necessary to satisfy the transaction. If the operation is an update, however, all copies must be selected and modified in order to meet the mutual consistency criterion.
A database can be either fully replicated, partially replicated or unreplicated as shown in Table 1:
Full replication: Stores a copy of each database fragment at multiple sites. Because of the amount of overhead placed on the system, fully replicated databases might be unfeasible.
Partial replication: Stores copies of some database fragments at multiple sites. Most DDBMSs are capable of handling this form of replication quite effectively.
No replication: Each database fragment is stored at a single site. There are no duplicates.
Data replication is most effective when remote data is used frequently and the database is relatively large. Another advantage of data replication is the ability to restore data lost at a specific location.
Data Allocation
The process of determining where to store data is known as data allocation. It also entails deciding which data should be kept where. Data might be allocated in a centralised, partitioned or replicated manner, as shown in Table 2:
Centralised: The complete database is kept at a single site. There is no distribution.
Partitioned: The database is divided into fragments that are stored at several sites.
Replicated: Copies of one or more database fragments are kept at several sites.
Data Fragmentation
Data fragmentation is the process of dividing a single object into two or more segments. A user’s
database, a system database, or a table could be the object. Over a computer network, any fragment can
be stored at any location. The distributed data catalog (DDC) stores information on data fragmentation,
which is retrieved by the TP to process user requests. At the table level, data fragmentation strategies
are based on dividing a table into logical fragments. The three types of data fragmentation, illustrated by the SQL sketch after this list, are as follows:
Horizontal fragmentation: The division of a relation into subsets (fragments) of tuples is referred to
as horizontal fragmentation (rows). Each fragment is stored in its own node and has its own set of
rows. The unique rows, on the other hand, all have the same attributes (columns). In a nutshell, each fragment is the result of a SELECT statement with a WHERE clause on a single attribute.
Vertical fragmentation: The division of a relation into attribute (column) subsets is referred to as
vertical fragmentation. Each subset (fragment) is stored at a separate node and it has its own set of
columns, with the exception of the key column, which is shared by all fragments. This is similar to
the PROJECT statement in SQL.
Mixed fragmentation: A mix of horizontal and vertical fragmentation strategies is referred to as
mixed fragmentation. To put it another way, a table can be broken into numerous horizontal subsets
(rows), each with its own set of properties (columns).
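The following is a hedged SQL sketch of horizontal and vertical fragmentation over a hypothetical CUSTOMER table with columns cust_id, name, city and credit_limit; all table names, column names and predicates are illustrative only:

-- Horizontal fragmentation: each fragment is a subset of rows,
-- selected by a predicate (WHERE clause) on a single attribute.
CREATE TABLE customer_east AS
  SELECT cust_id, name, city, credit_limit
  FROM customer
  WHERE city = 'Kolkata';        -- fragment stored at the eastern site

-- Vertical fragmentation: each fragment is a subset of columns,
-- and every fragment repeats the key column (cust_id), like a PROJECT.
CREATE TABLE customer_credit AS
  SELECT cust_id, credit_limit
  FROM customer;                 -- fragment stored at the finance site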
8.2.3 Implementation
A distributed database is made up of several databases that are geographically distributed throughout
the world. A distributed database management system (DBMS) manages the distributed database and
makes it appear to users as a single database.
7
JGI JAIN
DEEMED-TO-BE UNI VE RSI TY
Advanced Data Management Techniques
A distributed database is a collection of many interconnected databases that are physically scattered
across multiple places and interact over a computer network.
The databases in the collection are logically connected to one another. They frequently represent a single
logical database. Data is physically kept in numerous locations. Each site’s data can be maintained by
a DBMS that is independent of the other sites. The processors at the sites are linked via a network.
A distributed database does not have a multiprocessor architecture. A distributed database integrates
transaction processing, but it is not synonymous with a transaction processing system.
[Figure: Types of DDBMS transparency: transaction transparency, performance transparency, DBMS transparency and distribution transparency]
Transaction Transparency
Transaction transparency ensures that all distributed transactions maintain the distributed database's integrity and consistency. A distributed transaction accesses data stored at more than one location. Another point to keep in mind is that the DDBMS is responsible for maintaining the atomicity of each sub-transaction (this means that the entire transaction either occurs completely or not at all). Because of the use of fragmentation, allocation and replication structures in a DDBMS, providing this transparency is extremely complicated.
Performance Transparency
For this transparency, a DDBMS must perform in a way comparable to a centralised database management system. Even though the system is distributed, there should be no performance loss. In a DDBMS, a distributed query processor must be able to transform a data request into an ordered sequence of local database operations. Another factor to consider is the database management system's fragmentation, replication and allocation structure.
DBMS Transparency
This transparency applies only to heterogeneous DDBMSs because it hides the fact that the local DBMSs may be different (sites may use different operating systems, products and data models). This is one of the most difficult forms of transparency to provide.
Distribution Transparency
The user does not need to know that the data is fragmented when a DDBMS has distribution data
transparency since it lets them recognise the database as a single object or logical unit. Distribution
transparency is further divided into three types:
Location transparency: Location transparency guarantees that any table(s) or fragment(s) of a
table may be queried as if they were stored locally on the user’s site. The end-user should be fully
unaware that the table or its parts are kept at a remote site in the distributed database system. The remote sites' addresses and access procedures are entirely concealed.
To include location transparency, DDBMS should have access to an up-to-date and accurate data
dictionary, as well as a DDBMS directory that contains information about data locations.
Fragmentation transparency: Users can query any table as if it were unfragmented thanks to
fragmentation transparency. As a result, it conceals the fact that the table on which the user is
searching is actually a fragment or union of some fragments. It also conceals the fact that the
fragments are scattered across different sites.
This is analogous to SQL views, where the user may be unaware that they are using a view of a table
rather than the table itself.
Replication transparency: Replication transparency guarantees that database replication is
concealed from users. It allows users to query a table as though there is just one copy of the table.
Transparency in replication is linked to concurrency and failure transparency. When a user makes
a change to a data item, the change is mirrored in all copies of the table. However, the user should
not be aware of this procedure.
This is known as concurrency transparency. In addition, if a site fails, the user can continue with his requests using replicated copies without being aware of the failure. This is an example of failure transparency.
A transaction is a logical unit of database processing that is considered as an atomic unit of work. A query written in Structured Query Language (SQL) may involve more than one transaction that reads or writes data from various tables. Every day, database systems deal with such a large number of transactions that the database system is often referred to as a transaction processing system.
A transaction can involve four types of operations, which are described as follows:
READ (A): Reads data from the database tables that are already stored in the memory, where A
signifies the record that is to be read.
WRITE (A): Writes the record A that may be the result of a transaction back to the main memory or
to the disk.
COMMIT: Makes sure that all the changes made by the transactions are made permanent.
ROLLBACK: Terminates the transaction in case of an unsuccessful transaction and undoes any
changes made in the database.
Let’s take an example to clear the idea about how transaction takes place in case of a library management
system when a user is about to borrow a book B from the library online. Let’s assume that the account
number for the customer is A.
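A hedged SQL sketch of such a transaction is shown below; the table and column names (book, account, status, books_issued) are assumptions made purely for illustration:

-- Mark book B as issued (a READ followed by a WRITE on the book record)
UPDATE book
SET status = 'ISSUED'
WHERE book_id = 'B' AND status = 'AVAILABLE';

-- Record the loan against account A (a WRITE on the account record)
UPDATE account
SET books_issued = books_issued + 1
WHERE account_no = 'A';

COMMIT;      -- make both changes permanent
-- If either update fails, ROLLBACK; is issued instead to undo all changes.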
ACID Property
All transactions in a database follow certain properties, which help them to remain efficient. These
properties are called Atomicity-Consistency-Isolation-Durability properties or in short ACID properties.
We will define these properties in detail as follows:
Atomicity: Ensures that the whole unit of transaction is performed as one. We can say that a transaction should be done completely or not at all. Therefore, the transaction needs to be performed completely; otherwise, the whole transaction should be rolled back. Enforcing atomicity on every transaction requires maintaining a log of every transaction.
Consistency: Means if we consider the database on which a transaction is about to be performed is
in State_1, then after the successful completion of the transaction, the state of the database should
be moved to State_2. This defines the consistency property, which ensures that the transaction
commit operation always transforms a database from one consistent state to another consistent
state. When a transaction fails, this property ensures that the database should return to its previous
consistent state.
Isolation: Refers to a very important property since it allows transactions to execute concurrently without error. This property ensures that the changes made by a transaction are not visible to other transactions until the transaction is committed. Isolation guarantees that the concurrent execution of transactions on the same database and on the same tables does not affect the other transactions.
For example, in the library management system, if a user A places a request to borrow book B, and at the same time, the administrator places a request that changes the book name B to B1, then this change (whether it fails or commits) should not be reflected in the first transaction done by user A. Only the state that has been read by transaction 1 should determine the result of the transaction.
Durability: Confirms that the changes made by the successful execution should be made permanent.
Thus, the changes should never be lost due to any kind of failure. We can say that transaction is
a basic unit of recovery. In a database, a log has been maintained that keeps the record of every
committed transaction.
State of Transaction
A transaction in its life cycle can have many states. When a transaction first hits a system, it starts as a
new program and waits for the CPU execution. When the CPU becomes free to execute the transaction,
its state changes from Wait to Execute. In the execution phase, a database goes to an inconsistent state
since the transaction changes the data values of the database. Here, if any internal error occurs, the
transaction may move into the Abort state; otherwise, for a successful execution of the transaction, it
moves to a consistent state known as Commit state, where all the changes made by the transaction are
made permanent. In case of any error during the execution, the transaction may go to the Rollback
state, where all the changes made by the transaction in the database are undone and the database is
moved to a consistent state that it had before the transaction. From here, the same transaction can be
executed as a new transaction back again.
[Figure 4: States of a transaction: Wait, Execute, Commit, Abort and Rollback]
The main difference between the Abort and Rollback states shown in Figure 4 is that in the Abort state, no changes are made to the database at all; thus, there is no need to undo any changes and the database remains in its consistent state. In the Rollback state, on the other hand, the changes made by the transaction have already been applied to the database and must be undone to keep the database in its consistent state. In the Abort state, the transaction cannot be executed again as a new transaction, whereas in the Rollback state it can be executed again as a new transaction, which means that the transaction goes back into the transaction queue.
To allow distributed transactions, a new component called the global transaction manager is added. The transaction's origin site might temporarily assume the role of global transaction manager, coordinating database operations with the transaction managers at other sites. Transaction managers expose their capabilities to application programs via an interface.
Each transaction’s bookkeeping information, such as a unique identity, originating site, name, and
so on, is stored by the manager. If valid and available, it returns a local copy for READ operations. It
ensures that updates are visible across all sites that hold copies (replicas) of the data item for WRITE
operations. The manager ensures that no transaction consequences are reflected in any site of the
distributed database for ABORT procedures. It ensures that the consequences of a write are persistently
recorded on all databases having copies of the data item for COMMIT actions. The two-phase commit
technique is often used to implement atomic termination (COMMIT/ ABORT) of distributed transactions.
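As a hedged illustration of what atomic termination can look like at the SQL level, the sketch below uses PostgreSQL's two-phase commit commands; the transaction identifier, table and amounts are illustrative, and the coordination logic itself is assumed to live in the global transaction manager:

-- Phase 1, run at every participating site: do the local work, then prepare,
-- which forces the changes to stable storage without making them visible.
BEGIN;
UPDATE account SET balance = balance - 100 WHERE account_no = 'A';
PREPARE TRANSACTION 'dist_txn_42';

-- Phase 2, issued by the coordinator once every site has prepared:
COMMIT PREPARED 'dist_txn_42';
-- If any site fails to prepare, the coordinator instead issues:
-- ROLLBACK PREPARED 'dist_txn_42';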
The database operation and associated information are passed from the transaction manager to the concurrency controller. The controller is in charge of acquiring and releasing the associated locks. If the transaction requires access to a locked resource, it is postponed until the lock is acquired. Once the lock is acquired, the operation is passed to the runtime processor, which handles the actual execution of the database operation. Locks are released once the operation is completed, and the transaction manager is updated with the results.
Defining Timestamp
When a transaction first enters a system, a timestamp assigned to the transaction is a unique way to identify it. We must make sure that no two transactions have the same timestamp. The timestamp is taken as the start time of a transaction. There are different ways to assign a timestamp to a transaction, which are as follows:
i. Whenever the first transaction enters the system, we assign it the timestamp value 1 and maintain a counter that increments by 1 and assigns that value as the timestamp for the next transaction. Thus, the timestamp values for the following transactions will be 2, 3, 4, etc.
ii. Whenever a transaction enters the system, we can assign the system clock time as the timestamp value of the transaction. In this case, the system should ensure that no two transactions enter the system at the same time.
To describe the timestamp-based protocol, we will denote the timestamp of a transaction T as TIMESTAMP (T). If Ti is the first transaction that enters the system and Tj is the second transaction, then the timestamp value of Ti is smaller than the timestamp value of Tj; that is, TIMESTAMP (Ti) < TIMESTAMP (Tj). Since the timestamp is used to maintain the serializability of a schedule, the protocol must maintain an order of execution that produces a schedule equivalent to a serial schedule following the timestamp order.
We can think of maintaining two types of timestamps depending on the types of operation in a
transaction. The types of timestamps are as follows:
WRITE_TIMESTAMP (A): Represents the WRITE operation timestamp, which is the largest timestamp of all the transactions that have successfully written the data item A in the database. Thus, if T is the youngest transaction that has written A, then we can write TIMESTAMP (T) = WRITE_TIMESTAMP (A).
READ_TIMESTAMP (A): Represents the READ operation timestamp, which is the largest timestamp of all the transactions that have successfully read the data item A in the database. Thus, if T is the youngest transaction that has read A, then we can write TIMESTAMP (T) = READ_TIMESTAMP (A).
We can divide the timestamp ordering or timestamp-based protocol into three categories, where the
first represents basic timestamp-based protocol, the second represents strict timestamp-based protocol,
and the final is Thomas’s write rule.
Thus, the basic timestamp-based protocol ensures conflict serializability by aborting a transaction whenever it does not follow the rules. The advantage of this protocol over 2PL is that it avoids deadlock while still allowing concurrent execution. However, starvation may occur if a transaction is continually being aborted.
In the strict timestamp-based protocol, if a transaction T has written a data item A and another transaction issues a READ (A) operation whose timestamp is greater than WRITE_TIMESTAMP (A), then the READ operation for the other transactions in the queue is delayed until transaction T commits or aborts. The same rule holds for the WRITE operation also. To implement this protocol, we need to simulate the locking operation along with the timestamp technique, which ensures that the other operations can take place only when the locks are released. This protocol also helps to avoid deadlock.
Each site has a scheduler that maintains a separate queue for each transaction manager in order
to perform timestamp ordering techniques. A transaction manager sends a lock request to the site’s
scheduler during the transaction. The scheduler routes the request to the appropriate queue in ascending
timestamp order. Requests are handled from the front of the queues in the order of their timestamps,
i.e., the oldest requests are processed first.
A conflict graph is generated for the classes that include active transactions. This graph has vertical, horizontal, and diagonal edges. A vertical edge links two nodes inside a class and signals class
conflicts. A horizontal edge joins two nodes from different classes and indicates a write-write conflict
between them. A diagonal edge links two nodes from different classes and indicates a write-read or
read-write conflict between the two classes.
The conflict graphs are examined to determine if two transactions within the same class or two
transactions from separate classes may be conducted concurrently.
https://vulms.vu.edu.pk/Courses/CS712/Downloads/Principles%20of%20Distributed%20
Database%20Systems.pdf
https://www.google.co.in/books/edition/Principles_of_Distributed_Database_Syste/TOBaLQMuNV4C?hl=en&gbpv=1&dq=distributed+database&printsec=frontcover
Discuss with your friends and classmates the concept of distributed database. Also, discuss the
importance of distributed database.
UNIT
09 Data Warehousing
Names of Sub-Units
Introduction to Data Warehousing, Need for Data Warehousing, Features of a Data Warehouse, Benefits of Data Warehousing, Data Warehouse, Common Architecture for Data Warehouse, Data Warehousing Design, Data Warehouse Design Strategies, Data Warehousing Life Cycle, Data Warehouse and Data Marts
Overview
The unit begins by discussing the concept of data warehouse. Next, the unit explains the need and
features of data warehouse. Further, the unit explains the data warehouse architecture and the
common architecture for data warehouse. Then, the unit discusses the data warehousing design. The
unit also explains the data warehouse design strategies. Towards the end, the unit discusses the data
warehouse and data marts.
Learning Objectives
Learning Outcomes
https://docs.oracle.com/en/database/oracle/oracle-database/21/dwhsg/database-data-
warehousing-guide.pdf
9.1 INTRODUCTION
A data warehouse is a database management system that facilitates online analytical processing by
allowing the data to be viewed in different dimensions or perspectives, to provide business intelligence.
Data warehouse can also be referred as a collection of data designed to support decision-making process
of the management. Data warehouses contain various data that present a clear picture of the business
conditions at a single instant of time. The development of a data warehouse includes the development of
systems that are used to extract data from operating systems. The installation of a warehouse database
provides the managers with access to the data in a flexible manner. Data warehousing generally refers
to the combination of several different databases across a whole enterprise.
In 1980, Bill Inmon coined the term ‘data warehouse’. According to him, “A data warehouse is a subject-
oriented, integrated, time-variant and non-volatile collection of data in the support of management’s
decision-making process." In this definition, it is very important to emphasise words like subject-oriented,
integrated, time-variant and non-volatile.
“Subject-oriented” signifies that the data points to critical subjects such as sales, inventory, etc.
“Integrated” signifies that the data is generated from various sources. “Time-variant” implies that the
data is stored in such a way that all the updated data is stored with the time it is updated. “Non-volatile”
implies that data is never removed, i.e., historical data is also kept. So, whenever a change takes place
in any of the field values of a table, the previous values also need to be kept in addition to the present
values.
There is a growing need for organisations to obtain routine management information, which is required to take strategic decisions.
This information is collected through a number of complex queries.
For instance, the database of a TV manufacturing company contains a lot of information about the
profile of purchasers, time of purchase, model purchased, etc. The management may like to know the
answer to complex questions such as “which is the fastest selling model of TVs, in which period the
highest number of sets of this model have been sold, and what is the profile of the customers buying this
model?” This information is important to plan the production schedule and also to change the marketing
strategy so that the marketing team can focus on a particular market segment during a particular
period of time in a year. Such ad hoc complex information that needs to be obtained from the historical
data stored in the database, after necessary analytical processing, is called business intelligence. The
systems that provide business intelligence are now known as ‘data warehouses’.
Data warehousing in coordination with business intelligence tools and techniques facilitates decision
making by taking care of the following:
Risk management
Cost control
Revenues or business results
Customer satisfaction
By using DWH and BI tools, each decision can significantly impact an organisation by reducing costs and increasing revenue through the following:
Avoiding problems, such as poor credit risks
Negotiating improvement in supply
Dropping unprofitable products
Reducing waste due to poor quality
Understanding and providing better customer service
Focusing on the most profitable products
Improving cross-selling to customers
Capitalising on trends
Capturing growing marketing opportunities
Besides these, the need for data warehousing also arises due to the following reasons:
Increasing demand for strategic information
Inability of past decision support system
Subject-oriented: A data warehouse is organised around the major subjects of an organisation. For example, to analyse the sales data of an organisation, a warehouse can be built that concentrates on sales. With the help of this warehouse,
we can know about the best customers for a particular item in the previous year. Since we can
define a data warehouse by a specific subject matter (in this case, sales) data warehouse is subject-
oriented.
Integrated: Integration is related to subject orientation closely as both are concerned with the
design of the data warehouse. The data should be put by data warehouses from different sources
into a consistent format. The problems among the units of measure, such as naming conflicts and
inconsistencies, must be resolved. When the problems are resolved, they are integrated.
Non-volatile: Non-volatile expresses that the data should not change once they have entered into
the data warehouse. This is logical as the aim of a warehouse is to provide us with the ability to
examine what has happened.
Time-variant: To find general developments and changes in business, analysts require huge amount
of data. This is highly in contrast to Online Transaction Processing (OLTP) systems, where moving
of historical data to an archive is a requirement. The term ‘time-variant’ means a data warehouse’s
attention to changes taking place over a period of time. Typically, on a monthly, weekly, or daily
basis, data flows from one or more OLTP databases into a data warehouse. A staging file is used to
process data, which flows in before adding them to the data warehouse.
Data granularity: It refers to the level of detail. The finer the data granularity, the more detailed the data that is stored. In operational systems, data are stored at the lowest level of detail. However, we may
require data which are the aggregation of the lowest level of details. For example, sales details may
be stored on daily basis in the database; however, for analysis, we may require them on a monthly,
quarterly or yearly basis. In a data warehouse, we can keep data at the different levels of granularity
depending on the queries executed.
The data from the various data sources is extracted, transformed and loaded (ETL) into a database known as the target database. In addition to the target database, there is another database to store the metadata,
called metadata repository or metadata database. This database contains data about data—description
of source data, target data and how the source data has been modified into target data. The basic
architecture of a data warehouse is shown in Figure 1:
[Figure 1: Basic architecture of a data warehouse: data sources feed the ETL process into the target database, supported by a metadata database and an admin console, with client software used for reports]
Figure 1 describes the architecture of a data warehouse. In this figure, the client software is used to
generate the business intelligence reports. The data from the data sources is subjected to the ETL
process and stored in the target database (Data ETL). The stored data is then processed and the outcome
is generated on the basis of the users’ queries. The user interface can be based on a custom-built GUI
or it can be a browser interface. The administration console is used to carry out data warehouse
administration including the periodic ETL process to update the data into the data warehouse from the
data sources. We will describe each of these components in the next sub-sections.
If an organisation has offices located in different places (perhaps in different countries), it is likely that
the database systems were also developed without any standardisation of the field names and their
attributes. While taking/using data from various data sources, a data warehouse developer must keep
the following points in mind:
The character representation of data taken from different data sources may be different (most
computers use ASCII code; the mainframes use EBCDIC format).
The table names may be different (customer details table can be customer_details in one database
and cust_details in another database).
The field names may be different (emp_name and Employee_name; both mean the same thing for
us but for computers, they are different).
The date may be stored in different formats in different databases (16 Sept. 15 or 16/9/15 or 9/16/2015).
Units of measure may be different (liters or gallons).
Currency values may be different (price field may contain dollars but in one database, US dollars
and in another database, Singapore dollars).
In some databases, the fields may be NULL.
Data warehouse specialists need to study the minute details of these data sources before starting with
the development of the warehouse.
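A hedged sketch of one such transformation is given below; the source table cust_details (dates stored as DD/MM/YY strings, prices in Singapore dollars), the target table customer_dim and the exchange rate are all illustrative assumptions:

INSERT INTO customer_dim (cust_id, cust_name, signup_date, price_usd)
SELECT cust_id,
       UPPER(TRIM(cust_name)),             -- standardise the name field
       TO_DATE(signup_dt, 'DD/MM/YY'),     -- unify the date format
       ROUND(price_sgd * 0.74, 2)          -- convert SGD to USD
FROM cust_details
WHERE cust_name IS NOT NULL;               -- drop records with NULL names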
[Figure: Data staging: data from operational systems passes through the staging area and quality assurance before being loaded into the database server]
In other words, you can say that data staging is the process of data extraction from the source database,
carrying out the necessary transformations and loading into the target database as well as carrying
out the data quality assurance before a formal release of the data warehouse.
The entire ETL process is generally carried out through a GUI. The GUI facilitates:
Selecting data sources (flat files, RDBMS, ERP, etc.)
Carrying out the necessary transformations
Loading the data on the target database
Creating metadata repository
Administrating the ETL process, monitoring the ETL process, scheduling the ETL jobs, and managing
users in a multi-user environment
Providing the facility to either clean the data (such as range checks, alerting on the missing data,
etc.) or to integrate third party tools for data quality analysis
Permitting rollback in case major errors are encountered in the ETL process
Debugging capability
Facility of error logging
Generating business intelligence reports using OLAP and query analysis
Analysing data from different dimensions
Generating ad hoc queries
Mining data
Performing drill-down and drill-up operations
Performing what-if analysis
Generating a standard report and presenting it in graphical and tabular forms.
Setting alerts to give indications to the managers on the occurrence of important events (e.g., the
sales target for the month has been achieved)
As an example, OLTP systems may store each and every transaction of an ATM. An aggregate can store
the total transactions made by each individual in a month in a summarised form. Alternatively, an
aggregate may store only the total transaction amount on a daily basis without keeping individual
information. How much aggregation is required and what data has to be aggregated is decided at the
time of designing by keeping in mind the needs of business intelligence.
Consider the example of the sales of a company. Assume that for each product, there are three models.
Separate databases are maintained at different showrooms to track the daily transactions and take
the necessary reports at the end of the day and end of the month. Now, the company wants to develop
a data mart and decides to discontinue a particular brand and introduce a new brand in each of the
three products. The manager wants information periodically so that he can take the right decision at
the right time. A simple mechanism for this purpose is to use aggregates from which month-wise sales
of each product can be taken and summarised.
However, note that the manager’s requirements cannot be predicted. To start with, he may first ask
for the summarised data, and then he may go in for details. To change the view of the data to analyse
it in greater detail is called “drill-down”. To change the view of the data to view only the aggregates or
summaries is called “drill-up”.
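A hedged SQL sketch of the two operations over a hypothetical sales(sale_date, product, model, amount) table is shown below; all names and values are illustrative:

-- Drill-up: view only the month-level summary for each product.
SELECT product, TO_CHAR(sale_date, 'YYYY-MM') AS sale_month, SUM(amount) AS total
FROM sales
GROUP BY product, TO_CHAR(sale_date, 'YYYY-MM');

-- Drill-down: return to the daily detail for one product and month.
SELECT sale_date, model, SUM(amount) AS total
FROM sales
WHERE product = 'TV' AND TO_CHAR(sale_date, 'YYYY-MM') = '2015-09'
GROUP BY sale_date, model;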
Data going into the data warehouse come from data sources (operational systems and flat files) and
users (doing analysis, reporting, and mining). The operational data from the data sources first go to a
staging area.
9.6.3 Data Warehouse Architecture with a Staging Area and Data Marts
Data marts are built to focus on particular functions of business for a specific community within an
organisation. The data warehouse architecture with a staging area and data marts is shown in Figure 8:
[Figure 8: Data Warehouse Architecture with a Staging Area and Data Marts]
Besides the above-mentioned three types of architecture, data warehouse architecture can also be
categorised into the following:
Centralised data warehouse: In this type of architecture, queries and applications access data from
the central repository.
Independent data marts: In this type of architecture, different departments of an organisation
create data mart for their usage independently.
Federated: In this type of architecture, information sharing and interaction take place between a number of different systems. Data is either logically or physically integrated.
Hub and spoke: This type of architecture is based on the top-down approach in which all the
requirements of the organisation are analysed first and then the development takes place at an
enterprise-level.
Data mart bus: This type of architecture is based on the bottom-up approach in which individual
process requirements are analysed such as orders placed, deliveries to be done, customer calls or
billing.
The design of a data warehouse must take into account the business stages across the whole organisation. As a result, data warehouse design is a very complicated,
time-consuming, and hence error-prone process. Furthermore, business analytical functions evolve
with time, resulting in changes in system requirements. As a result, data warehouses and OLAP systems
are dynamic, with a constant design process.
The architecture of a data warehouse differs from the view of materialisation in the industry. It considers
data warehouses to be database systems with specific requirements, such as answering management-
related questions. The goal of the design is to determine how records from various data sources should
be extracted, transformed and loaded (ETL) and arranged in a database known as the data warehouse.
As this approach starts with building small prototypes first, it involves less process and cost. A single
data mart usually models a particular business area of a department such as “Purchases” or “Students
Admissions” in an institute. Data warehouses are then developed by associating these data marts. The
alliance of data marts in a data warehouse is concentrated around the integrated dimensions that
depict the potential coordination “points” between data marts. The genuine alliance of two or more
data marts is then completed by a procedure known as "drill across". Drill across works by gathering or summarising data along the keys of the conformed dimensions of every fact contributing to it, followed by a join on the keys of these summarised facts.
The most important management task is to ensure that the dimensions shared among the data marts are conformed and can be trusted. Consider the situation in which an alliance of data marts can take place in an
organisation. Suppose we have created a Sales data mart in the Sales department. When the Sales data
mart is completed, the business might choose to take the warehousing activities into the “Production
department” which created the need of developing a Production data mart. For better coordination
between the Sales and Production departments, the Sales data mart and the Production data mart need
to be integrated.
[Figure: Data warehouse development life cycle, from requirement specification and data modelling through to test and deployment]
Requirement Specification: This is the first phase in the Data Warehouse's development. It identifies the data that must be available for analysis in different combinations depending on the needs. We may call this the data warehouse's overarching plan. This phase is mainly concerned with establishing the business requirements for what is to be stored in the data warehouse.
Data Modelling: This is the second stage in the Data Warehouse’s development. Data modelling is
the process of visualising data distribution and creating databases by meeting the requirements
for transforming data into a format that can be kept in a data warehouse. For example, when we
begin building a house, we place everything in the exact position given in the plan. Data modelling
is what data warehouses are all about. Data modelling aids in the organisation of data, the creation
of links between data sets, and the establishment of data compliance and security that align with
data warehousing aims. This is the most difficult stage of data warehouse creation. Furthermore,
various data modelling approaches are used by corporations for warehouse design. Data modelling
is often done at the data mart level and then branches out into a data warehouse. It is the logic of
how data is stored in relation to other data. For data warehouses, there are three data models to
choose from:
Star Schema
Snowflake Schema
Galaxy Schema.
ETL Design and Development: This is the third phase in the Data Warehouse's development. An ETL tool, which stands for Extract, Transform and Load, may extract data from numerous source systems and store it in a data lake. An ETL procedure may extract data from the lake, convert it, and put it into a data warehouse for reporting purposes. ETL tools are required for optimal speeds, good visualisation, and the ability to establish simple, reproducible and consistent data pipelines between all current architecture and the new data warehouse. This is where ETL solutions such as SAS Data Management, IBM Information Server, Hive, and others come into play. A solid ETL process may aid in the creation of a basic yet functioning data warehouse that is used at all levels of the company.
OLAP Cubes: This is the fourth phase in the Data Warehouse’s development. An OLAP cube, also
known as a multidimensional cube or hypercube, is a data format that enables quick data analysis
based on the various dimensions that constitute a business challenge. A data warehouse would
extract information from a variety of data sources and formats, such as text files, excel spreadsheets,
multimedia files, and so on. The retrieved data is cleaned and converted before being placed into an
OLAP server (or OLAP cube) and pre-processed for further analysis.
Data processing and analysis are often carried out using a basic spreadsheet, with data values
organised in a row and column format. This is best suited for two-dimensional data. OLAP, on
the other hand, comprises multidimensional data, which is often gathered from disparate and
unconnected sources. Using a spreadsheet is not the best option. The cube will logically and orderly
store and evaluate multidimensional data. Data warehouses are now available as a fully constructed
solution that is flexible and capable of staging a variety of data kinds. OLAP cubes are becoming
obsolete because they cannot provide real-time analysis and reporting, and organisations today
need something with great performance.
UI Development: This is the fifth phase in the Data Warehouse's development. So far, the processes outlined have occurred on the backend. There is a requirement for a user interface governing how a user and the computer system interact, namely the use of input devices and software to quickly access the data warehouse for analysis and report generation. The primary goal of a UI is to allow a user to successfully control the device or system with which they are engaging. There are several tools available on the market to assist with UI creation. For individuals who use BigQuery, BI solutions such as Tableau or Power BI are excellent options.
Maintenance: This is the sixth phase in the Data Warehouse’s development. During this phase, we
can update or alter the schema as well as the application domain or needs of the data warehouse.
Data warehouse maintenance solutions must also provide a way to track schema changes, such as
updates. We may execute Insertion operations and alter dimensions and categories at the schema
level. Adding or removing user-defined properties is an example of a change.
Test and Deployment: This is frequently the final stage in the Data Warehouse development cycle.
Businesses and organisations test data warehouses to see whether or not the needed business
problems are effectively implemented. Warehouse testing necessitates the examination of massive
amounts of data. Data that must be compared come from a variety of sources, including relational
databases, flat files, operational data, and so on. The following steps are included in the entire data
warehouse project testing phases: data completeness, data transformation, data loading using ETL
tools, data integrity, and so on. After testing, the data warehouse is deployed so that users can quickly access and analyse the data. Essentially, at this phase, the data warehouse is activated and
made available to the user. The majority of the data warehouse’s functions are implemented at the
time of deployment. The data warehouses may be installed in-house or on the cloud.
A data mart is a collection of datasets that support departments in making specific decisions. For
example, in an organisation, both marketing and sales departments have their own data marts. Each
department of an organisation completely owns the software, hardware, and other components that
form its data mart. As the data marts of every department are different from each other, it becomes
difficult even to manage and organise data across different departments of the same organisation.
Unlike a data mart, a data warehouse spans the entire organisation and does not belong to a particular
department. Data warehouses store more information and have a higher level of granularity as
compared to data marts, which have more summarised data and fewer granularities.
Data marts that store data required for specific business processes or departments are built from
the data warehouse. According to Bill Inmon, a data warehouse is in the middle of the “Corporate
Information Factory” (CIF), which provides a logical framework for providing business intelligence and
business management skills.
According to many vendors, data warehouses are not easy to build and are expensive. They would have you believe that creating a data warehouse is just a waste of time. However, this is not accurate. These vendors see data warehouses as obstacles that stop them from earning profits, and they will tell you about all the drawbacks of a data warehouse that you may encounter while implementing it. Some vendors might also suggest that you build a data warehouse by building a few data marts and letting them grow.
Apart from data marts, there is also an operational data store that provides data to a data warehouse.
Operational data store is a database system that obtains data from different sources, consolidates it,
and keeps it at a single location. However, it does not have the sophistication of a data mart or a data
warehouse for analytical processing. It is generally developed as a precursor for ERP systems.
A data warehouse is a database management system that facilitates online analytical processing
by allowing the data to be viewed in different dimensions or perspectives, to provide business
intelligence.
Data warehouse can also be referred as a collection of data designed to support decision-making
process of the management.
Data warehouses are designed to help us examine data for supporting the decision-making process
of the management.
A data warehouse obtains data from a number of operational database systems, which can be based
on RDBMS or ERP packages.
The data from various data sources is converted into a form suitable to be stored in a data warehouse.
This process is called Extract, Transform and Load (ETL) of data into the target database.
Data staging is the process of transferring data from various data sources (operational systems)
into the target database of the data warehouse using the ETL process.
Data stored in a summarised form is called aggregate and the process of summarising is called
aggregation.
Online Transaction Processing (OLTP) systems may store each and every transaction of an ATM.
A data warehouse is a single data repository that integrates records from many data sources for
online business analytics processing (OLAP).
The design and development process of a data warehouse is to be implemented by keeping in mind
the organisation’s needs and users’ requirements regarding reporting and analysing data.
The top-down approach is designed by taking the help of the normalised enterprise data model and
considers complete design and planning.
The basic idea behind bottom-up approach is to define and code the basic parts first and then these
developed parts are linked together to form a whole system.
Data Warehousing is a flow process that gathers and manages structured and unstructured data
from many sources into a centralised repository to make meaningful business decisions.
A data mart is a collection of datasets that support departments in making specific decisions.
9.12 GLOSSARY
Data warehouse: A database management system that facilitates online analytical processing
by allowing the data to be viewed in different dimensions or perspectives, to provide business
intelligence.
ETL: The process in which the data from various data sources is converted into a form suitable to be
stored in a data warehouse
Data staging: The process of transferring data from various data sources (operational systems) into
the target database of the data warehouse using the ETL process
Data mart: A collection of datasets that support departments in making specific decisions
Multi-source data mart: A data mart for which the input data is obtained from multiple sources
https://www.oracle.com/in/database/what-is-a-data-warehouse/#:~:text=A%20data%20
warehouse%20is%20a,large%20amounts%20of%20historical%20data.
https://cloud.google.com/learn/what-is-a-data-warehouse
Discuss with your classmates the concept of a data warehouse. Also, discuss the various architectures of a data warehouse.
UNIT
10 Dimensional Modelling
Names of Sub-Units
Dimensional Model, The Star Schema, Query Execution Process, The Snowflake Schema, Fact Tables
and Dimension Tables, Factless Fact Table, Updates to Dimension Tables, Primary Keys, Surrogate
Keys and Foreign Keys, Aggregate Tables, Fact Constellation Schema or Families of Star
Overview
The unit begins by discussing the concept of dimensional model. Next, the unit explains the star
schema, query execution process, and the snowflake schema. Further, the unit discusses the fact
tables, dimension tables, and factless fact table. The unit also discusses the updates to dimension
tables. The unit then discusses the primary keys, surrogate keys and foreign keys. Towards the end, the
unit explains the aggregate tables as well as the fact constellation schema or families of star.
Learning Objectives
Learning Outcomes
https://www.sciencedirect.com/topics/computer-science/dimensional-model
10.1 INTRODUCTION
Dimensional modelling (DM) is a data structure approach that has been improved for data storage in a
data warehouse. The goal of dimensional modelling is to improve the database for quicker data retrieval.
Ralph Kimball created the dimensional modelling idea, which comprises “fact” and “dimension” tables.
A dimensional model for a data warehouse has been used to analyse, consolidate and assess numeric
data. In a real-time online transaction processing system, however, relational models are suitable for inserting,
modifying, and removing files. These dimensional and relational models each have their own approach
to storing data, which offers distinct advantages. In the relational model, for example, normalisation
and ER models eliminate data redundancy. A dimensional model in a data warehouse, on the other
hand, organises data in such a way that it is easy to access information and produce reports.
As a result, dimensional models are utilised in data warehouse systems and are not well suited to
relational systems.
In data warehouses, the dimensional method includes the use of fact and dimension tables to keep track
of past data. Different data modelling strategies are better suited to specific purposes. Normalised
ER models are designed to remove data redundancy and to conduct Add, Modify and Remove actions
rapidly, as well as to get data into a database.
The dimensional model and the relational model differ in the following ways:
Dimensional model: Data is de-normalised and used in data warehouses and data marts. Relational model: Data is normalised and used for OLTP.
Dimensional model: Few tables; fact tables are connected to dimension tables. Relational model: Several tables with chains of relationships among them.
Dimensional model: Non-volatile and time-invariant. Relational model: Volatile (several updates) and time-variant.
Dimensional model: Used for user-friendly, interactive, drag-and-drop multidimensional OLAP reports. Relational model: Used for normal reports.
Dimensional model: One fact table for data organisation. Relational model: One table per entity.
[Figure: Star schema with a central fact table containing Dealer_ID, Model_ID, Branch_ID, Date_ID, Units_Sold and Revenue, surrounded by dimension tables]
For example, as shown in the figure above, the fact table is in the center and has keys to each dimension table, such as Dealer_ID, Model_ID, Date_ID, Product_ID and Branch_ID, as well as other attributes such as Units_Sold and Revenue.
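A hedged SQL sketch of such a star schema is shown below; the table and column definitions follow the figure but are illustrative only, and only two of the dimension tables are spelled out:

CREATE TABLE dim_model  (model_id  INT PRIMARY KEY, model_name VARCHAR(50));
CREATE TABLE dim_branch (branch_id INT PRIMARY KEY, city       VARCHAR(50));

CREATE TABLE fact_sales (
  dealer_id  INT,
  model_id   INT REFERENCES dim_model(model_id),    -- key to a dimension table
  branch_id  INT REFERENCES dim_branch(branch_id),  -- key to a dimension table
  date_id    INT,
  units_sold INT,                                   -- measure
  revenue    DECIMAL(12, 2)                         -- measure
);

-- A typical star-join query: total revenue by branch city.
SELECT b.city, SUM(f.revenue) AS total_revenue
FROM fact_sales f JOIN dim_branch b ON f.branch_id = b.branch_id
GROUP BY b.city;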
To process a query, SQL Server goes through three steps: parsing, optimising and execution. The optimised plan produced by the second step is executed in the third phase, during which calls are made to the Storage Engine to obtain the data that forms the result of the query you are running.
[Figure: Query execution process: the SQL query passes through the parser/translator and the optimiser, which consults the database catalog, before execution]
Assume a company provides services to customers. Every sale is a fact that happens, and all these facts
are documented in the fact table. For example, consider a sales fact table together with a customer dimension table; a sketch appears after this paragraph.
In this case, the fact table’s customer ID column is the foreign key that connects to the dimension table.
Following the linkages reveals that row 2 of the fact table captures the information that customer 3,
Sally Jones, purchased two goods on day 8. The corporation would also have a product table and a
timeline to establish what Sally purchased and when she purchased it.
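A hedged sketch of this example follows; only the row-2 values (day 8, customer 3, Sally Jones, two goods) come from the text, while the table layouts and everything else are illustrative:

CREATE TABLE customer_dim (customer_id INT PRIMARY KEY, customer_name VARCHAR(50));
CREATE TABLE sales_fact   (day_id INT, quantity INT,
                           customer_id INT REFERENCES customer_dim(customer_id));

INSERT INTO customer_dim VALUES (3, 'Sally Jones');
INSERT INTO sales_fact   VALUES (8, 2, 3);   -- row 2: customer 3 bought 2 goods on day 8

-- Following the foreign key from the fact row to the dimension row:
SELECT f.day_id, d.customer_name, f.quantity
FROM sales_fact f JOIN customer_dim d ON f.customer_id = d.customer_id;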
There are physical and data constraints when creating fact tables. The eventual size of the item, as
well as access pathways, should be taken into account. Adding indexes can assist with both of these
issues. However, there should be no constraints in terms of logical design. Tables should be constructed
with current and future needs in mind, with as much flexibility as possible built in to allow for future
expansions without having to recreate the data.
For example, the characteristics of a customer dimension may contain first and last name, birth date,
gender and so on, whereas the attributes of a website dimension would comprise the site name and URL.
A primary key column in a dimension table uniquely identifies each dimension record (row).
Fact tables use keys to relate to dimension tables. A system-generated key is used to uniquely identify a
row in a dimension table when it is created in a data warehouse. This key is sometimes referred to as a
surrogate key.
In the dimension table, the surrogate key serves as the primary key. The surrogate key is assigned to
the fact table, and a foreign key is established between the two tables. When the data is connected, it is
joined in the same way as any other join in the database.
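For instance, a reporting query joins the fact table to a dimension table on the surrogate key just like any other join. The following is a hedged sketch in which the table and column names (sales_fact, cust_dim, cust_key) are illustrative assumptions:

SELECT d.cust_name,
       SUM(f.revenue) AS total_revenue
FROM   sales_fact f
JOIN   cust_dim   d ON d.cust_key = f.cust_key   -- join on the surrogate key
GROUP BY d.cust_name;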
Factless fact tables contain nothing but keys; they still represent focused events that are identified by the combination of their dimension keys. Figure 4 shows a factless fact table:
[Figure 4: Factless fact table — a fact table containing only keys (Employee Key, Customer Key, Account Key) that reference the employee and customer dimensions]
This is viable if the dimension changes only occasionally, such as once or twice a year. For example, the product dimension, which includes thousands of rows, changes seldom, making Type 2 updates manageable. Even for a customer dimension, where the number of rows runs into millions, Type 2 updates remain workable as long as changes are infrequent. Type 2 modifications become problematic and difficult when customer dimensions change rapidly.
If the dimension table is rapidly changing and huge, divide it into one or more smaller dimension tables.
A table can only have one primary key, which might be a single or several fields. Composite keys are
utilised when numerous fields are used as the main key.
If a table has a primary key declared on any field(s), no two records can have the same value in that field(s). Keep in mind that you will apply these principles while designing database tables.
If a table lacks a natural primary key, we must generate one artificially to uniquely identify each row in the table; this key is known as the surrogate key or synthetic primary key of the table. The surrogate key, however, is not always usable as the primary key. If numerous items in the database are linked to the same surrogate key, there is a many-to-one relationship between the primary keys and the surrogate key, and in that case the surrogate key cannot be used as the primary key.
A foreign key is a key that connects two tables. This is also known as a reference key. A foreign key is a
column or set of columns whose values match those of a Primary Key in another table. The connection
between the two tables aligns the Primary Key in one table with the Foreign Key in the other. If a table has a primary key declared on any field(s), no two records can have the same value in that field(s).
To understand the concept of the foreign key, you can create a table such as the following (a minimal sketch, which assumes an existing DEPT table whose primary key is Dept_ID):
CREATE TABLE EMP (
  Emp_ID   NUMBER PRIMARY KEY,
  Emp_Name VARCHAR2(50),
  Dept_ID  NUMBER REFERENCES DEPT (Dept_ID)   -- foreign key referencing DEPT
);
Data are summarised in the aggregate table; aggregated tables make up the materialised view. For example, in sales we may only have transactions at the date level. Suppose we want to produce a report that shows annual sales by product. In such circumstances, we group the data values into week, month, quarter and year, and aggregate functions are used to populate these tables with the summarised data.
An aggregate table provides aggregated data that can be computed using various aggregate functions such as count, avg, min, max and so on. In an OLAP database, aggregate tables are the most commonly used tables.
Aggregate functions are functions that combine the values of a table or column into a single value. The information is placed into the aggregate table with the help of aggregate functions. Aggregate tables are used for scalability, and the information is returned quickly. Aggregations implemented at the database level enhance query performance because queries hit the aggregate table, rather than the detailed base table, to fetch the data.
For example, if a table includes data from the years 2020 and 2021, and the customer needs an exact count of records in the table periodically, such as weekly, monthly and annually, we must establish an aggregate table that stores the monthly count. For quarterly figures, we must create a separate table and store the quarterly count in it, and we must use those tables in reports so that report performance improves dramatically.
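A monthly aggregate table of this kind could be pre-computed with a simple summary query, for example as in the sketch below, where the sales table and the monthly_sales_agg name are assumptions made for illustration:

-- Pre-compute monthly counts and totals so that reports hit this small table
-- instead of scanning the detailed sales table
CREATE TABLE monthly_sales_agg AS
SELECT TRUNC(invoice_date, 'MM') AS sales_month,
       COUNT(*)                  AS txn_count,
       SUM(total_amt)            AS total_amount
FROM   sales
GROUP BY TRUNC(invoice_date, 'MM');

A similar statement grouping by quarter would populate the separate quarterly aggregate table mentioned above.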
The logical structure of a data warehouse or data mart is described by the Fact Constellation Schema.
Fact Constellation Schema may be used to create designs using a collection of de-normalised FACT,
Shared and Conformed Dimension tables.
[Figure: Fact constellation — two fact tables (Product, Quarter, Region, Revenue and Product, Future_qtr, Region, Projected_revenue) sharing a conformed Product dimension (Prod_no, Prod_name, Prod_descr, Prod_style, Prod_line)]
The Fact Constellation Schema is a complex database design, which makes it more difficult to summarise information. It may be used to link aggregate fact tables or to decompose a complicated fact table into separate, simpler fact tables.
Dimension tables are shared by the fact tables of the STARS in a family. The time dimension is usually
shared by the majority of the fact tables in the group. The time dimension is likely to be shared by all
three fact tables in the above example. Conversely, a single dimension table may be shared by fact tables belonging to many different STARS.
[Figure: A family of STARS — several fact tables sharing common dimension tables]
If you are in a business such as banking or telephone services, it makes sense to collect individual transactions as well as snapshots at defined intervals. You may then utilise families of STARS to model both.
Dimensional Modelling (DM) is a data structure approach that has been improved for data storage
in a data warehouse.
A dimensional model for a data warehouse has been used to analyse, consolidate, and assess
numeric data.
A dimensional model in a data warehouse organises data in such a way that it is easy to access
information and produce reports.
Dimensional modelling uses a cube procedure to represent data, rendering OLAP data management
appropriate for logical data representation.
In data warehouses, the dimensional method includes the use of fact and dimension tables to keep
track of past data.
The star schema is the most basic kind of Data Warehouse schema. It is referred to as a star schema
because its structure resembles a star.
A snowflake schema is an expansion of a star schema in which the dimensions are normalised into additional dimension tables. It is named a snowflake because the diagram looks like a snowflake.
A fact table is a core table in a data warehouse’s star structure. A fact table holds quantitative data
for analysis and is frequently denormalised.
A dimension table is a table that maintains qualities, or dimensions, that describe the objects in a
fact table.
Factless fact tables contain no measures; they merely record the occurrence of events through their dimension keys.
Updates owing to changes occur quite seldom in dimension tables. When compared to fact tables, dimension tables are more stable.
A primary key is a column in a database table that uniquely identifies each row/record. Primary
keys must have distinct values.
A surrogate key, also known as a synthetic primary key, is a key generated automatically by the database when a new record is inserted into a table, and it may be defined as the primary key of that table.
A foreign key is a key that connects two tables. This is also known as a reference key.
In general, aggregate tables are just database tables that hold aggregated values.
Aggregate tables are tables that hold existing warehouse data that has been aggregated by a set of
dimensions.
A fact constellation is made up of two or more fact tables that share one or more dimensions. It is
also known as the Galaxy schema.
10.15 GLOSSARY
Dimensional modelling (DM): A data structure approach that has been improved for data storage
in a data warehouse
Star schema: The most basic kind of Data Warehouse schema because its structure resembles a star
Snowflake schema: An expansion of a star schema in which dimensions are normalised into additional dimension tables
Fact table: It holds quantitative data for analysis and is frequently denormalised
Dimension table: A table that maintains qualities or dimensions, that describe the objects in a fact
table
Primary key: A column in a database table that uniquely identifies each row/record
Surrogate key: A key generated automatically by the database when a new record is inserted into a table, which may be defined as the primary key of that table
Aggregate tables: Tables that hold existing warehouse data that has been aggregated by a set of
dimensions
https://www.astera.com/type/blog/dimensional-modeling-guide/
https://www.holistics.io/books/setup-analytics/kimball-s-dimensional-data-modeling/
Discuss with your friends about dimensional modelling. Also, discuss the real-life examples of
dimensional modelling.
UNIT
11 Online Analytical Processing – OLAP
Names of Sub-Units
Need for OLAP, OLTP vs. OLAP, OLAP Operations: Roll-up, Drilldown, Slice, Dice, Pivot, OLAP Models:
ROLAP, MOLAP, HOLAP, and DOLAP
Overview
This unit begins by discussing the concept of online analytical processing (OLAP). Then, the unit
discusses the need for OLAP. Further, the unit explains the difference between OLTP and OLAP. Next,
the unit discusses the OLAP operations. Towards the end, the unit discusses the OLAP models.
Learning Objectives
Learning Outcomes
http://web.mit.edu/profit/PDFS/SlaughterA.pdf
11.1 INTRODUCTION
To attain informed business decisions and take the required actions, business analysts and executives
often need to analyse business data from different aspects. Consider the case of a business organisation
that sells different household products across different regions of the country. To arrive at an appropriate
decision about increasing the sale of a particular product, business executives need to analyse the data
according to the market trends and popularity of the product in various regions. The business executives
may also need to compare the marketing statistics of a product during different seasons of a year or a
particular period in different years. To perform the comparison, the business data needs to be stored in
multidimensional databases. The selective extraction and analysis of the data are then accomplished by
using Online Analytical Processing (OLAP) method.
OLAP is a processing approach that performs multidimensional analysis of business data. It also
facilitates the performance of complex calculations, trend analysis and complicated data modeling.
OLAP is different from Online Transaction Processing (OLTP), which is another processing approach
characterised by various online transactions, including INSERT, UPDATE and DELETE. OLTP processing
approach gives very fast query processing and helps in maintaining data integrity in multi-access
environments.
Before starting with the OLAP study in detail, we must first understand the need for OLAP. The chapter
starts on this note and then it clearly indicates the difference between OLAP and Online Transaction
Processing (OLTP). It introduces how OLAP is efficient in multidimensional analysis and talks about
hypercubes. Finally, the chapter talks about various popular OLAP tools in use.
As users required an interactive interface for complex analysis and aggregate calculations, there was a pressing need to perform multidimensional analysis effectively. Hence, the need for OLAP arose.
OLAP is intended for data access and analysis to aid management users in strategic decision-making.
The goal of OLAP technology is to aggregate datasets into multidimensional displays while maintaining
system performance. OLTP systems are “customer-oriented,” whereas OLAP systems are “market-
oriented.”
It is difficult to combine OLAP and OLTP into a single centralised database system. The dimensional data design model utilised in OLAP is far better suited to analytical querying than the relational design employed in OLTP systems. OLAP may utilise a single central database as a data source, but OLTP may use several data sources from separate database locations. The dimensional architecture of OLAP is unsuitable for OLTP systems, owing to redundancy and the loss of referential integrity of the data. Firms therefore opt for two distinct information systems, one OLTP and one OLAP. We may infer that OLTP systems are designed to get data into computers, whereas OLAP systems are designed to get data or information out of computers.
with the location aggregated by city values, the time aggregated by quarters, and the item aggregated
by item kinds.
11.4.1 Roll-up
The roll-up operation (also known as the drill-up or aggregation operation) aggregates data in a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is analogous to zooming out on a data cube. The figure depicts the outcome of a roll-up operation on the location dimension. The location hierarchy is defined as the order street < city < province or state < country. The roll-up procedure aggregates the data by ascending the location hierarchy from the city level to the country level.
When dimension reduction is used to perform a roll-up, one or more dimensions are removed from the cube. Consider a sales data cube with two dimensions: location and time. Roll-up can be accomplished by removing the time dimension, which results in an aggregate of total sales by location rather than by location and time.
Figure 1 shows the roll-up operation:
[Figure 1: Roll-up operation — quarterly sales (Q1–Q4) for the cities Chicago, New York, Toronto and Vancouver are aggregated up the location hierarchy to the countries USA and Canada]
In Figure 1, we are performing the roll-up operation for hierarchies of location dimensions (from cities
to countries).
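In SQL terms, a roll-up of this kind corresponds to grouping at a higher level of the concept hierarchy. The sketch below assumes a simple sales_cube table with city, country, quarter and sales_amount columns:

-- Detail level: sales by city and quarter
SELECT city, quarter, SUM(sales_amount) AS sales
FROM   sales_cube
GROUP BY city, quarter;

-- Rolled up: ascend the location hierarchy from city to country
SELECT country, quarter, SUM(sales_amount) AS sales
FROM   sales_cube
GROUP BY country, quarter;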
11.4.2 Drilldown
The drill-down operation (also known as roll-down) is the inverse of the roll-up procedure. Drilling down
is similar to zooming in on a data cube. It moves from less detailed to more detailed data. Drilling down
may be accomplished by either stepping down a concept hierarchy for a dimension or by introducing
new dimensions.
[Figure 2: Drill-down operation — sales for the cities (Chicago, New York, Toronto, Vancouver) and item types (Mobile, Modem, Phone, Security) are broken down from the quarter level (Q1–Q4) to the month level (Jan–Dec)]
Figure 2 depicts a drill-down operation on the time dimension conducted by stepping down a concept hierarchy defined as day, month, quarter and year. Drill-down occurs by descending the time hierarchy from the quarter level to the more detailed month level.
A drill-down may also be achieved by adding a new dimension to a cube since it adds more information
to the supplied data.
11.4.3 Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. A slice operation is used when the user wishes to make a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional view. In other words, the slice operation selects one particular dimension of the given cube, producing a sub-cube. Here, we get details along one dimension by keeping the other dimensions fixed at a particular level. Figure 3 shows the slice operation:
[Figure 3: Slice operation — fixing Location = Maharashtra reduces the cube to a two-dimensional view of products P1–P4 over the years 2010–2015]
11.4.4 Dice
The logical partitioning of a cube considering more than one dimension is called dice operation. Here,
we get a sub-cube by selecting two or more dimensions. Figure 4 shows dice operation:
[Figure 4: Dice operation — a sub-cube restricted to the locations Maharashtra and M.P. and the years 2010–2011 for products P1–P4]
Figure 4 shows a dice operation considering the time in 2010 and 2011 and locations as Maharashtra
and M.P.
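Against a relational star schema, slice and dice simply correspond to fixing one or more dimensions with filter conditions. The sketch below assumes a sales_cube table with location, year, product and sales_amount columns:

-- Slice: fix a single dimension (location = 'Maharashtra')
SELECT year, product, SUM(sales_amount) AS sales
FROM   sales_cube
WHERE  location = 'Maharashtra'
GROUP BY year, product;

-- Dice: restrict two or more dimensions to sub-ranges
SELECT location, year, product, SUM(sales_amount) AS sales
FROM   sales_cube
WHERE  location IN ('Maharashtra', 'M.P')
AND    year IN (2010, 2011)
GROUP BY location, year, product;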
11.4.5 Pivot
A pivot action is also known as a rotation. Pivot is a visualisation procedure that rotates the data axes in
view to show the data in a different way. It might include shifting the rows and columns or transferring
one of the row measurements to the column dimensions. Pivot is a way of visualising a cube from
different dimensional perspectives. In pivot, an analyst can rotate a cube to see its different faces. It is a
visualisation operation. Figure 5 shows pivot operation:
[Figure 5: Pivot operation — the cube is rotated so that the location axis (Maharashtra, U.P., A.P.) and the product axis (P1–P4) are interchanged against time (2010–2011)]
[Figure: ROLAP architecture — the client issues complex SQL requests that are handled by an analytical server working against the RDBMS server hosting the data warehouse]
In this architecture, the user interacts with the database frequently. The query is processed by the OLAP
server and/or database server.
This method allows for the creation of several multidimensional views of two-dimensional relational
databases, removing the need to structure data around the desired view. Few products in this category
have robust SQL engines to handle the challenge of multidimensional analysis.
The performance of a ROLAP is determined by the size of the data being processed; it might be sluggish
when the data being processed is huge and quick when the data is little. Although any SQL tool may
access ROLAP, these tools have limitations since SQL statements do not match all of the demands of all
users, especially when doing complicated computations. This includes writing several SQL statements
to manage user requests and being able to write SQL statements based on the optimiser of the DBMS
engine. While ROLAP’s flexibility is a novel feature, there are some products that need the usage of a
denormalised database design.
[Figure 7: MOLAP architecture — required summarised data from the data warehouse (RDBMS server) is used to create and precompute data cubes on the MDBMS server (MDDB), which the client accesses through a proprietary data language]
The architecture of MOLAP system is shown in Figure 7. The system consists of:
OLAP client that provides the front-end GUI for giving the queries and obtaining the reports.
OLAP server is also known as Multidimensional Database Management System (MDBMS) server.
This is a proprietary database management system which stores the multidimensional data in
‘multidimensional cubes’ and contains the data in summarised form, based on the type of reports
required.
A machine that carries out the data staging, which converts the data from RDBMS format to MDBMS
and sends the ‘multidimensional cube’ data to OLAP server.
MOLAP products may encounter problems when updating and querying models with more than ten dimensions.
MOLAP is incapable of storing comprehensive data.
If the data set is dispersed, storage usage may be poor.
Hence, multidimensional OLAP is used when the database size is small, say less than 100 GB and the
data is available in the summarised form.
So, we can take a middle path. In HOLAP, the relational database structure is preserved to handle simple, user-required queries. Instead of computing all the cubes and storing them in a MOLAP server, the HOLAP server stores only some important, partially computed cubes or aggregates, so that when higher scalability and faster computation are required, the necessary aggregates can be computed efficiently. Thus, HOLAP possesses the advantages of both ROLAP and MOLAP. Figure 8 shows the architecture of HOLAP:
[Figure 8: HOLAP architecture — simple SQL queries are served by the RDBMS server, while aggregate queries are served from the multidimensional store]
In a HOLAP system, the user may access data from both the relational and multidimensional
databases by utilising basic queries for relational databases and complicated aggregate searches for
multidimensional databases. Because the querying process is complex and the same queries may be
performed on a multidimensional system, it is more effective to design a cached data server just for
aggregates and data that is repeatedly requested from the relational database.
The Analysts or Reporting experts, who are the Users, query the HOLAP to retrieve the relevant data.
Querying encompasses both relational and multidimensional databases. Simple searches for relational
databases convey data to the cache system, while complicated queries are kept in the cache system with
the data queried for the current flow. The data retrieved from both systems is then offered to the user
for analysis or creation of reports for corporate decision-making processes.
This improves the speed of the HOLAP system while consuming a large amount of storage space. Because
HOLAP can benefit from both ROLAP and MOLAP, the loss of storage capacity is unavoidable.
11.5.4 DOLAP
When the OLAP communicates with desktop databases to retrieve the data, it is called Desktop OLAP
(DOLAP). It is a technology by which the user is able to download a part of the OLAP locally and then may
perform the required processing locally. DOLAP provides the data after performing a multidimensional
analysis of the client machine, which has already obtained the data from multidimensional database
servers.
DOLAP is a single-tier technology that enables users to download a small hypercube on their desktop
machine from a central point or server to perform multidimensional analyses. This central point can be
a data mart or data warehouse. Note that in the case of DOLAP, during the process of multidimensional
analyses the client machine remains disconnected from the server. The downloaded cubes, generally,
contain summarised data or are organised in a fixed structure of dimensions.
The only disadvantage of DOLAP is that it provides limited functionality and data capacity.
Ans. In the scenario of the hospital with three dimensions Doctor, Patient, Time and measures Count
and Fees, the basic OLAP cube is shown as follows:
[Figure: OLAP cube for the hospital scenario — dimensions Patient (P1–P3), Time (quarters Q1–Q4) and Doctor, with the fact Count (number of times a patient visited a doctor)]
And, there are basically five OLAP operations, which are as follows:
Slice: The following figure shows the Slice operation:
[Figures: Slice, dice and drill-down on the hospital cube — slicing on patient P1 across quarters Q1–Q4, dicing on patients P1–P2 for quarters Q1–Q2, and drilling down from quarters to months for patients P1–P3]
[Figures: Roll-up and pivot on the hospital cube — quarterly counts for patients P1–P3 rolled up to the years 2011–2012, and the cube rotated to view doctor D2]
The roll-up operation (also known as the drill-up or aggregation operation) aggregates data in a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
The drill-down operation (also known as roll-down) is the inverse of the roll-up procedure. Drilling
down is similar to zooming in on a data cube. It moves from less detailed to more detailed data.
A slice is a subset of cubes that correspond to a single value for one or more-dimension members.
The logical partitioning of a cube considering more than one dimension is called dice operation.
Pivot is a visualisation procedure that rotates the data axes in view to show the data in a different
way.
Relational OLAP is the preferred technology when the database size is large, i.e., greater than 100
GB.
MOLAP architecture was proposed with the idea that multidimensional cubes must be pre-computed.
In HOLAP, the relational database structure is preserved to handle simple and user-required queries.
DOLAP is a technology by which the user is able to download a part of the OLAP locally and then
may perform the required processing locally.
11.8 GLOSSARY
https://minigranth.in/data-warehouse-tutorial/olap-operations
https://hevodata.com/learn/olap-models/
Discuss with your friends the concept of OLAP and OLTP. Also, discuss the operations and models of
OLAP.
UNIT
12 ETL Process
Names of Sub-Units
Need of ETL Process, Data Extraction, Identification of Data Sources, Immediate Data Extraction,
Deferred Data Extraction, Data Transformation, Techniques of Data Loading, Loading the Fact Tables,
Loading the Dimension Tables, Challenges in ETL Functions
Overview
The unit begins by discussing the concept of ETL and the need of ETL process. Then, the unit explains
the data extraction. Further, the unit outlines the data transformation and tasks involved in data
transformation. Next, the unit covers the techniques of data loading. After this, the unit explains how
to load the fact tables and dimension tables. Towards the end, the unit sheds light upon the challenges
in ETL functions.
Learning Objectives
Learning Outcomes
https://www.immagic.com/eLibrary/ARCHIVES/GENERAL/WIKIPEDI/W121031E.pdf
12.1 INTRODUCTION
ETL is a process that takes data from various source systems, transforms the data (such as applying
computations, concatenations and so on) and ultimately loads the data into the Data Warehouse system.
ETL stands for Extract, Transform and Load.
It is easy to believe that constructing a Data warehouse entails merely pulling data from many sources
and loading it into a Data warehouse’s database. This is far from the case and a sophisticated ETL
procedure is required. The ETL process, which is technically demanding, necessitates active participation
from a variety of stakeholders, including developers, analysts, testers and senior executives.
To preserve its usefulness as a decision-making tool, the data warehouse system must evolve in tandem
with business developments. ETL is a periodic (daily, weekly, monthly) function of a data warehouse
system that must be agile, automated and properly documented.
In the first step of ETL, data is extracted from data stores such as text files, spreadsheets or operational source systems. Only the data that has changed since the last execution of the ETL process is extracted. Here we use different methods of extraction, such as logical and physical extraction methods. Then, the data
cleaning procedure is executed. The next step is a transformation where we can include simple data
conversions and extremely complex data scrubbing techniques at the same time. Error correction
technique is applied in this step. The final step is loading where we load the data into the data warehouse.
physically. In this method of extraction, we have to consider whether the distributed transactions are using the original source objects or prepared source objects.
Offline extraction: In this method of extraction, the data do not deal with the sources directly, but
they are produced outside the source system. The data are either in a pre-existing format (for
example, archive logs, transportable tablespaces or redo logs) or designed by some extraction
routine. A few of these external structures are:
Flat files: In flat files, the data are in a defined, generic format. Some additional information is
required about the source object for further processing.
Dump files: Dump files are created in the Oracle-specific format. The information about the
objects can be included or may not be included. It depends on the incorporated utility.
Redo and archive logs: Here, the information is contained in special, additional dump files.
Transportable tablespace: A powerful mechanism for extracting and moving large volumes of data between Oracle databases. Using transportable tablespaces is advisable wherever possible, because they offer advantages in efficiency and reliability over the other extraction techniques.
To store data in a data warehouse, the first step is the identification of the source. It is one of the most
critical steps involved in the process of data extraction.
The steps involved in source identification are as follows:
Every data item stored in the data warehouse needs to be associated with the input source.
Identify the fact and various attributes involved with the dimension tables.
For multiple data sources for the same field, use the proper preferred source.
Use appropriate rules to map a single data item to be distributed to multiple fields.
Find missing values, corrected values and default values.
[Figure: Immediate data extraction, Option 1 — capture through transaction logs: trigger programs and the DBMS transaction log files capture changes from the source databases of the operational systems]
[Figure: Deferred data extraction — Option 1: capture based on date and time stamps; Option 2: capture by comparing files, where file comparison programs compare today's extract with yesterday's extract taken from the source operational systems]
If any source data does not comply with the transformation rules, it is rejected before being loaded into the target DW system and is stored in a reject file or reject table. Transformation rules are not specified for straight-load columns, whose data pass from source to destination without any changes. Data transformations may therefore be divided into two types: basic and sophisticated. Column conversions and data structure reformatting are typical examples of data transformations.
During the data transformation process, you must decode such codes into correct values that
business users can comprehend. As a result, the preceding codes can be altered to Active, Inactive
and Suspended.
8. Values calculated and derived: DW can hold additional column data for computations by taking
into account the source system data. Before saving information in DW, you must do the calculations
based on the business logic.
9. Conversion of Date/Time: This is one of the most important data kinds to focus on. The date/time
format may change between source systems. For example, the data in one source could be dated
December 15, 1999. Another source could have the same date as of 12/15/1999. As a result, all date/
time values should be translated into a standard format throughout the data transformation.
10. De-duplication: If there are duplicate records in the source system, ensure that only one record is
imported into the DW system.
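Date standardisation and de-duplication of the kind described above can be expressed in SQL along the following lines. This is only a sketch; the staging table sales_staging, the cleansed table sales_clean and their columns are assumed for illustration:

-- Standardise a mixed text date format while moving data out of staging
INSERT INTO sales_clean (cust_id, invoice_date, total_amt)
SELECT cust_id,
       TO_DATE(invoice_date_txt, 'MM/DD/YYYY'),   -- assumed source format
       total_amt
FROM   sales_staging;

-- De-duplication: keep a single record per business key
DELETE FROM sales_clean a
WHERE  a.ROWID > (SELECT MIN(b.ROWID)
                  FROM   sales_clean b
                  WHERE  b.cust_id = a.cust_id
                  AND    b.invoice_date = a.invoice_date);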
Figure 3 shows the flow of data transformation:
[Figure 3: Data transformation flow — incoming data is checked for validity; valid data undergoes the transformation tasks (splitting, joining, conversions, summarisation, format revisions, decoding, etc.), while invalid data is reported as an error]
As soon as the data are loaded into a database, the SQL operations are performed for data transformation.
The four basic techniques for implementing SQL data transformations are as follows:
CREATE TABLE ... AS SELECT and INSERT /*+APPEND*/ AS SELECT: The CREATE TABLE ... AS SELECT (CTAS) statement is a powerful tool for transforming large sets of data. The INSERT /*+APPEND*/ ... AS SELECT statement offers the same capabilities for use with existing database tables.
Generally, data transformation is done through data substitution. In data substitution, a few or all of the values of a single column are changed. Let us assume that our sales table has a column named channel_id.
In a data warehouse, data can be received from multiple points/sources. This column indicates whether a given sales transaction was generated by the company's own sales force (a direct sale) or by a distributor (an indirect sale). Therefore, any NULL values in this column must be given a proper key value.
In the following example, we use a SQL function while inserting the various fields of each record into the sales table.
The structure of the source table sales_monthwise_direct is as follows:
DESC sales_monthwise_direct
Name Null? Type
Invoice_DATE DATE
Item_ID NUMBER
Cust_ID NUMBER
Promocode_ID NUMBER
Total_Amt NUMBER
Total_Quantity NUMBER
In the following statement, the data are inserted from sales_monthwise_direct into the sales table. Here, we truncate the invoice date to midnight using the TRUNC SQL function, and we use a fixed channel ID with a given value of 13:
INSERT INTO sales SELECT item_id, cust_id, TRUNC(Invoice_date), 13,
       promocode_id, Total_Quantity, Total_Amt
FROM sales_monthwise_direct;
Transforming Data Using UPDATE: Data substitution can also be carried out by using an UPDATE statement to adjust the sales.channel_id column. The UPDATE statement gives the right result. However, if a large number of rows need to be adjusted as part of the data substitution transformation, then the CREATE TABLE ... AS SELECT (CTAS) statement proves more efficient than an UPDATE statement.
Transforming Data Using MERGE: The MERGE statement is used to update or insert a row conditionally into a table, with the matching conditions specified in the ON clause (see the sketch after this list).
Transforming Data Using Multitable INSERT: In data warehouse environments, it is quite common to fan the same source data out into several target objects, and the multitable INSERT statement provides a single-SQL-statement solution for such transformations. Depending on the business transformation rules, the data may end up in several targets or in exactly one target, and the insertion may be conditional (based on business rules) or unconditional. The multitable INSERT statement offers clear benefits over n independent INSERT ... SELECT statements when multiple target tables are involved, because the independent statements would process the same source data n times and so increase the transformation workload n times.
For better performance in the direct-load mechanism, we can use the existing statements, for
example, INSERT…SELECT, in parallel with the new statements.
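As referenced above, a MERGE-based data substitution might look like the following sketch, in which the sales_updates source table and the column names are assumptions made for illustration:

-- Update matching rows and insert new rows in a single statement
MERGE INTO sales s
USING sales_updates u
ON (s.item_id = u.item_id AND s.invoice_date = u.invoice_date)
WHEN MATCHED THEN
  UPDATE SET s.total_amt  = u.total_amt,
             s.channel_id = NVL(u.channel_id, 13)   -- substitute NULL channels
WHEN NOT MATCHED THEN
  INSERT (item_id, invoice_date, channel_id, total_amt)
  VALUES (u.item_id, u.invoice_date, NVL(u.channel_id, 13), u.total_amt);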
A transformation that is processed using a procedural language can be encapsulated using the ETL
process.
The mechanisms which are used for loading a data warehouse are as follows:
Loading a data warehouse with SQL*Loader: In this type of loading, the raw data must be made accessible inside the database before any data transformation can take place, so the data are first loaded into the database. This technique is used when the data are transported as flat files.
We use SQL*Loader for transferring data from flat files to an Oracle data warehouse. When this
data load happens, SQL*Loader is used to implement basic data transformations. While using
direct-path SQL*Loader, various data manipulation operations, such as converting data type and
handling simple NULL values, can be automatically resolved during the data load. Generally, direct
path loading is used by data warehouses to enhance performance.
The conventional-path loader provides richer facilities for data transformation than the direct-path loader: SQL functions can be applied to any column as the values are being loaded, giving a rich transformation capability during the load operation. However, the direct-path loader is faster than the conventional-path loader, which is why the conventional-path loader is considered only when the amount of data to be loaded and transformed is small.
The file sh_sales.dat contains sales transaction data aggregated at a daily level. It is an external file, and not all of its columns are loaded into the sales table. The same file is also used for loading the second fact table of the sh sample schema.
Loading a data warehouse with external tables: The external data sources are also handled using
external tables. External tables are those that do not reside in the databases. They can be in any
format. We use the external data in the form of a virtual table, which is an external table feature of
Oracle. You can query this virtual table and join it directly or in parallel without loading the external
data in the database. SQL, PL/SQL and Java are used to access the external data.
By using external tables, the loading phase and the transformation phase are pipelined, providing uninterrupted data streaming. The data no longer needs to be staged before transformations or comparisons can be applied to it.
For example, the functionality of conversion of a conventional load can be used for a direct-path
statement like INSERT AS SELECT in conjunction with the SELECT statement from an external table.
The major difference between external tables and regular tables is that the external tables are read-
only. Therefore, no DML operations, such as UPDATE or INSERT or DELETE, can be executed and no
indexes can be created on them. External tables are particularly useful in places where the complete
external source has to be joined with the existing database objects or where the data have to be
transformed in a complex manner. For example, you may use the arbitrary SQL transformation and
direct path insert method instead of different SQL*Loaders.
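A minimal external table over a flat file might be declared as in the sketch below. The directory object, delimiter and column list are assumptions for illustration and are not the actual sh schema definitions:

CREATE TABLE sales_transactions_ext (
  prod_id       NUMBER,
  cust_id       NUMBER,
  channel_id    NUMBER,
  amount_sold   NUMBER,
  quantity_sold NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir               -- assumed directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('sh_sales.dat')
)
REJECT LIMIT UNLIMITED;

-- The external data can then be aggregated and loaded directly, for example:
INSERT /*+ APPEND */ INTO costs_fact          -- assumed target fact table
SELECT prod_id, channel_id, SUM(amount_sold)
FROM   sales_transactions_ext
GROUP BY prod_id, channel_id;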
For example, the external file sh_sales.dat can be represented by an external table named sales_transactions_ext. Suppose the Product department wants to analyse costs using the costs fact table of the sh schema. The operational source data are the same as for the sales fact table, but the costs fact table has a coarser granularity than the sales fact table because it aggregates over all the distribution channels, so some of the dimensions are suppressed.
Because of the suppression of those dimensions, the data cannot be loaded directly into the costs fact table; the detailed information must first be aggregated. This is difficult to achieve with SQL*Loader alone, but it can be handled within the external table framework.
Loading data warehouse with OCI and direct-path APIs: Flat file system is not needed when the
computation and transformation processes are performed outside the database.
Loading data warehouse with export/import: When we want to insert data in the same way as we
have received the data into the target system, the features of import and export are used.
Loading takes place in three phases in the data warehouse. The first is ‘Initial load’ in which we
populate all the data warehouse tables for the first time. The second one is the ‘Incremental Load’ in
which we apply ongoing changes periodically and the third one is ‘Full Refresh’ in which contents of
one or more relations are erased completely and then reloaded with fresh data. These three phases
are explained as follows:
Initial load: In initial loading, you can load the whole data warehouse in a single run or split the
load into separate sub loads.
Incremental load: Incremental loading performs loading periodically and keeps a track of the
changes using the source system after the initial load is performed.
Full refresh: Data are periodically rewritten in the entire data warehouse. It means the entire
existing data will be erased or truncated before writing the new data.
The data in the data warehouse can be applied using the following four modes:
Load: The data are loaded in empty tables; if the tables already exist, then the data are erased
and reloaded.
Append: If the data do not exist, they are added to the record; else, if the tuple already exists, it
may be duplicated or rejected depending on the user.
Destructive merge: If the record is new, we add the record to the warehouse. The record is
updated in the case when there is a match in the primary key of the new record against the
existing record.
Constructive merge: If the primary key of the new record matches that of an existing record, we keep the old record and, at the same time, add the new record, marking it as superseding the old record.
For example, in the sales fact table, a measure can be the sales quantity presented in Table 1:
A data mart can contain one fact table to address one subject. In such a case, when several data marts
are integrated to create a data warehouse, it is important that the facts in each table mean the same
thing. Such facts (i.e., measures or metrics) that have the same meaning in different data marts are
called conformed facts.
Granularity of data is a very important factor in the design of a fact table. In a fact table, we can represent
data at different atomic levels called grains. In the above example of sales data, if the transaction data
do not record the model number of the product sold to each customer, then the sales volume represents
the quantity of the product, not each model. So, some useful data are lost and we cannot get business
intelligence related to models of the products. In the fact table, what detail has to be stored (i.e., what
granularity is required) is decided based on the business intelligence reports to be generated.
A dimension table has a primary key and a number of attributes. The primary key uniquely identifies
the row in the dimension table and is also used to link to the fact table. The attributes in this table
describe the dimension.
As an example, consider a data mart that analyses the sales data. In a multidimensional model, we
need to create a fact table and different dimension tables, such as product dimension, time dimension,
location dimension and customer dimension. Conceptually, this is very useful because we can analyse
sales data from different views or different dimensions. For example, we can analyse the sales patterns
of different brands/models, sales in different regions based on sales to different customers and sales at
different periods in a year. So, from the dimension tables, you can get the information, such as:
Persons of what average monthly income purchased a particular model?
When a particular model’s sales were the highest?
As another example, in a data warehouse for ATMs, the fact table contains the attributes: transaction
amount as the fact and also account number, transaction number and a number of foreign keys. The
dimension tables are transaction dimension, time dimension, location dimension and account dimension.
The time dimension is a dimension you will come across in many data marts and data warehouses. The
time dimension can contain the following attributes:
Time_ID (an integer)
CalendarMonth (small integer)
CalendarQuarter (small integer)
CalendarYear (small integer)
FiscalMonth (small integer)
FiscalQuarter (small integer)
FiscalYear (small integer)
If the calendar year and the fiscal year are the same, then there is no need for separate attributes for
calendar year and fiscal year.
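Such a time dimension might be declared as follows; this is only a sketch and the column precisions are assumptions:

CREATE TABLE time_dim (
  time_id          NUMBER PRIMARY KEY,   -- surrogate key
  calendar_month   NUMBER(2),
  calendar_quarter NUMBER(1),
  calendar_year    NUMBER(4),
  fiscal_month     NUMBER(2),
  fiscal_quarter   NUMBER(1),
  fiscal_year      NUMBER(4)
);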
The concept of a surrogate key is important in a data warehouse. Surrogate means ‘deputy’ or ‘substitute’.
Surrogate key is a small integer (say, 4 byte integer) that can uniquely identify a record in a dimension
table. However, it has no meaning. Some data warehouse experts suggest that the production keys
used in the databases should not be used in dimension tables as primary keys. Instead, in their place,
surrogate keys have to be used, which are generated automatically.
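Surrogate keys are typically generated from a database sequence, for example as in the sketch below, where the cust_dim table and its columns are assumed for illustration:

CREATE SEQUENCE cust_dim_seq START WITH 1 INCREMENT BY 1;

-- Assign the next surrogate key value as each dimension row is inserted;
-- the production key from the source system is kept as an ordinary attribute
INSERT INTO cust_dim (cust_key, cust_source_id, cust_name)
VALUES (cust_dim_seq.NEXTVAL, 'C-1001', 'Sally Jones');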
ETL is a process that takes data from various source systems, transforms the data and ultimately
loads the data into the Data Warehouse system.
The ETL process, which is technically demanding, necessitates active participation from a variety of
stakeholders, including developers, analysts, testers and senior executives.
ETL is a periodic function of a data warehouse system that must be agile, automated and properly
documented.
Data extraction is the process of collecting data from a database or SaaS platform so that it may be
duplicated to a destination built to allow online analytical processing, such as a data warehouse (OLAP).
ETL is used to prepare data for analysis or business intelligence (BI).
The methods used for extracting the data at the logical level are known as logical extraction methods.
In the online method of extraction, the data deal with the sources directly.
In offline extraction, the data do not deal with the sources directly, but they are produced outside
the source system.
Source identification determines which data sources will be used to populate the data warehouse.
In immediate data extraction, extraction occurs at the same time as the transactions are taking place at the source databases and files.
In the deferred data extraction technique, data extraction does not capture the changes in real-time.
Transformation is the process of applying a set of rules to extracted data before loading the data from the source system into the destination system.
Loading is relatively simpler than the other two processes. It is desirable to disable any constraints and
indexes before starting the load process and enable them back only after the load process completes.
A fact table is a central table that contains the measures or facts of a business process.
A dimension table gives the context of measure/metric indicated in the fact table.
12.13 GLOSSARY
ETL: A process that takes data from various source systems, transforms the data and ultimately
loads the data into the data warehouse system
Data extraction: The process of collecting data from a database or SaaS platform so that it may be
duplicated to a destination built to allow online analytical processing
Transformation: The process of applying a set of rules to extracted data before loading the data from the source system into the destination system
Fact table: The central table that contains the measures or facts of a business process
Dimension table: It gives the context of measure/metric indicated in the fact table
https://www.astera.com/type/blog/etl-process-and-steps/
https://medium.datadriveninvestor.com/understanding-extract-transform-and-load-etl-and-its-
necessity-in-data-analytics-world-with-an-64346016153d
Discuss with your friends the concept of ETL. Also, discuss how to implement the ETL process in data analysis.
UNIT
13 Business Intelligence (BI)
Names of Sub-Units
BI Architecture, BI in Today’s Perspective, BI Process, Applications of BI, Result Pattern and Ranking
Analysis, Balanced Scorecard, BI in Decision Modelling, Optimisation, and Ethics and BI
Overview
The unit begins by discussing the concept of business intelligence. Next, the unit discusses the BI
architecture and today’s perspective of BI. Further, the unit explains the applications of BI. The unit
also discusses the result pattern and ranking analysis. Then, the unit discusses the balanced scorecard
and decision model in BI. Towards the end, the unit discusses the optimisation and ethics in BI.
Learning Objectives
Learning Outcomes
https://www.jatit.org/volumes/research-papers/Vol9No1/9Vol9No1.pdf
13.1 INTRODUCTION
Business intelligence (BI) is the process of transforming data from an organisation into insights that
can be utilised to guide business decisions. BI analysts will access and analyse datasets using BI tools,
software or services, and then transform their findings into reports, summaries, dashboards, graphs,
charts or maps.
The introduction of sophisticated data visualisation and reporting technologies in recent years has
altered the discipline, allowing firms to leverage big data insights to find, develop and create new
business prospects.
Business intelligence (BI) is a collection of procedures, structures, and technology that transform raw
data into useful information that drives lucrative business activities. It is a collection of tools and
services designed to turn data into actionable insight and knowledge. Business intelligence (BI) has a
direct influence on an organisation’s strategic, tactical and operational business choices. BI allows for
fact-based decision-making based on previous data rather than assumptions and gut impressions. BI
tools analyse data and generate reports, summaries, dashboards, maps, graphs and charts to offer
users in-depth information on the nature of the business.
BI systems provide present (real-time), historical and predicted views of internally organised data
pertaining to all divisions inside a business, which greatly increases operational understanding and
decision-making.
Simply said, business intelligence is the act of identifying important trends or patterns in data to make
more efficient, accurate judgments about your company’s goals, objectives and plans.
A variety of business intelligence principles are implemented through accessible, interactive tools and
dashboards, a centralised place that allows you to easily delve deep into your data.
Business intelligence (BI) is a technology-driven process for evaluating data and delivering actionable
information to leaders, managers and employees to help them make educated business choices.
Organisations collect data from internal and external IT systems, prepare it for analysis, run queries
against the data and create data visualisations, BI dashboards and reports to make the analytics results
available to business users for operational decision-making and strategic planning as part of the BI
process.
The ultimate purpose of BI projects is to drive better business choices, allowing firms to grow revenue,
enhance operational efficiency and gain a competitive advantage over their competitors. To do this, BI
combines analytics, data management and reporting technologies, as well as numerous approaches for
organising and analysing data.
13.3 BI ARCHITECTURE
Business intelligence is described as a collection of mathematical models and analysis procedures that
employ existing data to develop information and knowledge that can be used in complicated decision-
making processes. A business intelligence system gives information and knowledge gathered from data
to decision-makers.
[Figure: Operational systems (logistics, marketing) and external data are fed through ETL tools into the data warehouse, which supports multidimensional cubes, exploratory data analysis, time series analysis, data mining, optimisation and performance evaluation]
Figure 1: BI Architecture
Business intelligence illuminates all parts of your business, making it easier to find opportunities for
development and allowing you to be proactive rather than reactive.
Useful information: An excellent Business Intelligence system identifies crucial organisational
patterns and trends. A BI system also enables you to comprehend the ramifications of various
organisational processes and changes, allowing you to make educated decisions and take
appropriate action.
Increased efficiency: BI Systems aid in the improvement of organisational efficiency, which in
turn boosts production and, perhaps, income. Business intelligence platforms enable organisations
to easily communicate critical information across departments, saving time on reporting, data
extraction and data analysis. Making information exchange easier and more efficient allows firms
to reduce duplicate roles and activities, allowing people to focus on their job rather than data
processing.
Sales intelligence: Both sales and marketing teams want to maintain track of their customers and
the majority of them use the Customer Relationship Management (CRM) program to accomplish so.
CRMs are intended to handle all customer contacts. They include a plethora of data and information
that may be evaluated and used for strategic goals since they hold all consumer contacts and
interactions. BI systems assist businesses in a variety of ways, including discovering new consumers,
tracking and keeping existing ones, and offering post-sale services.
Real-Time information: When executives and decision-makers must wait for reports to be generated
by several departments, the data is prone to human error and risks becoming obsolete before it
is even presented for evaluation. BI systems give customers real-time access to data through a
variety of tools like spreadsheets, graphic dashboards and scheduled emails. When using Business
Intelligence technologies, large volumes of data can be swiftly and precisely absorbed, evaluated
and delivered.
Competitive edge: In addition to all of these wonderful advantages, Business Intelligence may assist
you in gaining insight into what your rivals are doing, helping your firm to make informed decisions
and prepare for future ventures.
13.5 BI PROCESS
BI comprises a broad category of applications and techniques that are used for collecting, storing,
analysing and providing easy access to data to help business users in making better business decisions.
Business intelligence structures are developed to use all of your fundamental business information to
assist management in effective decision-making.
Reporting: The final task is to use the newly gathered knowledge to improve the business and to
prepare a report that helps in making efficient decisions.
13.6 APPLICATIONS OF BI
BI serves in different areas, such as financial, statistical, sales, CRM, etc. The following are the different
areas where the BI is used:
Application of BI in Financial analysis: The process of evaluating the efficiency and sustainability
of organisations, programs, budgets and other financial processes is known as financial analysis.
Financial analysis is commonly used to determine if a business is stable, solvent, liquid or lucrative
enough to merit monetary investment.
Financial analysis is used to assess economic trends, formulate financial policy, create long-term
business goals and select projects or firms for investment. This is accomplished by combining
financial figures and data. A financial analyst will extensively review the income statement, balance
sheet and cash flow statement of a corporation. Financial analysis may be carried out in both
corporate finance and investment finance contexts. One of the most popular methods of analysing
financial data is to compute ratios from the data in the financial statements and compare them to
those of other firms or the company’s own historical performance.
Statistical analysis: The collecting and evaluation of data to discover patterns and trends are
known as statistical analysis. It’s a part of data analytics. Statistical analysis may be used for a
variety of purposes, including obtaining research interpretations, statistical modeling and creating
surveys and studies. It can also be beneficial for business intelligence firms that must deal with big
amounts of data.
Statistical analysis in the context of business intelligence (BI) entails collecting and examining every
data sample in a collection of objects from which samples may be derived. In statistics, a sample is
a representative selection derived from a larger population.
The purpose of statistical analysis is to discover patterns. A retail company, for example, can utilise
statistical analysis to discover trends in unstructured and semi-structured consumer data that can
be used to improve the customer experience and enhance sales.
Statistical analysis may be divided into five separate steps, which are as follows:
Describe the type of data that will be studied.
Investigate the data’s relationship to the underlying population.
Develop a model to explain your knowledge of how the data connects to the larger population.
Demonstrate (or refute) the model’s validity.
Use predictive analytics to conduct scenarios that will assist in guiding future actions.
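A minimal sketch of the descriptive first steps, in SQL: assuming a hypothetical sales(sale_date, product_category, amount) table, monthly totals per category are a simple way to surface the kind of trends mentioned above before any formal model or predictive step is applied (DATE_TRUNC is PostgreSQL-flavoured):
-- Monthly transaction counts and revenue per category: a basic descriptive view
-- of the data before modelling or predictive analytics.
SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       product_category,
       COUNT(*)    AS transactions,
       SUM(amount) AS revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date), product_category
ORDER BY sale_month, revenue DESC;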
Sales Analysis: Let’s start with a definition of sales analysis. A sales analysis is a comprehensive
study that provides a more in-depth insight into a company’s sales performance, client data and
income. In essence, you may measure and get insights on indicators such as present and historical
sales, upcoming patterns and a plethora of others that are important to your business.
A sales analysis does not have to consist of dull figures or dense paragraphs; it can be represented
graphically using bar graphs and charts. Other KPIs to consider include regional sales, average
purchase value, sales per rep, quote-to-closure, cannibalisation rate, sell-through rate, lead conversion
rate, product performance, sales to date, sales opportunities, sales objectives and sales growth.
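One of these KPIs, sales growth, reduces to simple arithmetic. The sketch below computes month-over-month growth from a hypothetical monthly_sales(month, total_sales) table using the standard LAG window function; the schema is an assumption for illustration only:
-- Month-over-month sales growth as a percentage of the previous month.
SELECT month,
       total_sales,
       ROUND(100.0 * (total_sales - LAG(total_sales) OVER (ORDER BY month))
             / NULLIF(LAG(total_sales) OVER (ORDER BY month), 0), 1) AS growth_pct
FROM monthly_sales
ORDER BY month;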
Customer relationship management (CRM): Customer Relationship Management (CRM) refers to
the system of building and nurturing a relationship with customers. It involves the services offered
by the company to its customers. Successful businesses depend on identifying the needs of their
customers and fulfilling them with superior products and services.
Organisations conduct market research to collect data about the demographics, needs and buying
habits of customers. After analysing this data, organisations can identify the real requirements
of the customers. This helps them in achieving a competitive advantage and ensuring customer
satisfaction.
Most organisations believe that CRM is all about utilising customers’ data for better marketing.
Organisations often flood customers with advertisements about their products and services without
even knowing their actual requirements. This may upset existing as well as potential customers,
who may refrain from having any further transactions with the organisation. In such a scenario,
organisations require deeper insights and intelligence to improve customer satisfaction. This
knowledge should be applied to understanding the customer’s objective and matching it with the
business objectives, both of which are integral to the success of a business.
Figure 2 shows the uses of a CRM system in an organisation:
[Figure 2: A CRM system draws on customer details, consumer feedback and behaviour, and personal and service details to support marketing, production and financial management]
A CRM system is made by stitching these modules together. The following points briefly explain how
CRM is linked to BI:
Guidance: BI often provides advice to the CRM department. Suppose a sales manager needs to
determine how to sell a specific product to a particular customer. In that case, he or she can use a
BI tool to revise the CRM guidelines that assist sales representatives in communicating with
customers.
Documentation: Sales managers document CRM data and combine it with their BI data. For
example, a sales manager can collect data about different types of customers who purchased
certain products and services. In addition, he/she can analyse this data to understand the best
way of selling products or services to similar customers in the future.
Provides software and practices for sales managers: CRM usually provides the software
and practices that sales managers use to keep track of customer accounts. Information on the
requirements, sales histories, feedback and contacts of customers might be fed into the CRM
accounts. BI also helps in the improvement of customer relationships with the help of tools such
as data mining, decision support systems, etc., which enable the analysis of correct information
on customer behavior. These systems also help managers to make informed business decisions.
Inter-dependence: BI and CRM are dependent on one another as sales professionals often
require information about inter-departmental activities. For example, while working with a new
customer, a salesperson might discover that there is no past information to assist him/her in
determining the most appropriate selling point to use with the customer. In such a scenario,
the sales professional can use BI to understand how to approach the customer on the basis of
data related to other similar customers who are from the same industry, background, income
bracket or region.
Launch of advertising and marketing campaigns: Another useful association between CRM
and BI facilitates the advertising and marketing department to launch successful campaigns.
Marketing professionals can use CRM data to understand which products or services might
appeal to a particular kind of customer. For example, after researching CRM data, a marketing
professional for a publishing house can use BI tools to know which publications are preferred
by individuals.
Generate new intelligence: Many professionals comprehend another important association
between CRM and BI in the sense that CRM data can be used to generate new intelligence.
For example, a sales manager may recognize the fact that a particular product can be sold
successfully to people of a certain area. In such a situation, he or she will realize that this data is
valuable for BI. In the future, the sales manager might use this new BI insight to direct sales and
marketing more effectively.
The balanced scorecard is a statement that evaluates managerial performance in a structured manner.
Key Performance Indicators (KPIs) are used to assess the executive team’s progress in implementing the
plan and achieving the set objectives. To judge the business’s growth rate and how it compares to its
competitors, performance is measured against established objectives.
Other employees further up in the organisational structure may use the balanced scorecard to
showcase their value to an organisation’s growth or their suitability for promotions and pay raises.
The fundamental elements of a balanced scorecard are an emphasis on a critical area relevant to the
company and the use of both financial and non-financial data to set goals.
[Figure: The balanced scorecard links the financial, customer, business processes and organisational capacity perspectives to business outcomes]
From the customer perspective, the organisation must venture outside of its comfort zone and see
itself through the eyes of the client rather than only from the inside.
Increasing customer satisfaction, boosting the shopping experience for customers and altering the
costs of a company’s primary products are some of the techniques that an organisation can use to
enhance its brand value among customers.
Organisational capacity viewpoint: Assessments and comments from the employees are developed
to determine upgrading skills and information gaps. Personnel must be trained on the most up-to-
date techniques and equipment to achieve top performance.
Employees are provided opportunities to learn and grow. The assessment depends on the
number of certificates received and the number of weeks of training completed.
Organisational performance is crucial in attaining positive outcomes through optimising goals and
objectives. Employees in the organisation’s departments are supposed to meet high standards in the
context of uniformity, corporate culture, strategic alignment and technical skills.
For the organisation to meet management's goals, an appropriate structure is needed. For example,
the company should use cutting-edge technologies to transform activities and keep processes flowing
smoothly.
Business analysts may use decision modeling to create and share detailed documentation of how their
business decisions are intended to function. A decision model is made up of two views:
Decision requirements view: The Decision requirements view explains how complicated decisions
are broken down into simpler sub-decisions and what data and business knowledge are needed to
support them. It depicts the decision’s hierarchical structure.
Decision requirements diagrams are an effective tool for demonstrating the integration of business
rules, analytics, optimisation models, business know-how and source data inside an architecture.
They make all of this easier by offering a framework for merging them into a cohesive whole. Subject
matter specialists believe that developing this point of view is an effective approach to eliciting and
solidifying their understanding of critical decision-making issues.
Decision logic view: The decision logic view defines decision logic. It frequently, but not always, uses
decision tables to depict how decisions determine their outcomes. This enables business analysts
to articulate how decisions are made in their firms. Decision models can also express logic in other
formats (for example, decision trees or text) or connect with other decision-making methods (e.g.,
analytics, cognitive models and optimisation engines).
Decision models are about more than simply requirements. Because of the precision of the decision
logic view, decision models may be executed directly. As a result, they can be evaluated
against real-world business data and even put into production systems without requiring developers
to transform them into code. This model-driven decision architecture lowers mistakes and allows
decision-making to change quickly; it provides safe agility for business decision-making and
guarantees that the decision-making specification is always clear and consistent with the
implementation.
Insufficient reason criterion: This criterion assumes that, without any prior experience, it is impossible
or unnecessary to assign probabilities to the states of the situation. In this case, equal probabilities are
allocated to all states because there is no reason for assigning different ones. This approach is known as
the insufficient reason criterion or the Laplace criterion.
13.11 OPTIMISATION
In Business Intelligence, business optimisation is the process of increasing an organisation's efficiency,
productivity and performance. This is true for both internal activities and external products. Big data
and connected technologies may be utilised to drive new ideas that promote business optimisation,
helping firms to achieve greater success. Improved data modeling, machine learning and artificial
intelligence are assisting firms in identifying best-fit digital solutions and novel techniques that promote
improvement.
Ethical problems arise from the adoption of BI methodologies, data mining methods and
decision support systems, and they should not be ignored. Although the progress in information technology
and knowledge has generated countless opportunities, it has also raised various risks related to data
security. These risks can be prevented and avoided by implementing strict and appropriate control rules
and mechanisms. Improper usage of data by public and private organisations, which do not respect the
right to privacy of individuals, should not be tolerated.
These rules also restrict organisations from taking exclusive and unilateral benefits at the expense of
consumers and workers. However, the possibility of crossing the boundary between the corrective and
intrusive usage of information always remains high, particularly in the field of relational marketing
and Web mining. For example, it is not ethical for organisations to use and circulate any type of
personal information of individuals gathered through websites, links or cookies, with other commercial
organisations without the consent of users. Apart from the respect for the right to privacy, the social
responsibility of enterprises is another ethical issue that has been under discussion for the past few years.
There may be a difference in opinion on whether a company should focus on the short-term maximization
of profits, acting solely in the interest of shareholders or on the social consequences of its decisions.
Resolving conflicts related to the use of data in BI requires acknowledgment, discussion and the hard
work of defining ethics-based policies and developing a culture of ethical conduct.
Business intelligence (BI) is the process of transforming data from an organisation into insights that
can be utilised to guide business decisions.
Business intelligence (BI) is a collection of procedures, structures and technology that transform
raw data into useful information that drives lucrative business activities.
Business intelligence ideas relate to the use of digital computing technologies like data warehousing,
analytics and visualisation to locate and analyse critical business-based data to develop new,
actionable corporate insights.
A business intelligence system gives decision-makers information and knowledge gathered from
data.
BI comprises a broad category of applications and techniques that are used for collecting, storing,
analysing and providing easy access to data to help business users in making better business
decisions.
Customer Relationship Management (CRM) refers to the system of building and nurturing a
relationship with customers.
A business result pattern is defined as a set of generic solutions that can be created and used in a
particular circumstance (a context) to solve one or more of the underlying difficulties.
Ranking analysis is a straightforward job rating approach in which positions are evaluated based
on an informal appraisal of their overall relevance to the firm.
A balanced scorecard is a management planning framework that firms use to prioritise their
goods, projects and operations, communicate their purposes or objectives and organise their daily
operations.
Decision Modeling (DM) is a subset of decision management that focuses on expressing business
choices using standardised notation for usage by business analysts and subject matter experts
rather than developers.
Ethics can be defined as the challenge of determining the right from wrong and good from the bad.
13.14 GLOSSARY
Business intelligence (BI): The process of transforming data from an organisation into insights that
can be utilised to guide business decisions.
Customer Relationship Management: The system of building and nurturing a relationship with
customers.
Balanced scorecard: A management planning framework that firms use to prioritise their goods,
projects and operations, communicate their purposes or objectives and organise their daily
operations.
Decision modeling: A subset of decision management that focuses on expressing business choices
using standardised notation for usage by business analysts and subject matter experts rather than
developers.
Ethics: The challenge of determining the right from wrong and good from the bad.
3. A balanced scorecard is a management planning framework that firms use to prioritise their
goods, projects and operations, communicate their purposes or objectives and organise their
daily operations. Organisations use the scorecard to study and analyse their programmes'
effectiveness and decide how successful they were. Refer to Section Balanced Scorecard
4. Ethics can be defined as the challenge of determining the right from wrong and good from the bad.
BI has raised many ethical questions regarding how to gather information and how to utilise it.
We are moving towards an era in which every BI program should actively manage ethics. Refer to
Section Ethics and Business Intelligence
5. BI serves different areas, such as finance, statistics, sales and CRM. The following are the
different areas where BI is used:
Application of BI in Financial analysis: The process of evaluating the efficiency and
sustainability of organisations, programs, budgets and other financial processes is known as
financial analysis. Financial analysis is commonly used to determine if a business is stable,
solvent, liquid or lucrative enough to merit monetary investment.
Refer to Section Applications of BI
https://www.ibm.com/topics/business-intelligence
http://ijariie.com/AdminUploadPdf/BUSINESS_INTELLIGENCE__CONCEPTS__COMPONENTS_
TOOLS TECHNIQUES BENEFITS_AND_CHALLENGES_ijariie5711.pdf
Discuss with your friends and classmates the concept of business intelligence. Also, discuss the
architecture and process of BI.
UNIT 14: Databases on the Web and Semi Structured Data
Names of Sub-Units
Web Interfaces to the Web, Overview of XML: Structure of XML Data, Document Schema, Querying
XML Data, Storage of XML Data, XML Applications, Semi Structured Data Model, Implementation
Issues, Indexes for Text Data, Lab Exercise
Overview
The unit begins by discussing the concept of Web interface to the Web. Next, the unit discusses the
concept of XML. Further, the unit explains the structure of XML data, document schema, querying
XML data, storage of XML data, XML applications. The unit also explains the semi structured data
model. The unit then discusses the implementation issues. Towards the end, the unit discusses the
indexes for Text Data.
Learning Objectives
Learning Outcomes
https://matfuvit.github.io/UVIT/predavanja/literatura/TutorialsPoint%20XML.pdf
14.1 INTRODUCTION
A Web database is a database application that can be administered and accessed through the Internet.
Website administrators may manage this data collection and display analytical findings based on
the data in the Web database application. Web databases debuted in the 1990s and have proven to
be a valuable tool for organisations, allowing them to collect virtually unlimited amounts of data from
an unlimited number of clients.
Semi-structured data is a hybrid of structured and unstructured data that combines the best of both
worlds. It also adheres to a specific schema, is consistent and exists to save up space and provide clarity.
Semi-structured documents include CSV, XML and JSON. NoSQL databases are widely used to manage
semi-structured data.
To design a web interface, you must first realise that it is configured by default to function only over
HTTP (hypertext transfer protocol). You may also set your web interface to function over HTTPS
(secure hypertext transfer protocol). This provides you with the necessary method for safeguarding
your online management traffic. With your website interface, you may utilise the majority of common
web browsers, including Firefox, Internet Explorer and Chrome.
Web interface is a method that allows you to interact with the program or material that you are seeing
through your web browser on a distant server. Your web server downloads the content of your web page,
allowing you to interact with it using your browser. Your browser fulfils the role of a client. Because of
the distributed structure of your web browser, you can save your material on a remote server in addition
to gaining quick content access. Because of these types of connections, a massive quantity of data is now
available. Webmail, instant chat, online shopping, online document sharing and social networking are
among the most commonly used web apps.
Unlike HTML tags, which indicate how to show the data, XML tags identify the data and are used to store
and organise it rather than determining how to display it. XML will not replace HTML in the foreseeable
future, but it will provide new opportunities by incorporating many effective HTML elements.
XML has three key qualities that make it helpful in a wide range of systems and solutions:
XML is extensible: XML allows you to design your own self-descriptive tags, or language, appropriate
for your application.
XML carries data but does not present it: XML allows you to store data regardless of how it will
be presented.
XML is a public standard: XML was created by the World Wide Web Consortium (W3C) and is available as
an open standard.
Let's understand the structure of an XML document in the following sections:
XML declaration
XML elements
XML attributes
XML tree
XML comments
The XML declaration statement is used to indicate that the specified document is an XML document.
Although an XML declaration is not required, it is considered good practice to include it. The XML
declaration statement is the first line in the document, which defines the XML version and
character encoding. The following code snippet shows how to define the XML declaration statement:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
The attributes of the XML declaration statement are described as follows:
Version: Represents the version of XML. This attribute must be provided in the declaration.
Encoding: Specifies the character encoding that the document uses. You can specify encoding
through meta-data (data about data) stored in a file. XML allows a document to specify its own
character-set by defining an encoding declaration inside the XML declaration. This is an optional
attribute in the XML declaration.
Standalone: Takes one of two values, yes or no. The value yes specifies that the XML document stands
on its own and does not depend on an external DTD. The value no specifies that the XML document may
rely on an external DTD or other external entity references.
An XML element is the basic building block of an XML document. Every XML document consists of XML
elements and the content embedded within these elements. A start tag and an end tag delimit an element
in an XML document. A start tag is delimited by the < and > characters and an end tag is delimited by
the </ and > characters.
A document must have a single root element, which is the top most parent element in an XML document.
Note that a root element does not have a parent element, but can have one or more child elements. The
following code snippet shows a root element:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Employees>
.
.
.
</Employees>
In the preceding code snippet, <Employees> is a root element. You can add any number of elements in
the <Employees> element. The root element is also known as the document element.
Some rules that must be considered while defining an XML element are as follows:
Element names can start with letters or the underscore (_) character, but not numbers or other
punctuation characters.
After the first character, you can use numbers and characters, such as hyphen (-) and period (.), in
the element name.
Element names cannot contain spaces.
Element names cannot contain the colon (:) character because it is a reserved character in the XML
specification. You should avoid using it in your documents unless you are working with namespaces.
Element names cannot start with the letters xml, whether in uppercase, lowercase or mixed case.
There cannot be a space after the opening character (<); the name of the element must come
immediately after it. However, there can be space before the closing character (>).
In XML, elements are divided into two basic categories, empty and nested elements. Let’s learn about
these categories in the following sections.
An empty element does not contain any content or any other element within it. It can have attributes
that help in identifying an entity. An empty element can be written without an end tag. The following
code snippet shows how to define an empty element in an XML document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items>
<item itemid="2451" />
</items>
In the preceding code snippet, the item is an empty element.
Apart from content, an XML element can also contain other elements. The elements which contain other
elements are known as nested elements.
The following code snippet shows a root element containing a nested element having the content Ambrish:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Employees>
<FirstName>Ambrish</FirstName>
</Employees>
In the preceding code snippet, the <FirstName> element is nested in the <Employees> element. The
<FirstName> element contains the content Ambrish. In XML, the elements must be properly nested,
implying the parent element has to be opened before the child element and should be closed after the
child element.
XML attributes provide additional information about elements. An attribute is a name-value pair
contained in the element's start tag, wherein the name is separated from the value by an equals sign (=).
Values of attributes are enclosed either in single quotation marks or double quotation marks.
The following code snippet shows the use of attributes:
<person sex="male" color="dark brown" >
Alan Turing
</person>
In the preceding code snippet, sex and color are the two attributes of the person element. Male is the
value of the sex attribute and dark brown is the value of the color attribute.
An XML tree represents the elements of an XML document in a tree structure. The XML tree can contain
the following elements:
Root element: Contains all other elements and the content of an XML document. In an XML document,
there can be only one root element.
Parent element: Contains other inherited elements. A parent element can contain multiple child
elements.
Child element: Refers to an element that is contained in a parent element. A child element cannot
have multiple parent elements. It can only have a single parent.
Siblings: Refer to elements that share the same parent element.
In an XML document, comments are used to specify some information or remarks. A comment is not
part of the document content and is therefore ignored by the parser. Similar to HTML, in XML a
comment begins with <!-- and ends with -->, as shown in the following code snippet:
<!--This is an Employee element-->
Comments can be written anywhere in the XML document, even before the root element. However, you
cannot write a comment inside a tag or another comment.
When uploading a file to the system, the user will utilise the Document Schema field to pick the schema
that will be associated with the document. The schema chosen will decide what metadata is necessary
as well as the format in which the metadata will be stored.
Document schemas should be tailored to your specific business operations. They are linked either to a
certain kind of document, such as Executive, Compliance or Record, or to individual document schemas,
such as Well Reports, Minutes and Packet Attachments. When adding or checking a document into the
Library, a Publisher works with these document schemas. See User Roles for additional information on
Publisher security rights.
Providing detailed schema names and descriptions will improve the document management system's
effectiveness. You cannot reuse a specific schema name after you have created it, even if you have
erased the original schema from the system, owing to FileHold's retention characteristics. Providing
distinct names for each schema also avoids confusion for system administrators and end-users.
Third parties will have new chances to offer value by integrating, converting, cleaning and aggregating
XML data. It’s not difficult to foresee that once it’s widespread, many information providers would
arrange their outward view as a repository of XML data, regardless of their internal storage systems.
The interchange of data between programs will thereafter be in XML format. So, what is the function
of a query language in this world? One may think of it as a local supplement to browsing, delivering a
more expressive “find” command over one or more retrieved documents. Alternatively, it might act as a
beefed-up version of XPointer, allowing for richer kinds of logical references to sections of documents.
Neither of these techniques of usage is really “database-like.” From the standpoint of a database, an
XML query language’s main purpose is as a tool for a structural and content-based query that allows an
application to extract precisely the information it requires from one or more XML data sources.
At this stage, we present features for an XML query language derived from its anticipated usage as a net
query language, as well as an explanation of why each is required:
Exact semantics: An XML query language should have formal semantics. The formalisation
must be adequate to support reasoning about XML queries, such as determining result structure,
equivalence and containment. Query equivalence is required for query optimisation, while query
containment is important for semantic caching or for assessing whether a pushed stream of data
may be utilised to answer a certain query.
Rewritability and optimisations: XML data is frequently created automatically from other forms,
such as relational, object-oriented and special-purpose data. As a result, such XML data will be a
view over data in other models. An XML query over that view should be translatable into the
original data’s query language, rather than having to transform the native data to XML format and
then perform a query. Alternatively, when XML data is native and processed by a query processor,
XML queries, like SQL queries over relational data, must be optimisable.
Operation of queries: An XML query language must allow the following operations: selection
(choosing a document element based on content, structure or attributes), extraction (drawing out
specific components of a document), reduction (removing selected sub-elements of an element),
combination (merging two or more elements into one) and restructuring (creating a new collection
of element instances to hold queried data). All of these operations should be possible within a single
XML query; they should not require the use of another language or several XML queries. One reason
is that an XML server may not understand the other language, forcing parts of the intended output
to be returned to the requester for further processing. Because some of these operations significantly
reduce data volumes, it is highly desirable to perform them on the server side to reduce network
demands.
XML output: An XML query should return XML results. There are several advantages to having such
a closure property. A single query can be used to construct derived databases (views). Composition
and decomposition of queries are supported. Applications cannot tell whether they are looking at base
data or a query result.
Semantics of composition: Referential transparency is required for expressions in the XML query
language. That is, the meaning of an expression should be consistent regardless of where it occurs.
Furthermore, expressions with the same result type should be permitted to exist in the same contexts.
Wherever an XML term is anticipated, an expression yielding an XML term should be permitted.
SQL, however, does not meet this latter condition, much to the cost of those who must implement
and use it.
No schema is necessary: When no schema (DTD) is known in advance, an XML query language
should be useable on XML data. Because XML data is inherently self-describing, an XML query
should be able to depend on such “just-in-time” schema knowledge in its evaluation. This means
that XML queries may be performed against an XML source even if the exact structure of its pages
is unknown.
Make use of available schema: When DTDs for a data source are given, it should be feasible to
determine whether an XML query is appropriately structured in relation to the DTDs and to compute a DTD
for the result. This functionality detects issues at build time rather than run time and provides a
more convenient interface for programs that process query results.
Maintain order and association: If necessary, XML queries should be able to retain the order and
relationship of items in XML data. The order of items in an XML document might include crucial
information, which a query should not lose. Similarly, the arrangement of sub-components inside
elements is frequently important. For instance, suppose an XML query extracts.
Embedding with XML mutually: XML queries and XML should be mutually embeddable. That is, an
XML query should be capable of containing arbitrary XML data and an XML document should be
capable of including arbitrary queries. The latter feature enables XML documents to include both
stored and virtual data. The former enables an XML query to contain arbitrary constants and allows
for partial evaluation of XML queries. Partial evaluation is helpful in a distributed context when
data from one source is delivered to another and combined with data from that source.
Support for new datatypes: An XML query language should have an extension mechanism for
data type-specific conditions and operations. One example is the use of specialised processes to pick
various types of multimedia information.
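Many of these operations are already visible in relational engines that implement the SQL/XML standard discussed later in this unit. As a rough, PostgreSQL-flavoured sketch, XMLTABLE evaluates an XPath expression against stored XML, combining selection and extraction in a single query; the orders table and its columns are assumptions used only for illustration:
-- Extract item rows from an XML column and select only large quantities.
SELECT o.order_id,
       x.item_name,
       x.qty
FROM orders o,
     XMLTABLE('/order/item' PASSING o.doc
              COLUMNS item_name TEXT PATH 'name',
                      qty       INT  PATH 'quantity') AS x
WHERE x.qty > 2;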
Nonrelational Database
In nonrelational data-storage systems, there are numerous options for preserving XML data:
Store in flat files: Because XML is basically a file format, a flat file is a suitable storage technique.
Many of the disadvantages of using file systems as the foundation for database applications
apply to this technique: it lacks data isolation, atomicity, concurrent access and security, among
other things. However, due to the widespread availability of XML tools that work with file data,
accessing and querying XML data stored in files is quite simple. As a result, for some applications,
this storage format may suffice.
Create an XML database: Databases that employ XML as their primary data model are
known as XML databases. Initially, XML databases used a C++-based object-oriented database to
store the DOM. This enables the utilisation of most of the object-oriented database infrastructure
while maintaining a standard XML interface. Declarative querying is enabled with the inclusion of
XQuery or other XML programming languages. Other systems have developed the complete XML
storage and querying architecture on top of a transactional storage manager.
Relational Database
Because relational databases are commonly used in existing applications, there is a substantial
advantage to storing XML data in relational databases so that existing applications can access
the data.
If the data were derived from a relational schema in the first place and XML is just used as a data
exchange format for relational data, transforming the XML data into a relational structure is
usually straightforward.
However, there are several cases in which the XML data is not created from a relational model, and it
may be difficult to convert the data to relational form for storage. Nested elements and elements that
repeat complicate the storage of XML data in relational format.
Some options for storing XML data in relational databases are as follows:
Store as string
Tree representation
Map to relations
Publishing and shredding XML data
Native storage within a relational database
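A minimal sketch of the first and last of these options, assuming a generic SQL dialect with an XML column type (the exact type name varies between DB2, Oracle, SQL Server and PostgreSQL); the table name and sample document are illustrative:
-- "Store as string": keep the whole document in a character column (TEXT/CLOB).
-- "Native storage": use the database's XML type so the value is parsed and validated.
CREATE TABLE xml_documents (
    doc_id  INT PRIMARY KEY,
    content XML        -- replace with TEXT or CLOB for plain string storage
);

INSERT INTO xml_documents (doc_id, content)
VALUES (1, '<Employee><FirstName>Ambrish</FirstName></Employee>');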
SQL/XML
While XML is commonly used for information exchange, database systems are still widely used to
store structured data, and it is common to need to convert relational data to XML format. To satisfy this
need, the SQL/XML standard defines an extension of SQL that allows the construction of nested
XML output. A standard way of mapping SQL types to XML Schema types, a standard way of
mapping relational schemas to XML schemas and extensions to the SQL query language are all included in
the standard.
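The publishing direction (relational rows out to nested XML) is handled by SQL/XML functions such as XMLELEMENT, XMLATTRIBUTES and XMLFOREST, which are part of the standard and supported, with minor variations, by several systems. A sketch over a hypothetical employees table:
-- Wrap each relational row in an <Employee> element with nested child elements.
SELECT XMLELEMENT(NAME "Employee",
                  XMLATTRIBUTES(e.emp_id AS "id"),
                  XMLFOREST(e.first_name  AS "FirstName",
                            e.designation AS "Designation")) AS employee_xml
FROM employees e;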
We’ll create an order report in the following sample XML application. This will entail some XML
processing, such as performing different XML operations (converting, parsing, validating, transforming
and publishing XML) on a variety of data sources. Figure 1 shows the order report XML application:
The following stages are involved in the development of the XML application:
1. Creating a book catalogue from a text file
2. Obtaining an order from an EDIFACT file
3. Extracting order information with XQuery
4. Using XSLT to generate an HTML order report
5. Creating an XSL-FO stylesheet with XQuery
6. Using XSL-FO to generate a PDF order report
Data is information that has been turned into binary digital form in today’s computers and transmission
devices. The volume of data created has risen dramatically over the previous decade as the number of
computer users has expanded. As a result, a new phrase has been developed to describe the massive
amount of data that is being generated at an alarming rate. It’s referred to as big data. It is not just the
volume of data that has grown over time.
The kinds of data being created are continuously growing in tandem with the volume. As a result, categorising
the sorts of data collected becomes critical. A massive quantity of data is created in this age of
the internet. This data can take the form of text, photographs, videos, documents, PDF files, log
files and many other formats.
Let us now categorise this massive quantity of data into the main categories listed below:
Structured Information: Structured data is distinguished from semi-structured data. It is
information that has been purposefully constructed to be easily searchable, quantifiable and well-
organised. It is often kept in relational databases (RDBMS) and expressed in structured query
language (SQL); a standard language established by IBM in the 1970s for interfacing with databases.
Structured data can be entered by humans or robots, but it must comply with a tight framework
with established organisational qualities. Consider a hotel database that can be searched using
parameters such as guest name, phone number, room number and others. Excel files containing
data neatly arranged into rows and columns, for example.
Semi-structured Information: Semi-structured data is a sort of structured data that does not
adhere to the tabular structure associated with relational databases or other types of data tables
but includes tags or other markers to segregate semantic pieces and impose hierarchies of records
and fields within the data. As a result, it is also known as a self-descriptive structure.
Entities of the same class may have various features while being grouped close to each other in
semi-structured data and the order of the attributes is immaterial.
Semi-structured data has grown more frequent with the advent of the internet, as full-text texts and
databases are no longer the only forms of data. A medium for information sharing is required by
many applications and semi-structured data is frequent in object-oriented databases.
Emails, for example, are semi-structured by Sender, Recipient, Subject, Date and so on or are
automatically categorised into categories like Inbox, Spam, Promotions and so on using machine
learning.
Images and videos, for example, are a cross between structured and unstructured data: they may
contain meta tags referring to the place, date or person who captured them, but the information contained
inside them is unstructured. Consider Facebook, which organises information by Users, Friends, Groups,
Marketplace and so on, but the comments and content inside these groupings are unorganised.
Because it has a somewhat higher level of organisation, semi-structured data is easier to examine than
unstructured data. It must, however, be broken down using machine learning technologies before
it can be examined without human assistance. It also incorporates quantitative data which, like
completely unstructured data, may give far more meaningful insights.
Unstructured Information: Unstructured data, on the other hand, is often open text, photos, videos
and other material with no preset arrangement or design. Consider internet reviews, documents and
other qualitative data sources on opinions and sentiments. This data is more difficult to examine,
but it can be organised to extract insights using machine learning techniques; however, it must first
be structured for computers to analyse it.
The XML Infoset and the XQuery 1.0 and XSLT 2.0 data model are examples of abstract data models; related
standards include XML Schema and declarative processing languages such as XPath, XQuery and XSLT. As a
standard syntax for information, XML can model the whole range of information, from completely
structured data (for example, bank account information)
to natural language. Having a single model for the whole spectrum of information provides enormous
advantages for modelling, storage, indexing and automatic processing. There is no need to go from
system to system and have incompatible systems communicate with one other when the amount of
structure in information is increased or decreased. Another significant benefit of XML is the ability to
represent heterogeneous content. Having an abstract information model that extends beyond the entity-
relationship model allows you to model a significant amount of data that was previously hard to model.
The fact that XML schemas are independent of data is also critical for data and schema evolution; data
can exist with or without schemas or with numerous schemas. After the data has been created, schemas
may be added; data and schema generation can be freely interleaved. While XML-based solutions
offer substantial benefits for handling semi-structured data, they are not a panacea in their current
form or isolation.
The majority of content is still not in XML format and some of it never will be. The benefits of
XML (for example, complex schemas, mixed content and schema-independent data) inevitably add complexity and
problems. Finally, XML-related technologies do not yet provide a full answer. While XSLT (Extensible
Stylesheet Language Transformations) and XQuery are powerful query and transformation languages, there is still no
accepted means of expressing imperative logic over such schema-flexible data, nor a language for defining
complicated integrity constraints and assertions.
Such restrictions will be lifted eventually, but not immediately. A broad solution to the problem of
semi-structured information will require concepts and techniques from a variety of domains, including
knowledge representation, XML and markup documents, information retrieval, the Semantic Web
and traditional data management approaches. There is no one-size-fits-all solution to this situation.
One aspect of the semi-structured data problem is already partly addressed by search engine crawlers:
basic textual search results can be improved by using contextual, semantic and structural
information.
Decoupling data from schemas has a significant influence on all areas of data processing, including
storage, indexing, querying and updating, transactional support and so on. To achieve performance for
such jobs, most existing approaches rely on static schema information. Such strategies must be revisited
to ensure their validity and performance, even in the absence of schema information or with continually
developing schema information (Daniela, 2005). Data has an uneven and incomplete structure in most
cases. Some sources contain an implicit data structure, making it difficult to determine the link between
data. Schema and data are often closely connected, which means they are not just related but also
dependent on one another. The same query may change both the schema and the data, with the schema
being updated on a regular basis. The line between schema and data is hazy or ambiguous, which
complicates the construction of data structures. When compared to structured data, storage
costs are also higher.
<EmployeeData>
<Employee>
<EmployeeFirstName>Ritesh</EmployeeFirstName>
<EmployeeLasttName>Sharma</EmployeeLasttName>
<EmployeeAddress>Hudson Lane, Delhi</EmployeeAddress>
<EmployeeDesignation>Sales Manager</EmployeeDesignation>
</Employee>
<Employee>
<EmployeeFirstName>Devesh</EmployeeFirstName>
<EmployeeLasttName>Gaud</EmployeeLasttName>
<EmployeeAddress>Laxmi Nagar, Delhi</EmployeeAddress>
<EmployeeDesignation>HR Manager</EmployeeDesignation>
</Employee>
</EmployeeData>
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="EmployeeData">
<xs:complexType>
<xs:sequence>
<xs:element name="CompanyName" type="xs:string"/>
<xs:element name="Employee" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="EmployeeFirstName" type="xs:string"/>
<xs:element name="EmployeeLasttName" type="xs:string" minOccurs="0"/>
<xs:element name="EmployeeAddress" type="xs:string"/>
<xs:element name="EmployeeDesignation" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
A Web database is a database application that can be administered and accessed through the
Internet.
Semi-structured data is a hybrid of structured and unstructured data that combines the best of
both worlds.
Web interface is a method that allows you to interact with the program or material that you are
seeing through your web browser on a distant server.
XML stands for Extensible Markup Language. It is a text-based markup language derived from the
Standard Generalised Markup Language (SGML).
XML defines certain rules for its syntax that specify how to create or structure an XML document.
The XML declaration statement is used to indicate that the specified document is an XML document.
An XML element is the basic building block of an XML document. Every XML document consists of
XML elements and the content embedded within these elements.
An attribute is a name-value pair contained in the element’s start tag, wherein names are separated
from values by the assignment operator.
An XML tree represents the elements of an XML document in a tree structure.
Document schemas are the metadata structure’s top level. They provide the Library Administrator
authority over the papers or files contributed to the library.
XML has the potential to go beyond its roots as a document markup language to become the standard
for data transfer on the Internet.
The storing of XML data is required by many applications. One method is to store XML data as
documents in a file system, while another is to create a special-purpose database for storing
XML data.
XML applications are programs that process and alter data using XML technologies such as XML,
XSLT, XQuery, XML Schema, XPath, Web services and so on.
The semi-structured data model is intended to be a development of the relational data model,
allowing data to be represented with a flexible structure.
Text indexing is a stage in the preparation of text retrieval.
14.9 GLOSSARY
Web database: A database application that can be administered and accessed through the Internet
Semi-structured data: A hybrid of structured and unstructured data that combines the best of
both worlds
Web interface: A method that allows you to interact with the program or material that you are
seeing through your web browser on a distant server
Extensible Markup Language (XML): A text-based markup language based on the Standard
Generalised Markup Language (SGML)
Text indexing: A stage in the preparation of text retrieval
https://www.google.co.in/books/edition/Learning_XML/Zilck1_0c5QC?hl=en&gbpv=1&dq=XML+book&printsec=frontcover
Discuss with friends and colleagues the concept of XML. Also, discuss the semi structured data
model.
UNIT 15: Emerging Database Trends
Names of Sub-Units
Advanced Database Models, Spatial Database, Deductive Database, Temporal Database, Multimedia
Database and Sensor Database System
Overview
The unit begins by discussing the concept of emerging database trends. Next, the unit discusses the
advanced database model and spatial database. Further, the unit explains the concept of deductive
database and temporal database. The unit also explains the multimedia database. Towards the end,
the unit discusses the sensor database system.
Learning Objectives
Learning Outcomes
http://cis.csuohio.edu/~sschung/IST331/Coronel_PPT_Ch05.pdf
15.1 INTRODUCTION
Computers have been around for a long time. Every day, new technology and projects arise. We must
keep an eye on upcoming trends to grasp present technology and have a better understanding of the
advancements around us. Almost every day, new technologies are introduced. Some of these fails and
fade away over time. Some of these innovative technologies thrive and survive over time, earning user
attention. Emerging trends are cutting-edge technologies that gain popularity and establish a new trend
among consumers. This unit will teach us about some developing trends that will have a significant
influence (in the future) on the digital economy and engagement in digital communities.
Advanced modelling can be used by the application developer to include components that are not built
into the programme by default. This can help in understanding the views, attributes, and authorisation
structure of the project.
By giving database designers a specific framework to work from, data modelling speeds up their
work and reduces the likelihood that they will make mistakes requiring modifications later on.
Effective data modelling allows businesses to make effective use of existing data, which can
lead to higher productivity, business growth and a competitive edge over competitors.
To express relational equi-joins, RDM uses the set concept, as shown in Figure 1:
[Figure 1: An RDM set linking an owner record to its member records]
Many of the disadvantages of static data files are addressed by spatial databases. Spatial databases may
store a vast quantity of data in several tables using connecting techniques that ensure data integrity.
They can impose data input constraints to limit the acquisition of conflicting data. As they expand
in size, they can span numerous physical devices and keep duplicates of themselves for redundancy.
Spatial databases may also keep track of who is making modifications to a collection of spatial
data, for permission and auditing purposes. It is critical to distinguish between
traditional databases and spatial databases: spatial databases are standard databases that have been
enhanced to handle geographic data types and queries. Spatial data types store feature geometry that
represents shape and position.
The geometry of spatial features is compressed and saved in a binary field with the attribute data
that describes the feature. Furthermore, the database application code is enhanced so that ordinary
searches using alphanumeric characters and logical operators may make use of position, proximity,
and topology. In a normal customer database, for example, a corporation may query for all clients
whose last name begins with a specific letter. In a spatial database, users may search for all consumers
within a certain radius of a business or locate clusters of customers. These kinds of spatial interactions
aren’t seen in most databases.
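The radius search just described can be written directly in SQL once the database has spatial extensions. The following PostGIS-flavoured sketch assumes hypothetical customers and stores tables with geography columns named geom; the names are illustrative only:
-- Customers within 5 km of store 42; ST_DWithin on geography uses metres.
SELECT c.customer_name
FROM customers c
JOIN stores s ON s.store_id = 42
WHERE ST_DWithin(c.geom, s.geom, 5000);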
Spatial databases existed long before the Web, but they were frequently concealed behind firewalls
or authentication, making them inaccessible to the majority of users. Spatial databases may become
targets for hackers since they are often located on powerful servers that are always linked to the
Internet. Malicious users may attempt to get unauthorised access to the database notwithstanding
user authentication and read-only access by exploiting flaws in the database or operating system
programming code. That was the case when SQL Slammer utilised a buffer overflow attack to seize
control of over 75,000 SQL servers, bringing the Internet to a halt (Microsoft, 2003).
Database options from traditional software providers like Microsoft, Oracle and IBM may all be utilised
for computing and GIS applications. Database software is sometimes too expensive to implement in
terms of both the cost of the program and the work required to implement it. Open-source software
has immensely helped GIS software, and spatial databases are no exception. Open-source software
provides various free and powerful geographic database choices for academics testing applications
on a shoestring budget. PostgreSQL, together with PostGIS, is the most widely used open-source spatial
database. A more adaptable “NoSQL” database is a new database trend.
These databases, sometimes known as "document databases," store data in a manner similar to a JSON
file. They lack the schemas that enforce data integrity in traditional databases. Schemaless
databases are fundamentally less strict and offer more flexibility in terms of the data they can hold.
This flexibility enables the storage of disparate data but also requires that queries be constructed
to deal with disparate data. MongoDB, the most popular schemaless open-source spatial database
platform, provides a growing range of spatial features.
The spatial data in the form of points, lines, polygons, etc. is used by many different databases as shown
in Figure 2:
[Figure 2: Spatial data in the form of points, lines and polygons used by databases such as Oracle, SQL Server, Sybase and DB2]
Take a look at the following deductive database: "Bill's father's name is John." "Derek's father is Bill." "IF
x is y's father AND y is z's father, THEN x is z's grandfather."
In addition to the two facts, this database includes a rule that identifies a pair of entities’ grandfather
connections.
The database can verify the assertion "John is Derek's grandfather" by substituting x = "John," y = "Bill" and
z = "Derek" into the rule. A rule of this type has the effect of defining new relations that are not explicitly
stored in the database; these are referred to as implicit or virtual relations. Explicit relations, such as the
father relation described above, are called base relations. The set of virtual relations is referred to as the
intentional database (IDB), while the set of base relations is referred to as the extensional database (EDB).
First-order logic underpins deductive database systems; with first-order logic, we can represent both facts
and rules about facts using the same vocabulary.
One language for expressing logical propositions is first-order predicate calculus. For example, the
representation of the database above in first-order logic would be: father(John, Bill), father(Bill, Derek),
(∀x ∀y ∀z)(father(x, y) ∧ father(y, z) → grandfather(x, z)), where ∧, → and ∀ represent
conjunction, implication and the quantifier "for all", respectively. A deductive database is an advanced
database that has been enhanced with an inference system. Deductive databases arose from the need
to merge logic programming with relational databases to build systems that support a sophisticated
formalism.
The immediate benefit is that it may lower the amount of space required to store data. In a deductive
database system, rules are often specified using a declarative language, that is, a language in which we
express what we want to achieve rather than how to achieve it. By interpreting these rules, an inference
engine (or deduction mechanism) within the system can deduce new facts from the database. Deductive
databases have not found widespread adoption outside academia, but some of their concepts are used
in today’s relational databases to support the advanced features of more recent SQL standards.
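As a rough sketch of that carry-over, the grandfather rule above can be expressed in a modern relational database by storing the base (extensional) relation as a table and deriving the virtual (intentional) relation with a view, while a recursive common table expression generalises the same idea to rules such as ancestor; the table and view names are illustrative:
-- Extensional database: the explicitly stored facts.
CREATE TABLE father (parent TEXT, child TEXT);
INSERT INTO father VALUES ('John', 'Bill'), ('Bill', 'Derek');

-- Intentional database: the grandfather rule as a derived (virtual) relation.
CREATE VIEW grandfather AS
SELECT f1.parent AS grandparent, f2.child AS grandchild
FROM father f1
JOIN father f2 ON f1.child = f2.parent;

-- The same idea generalised to an ancestor rule with a recursive query.
WITH RECURSIVE ancestor(anc, descendant) AS (
    SELECT parent, child FROM father
    UNION
    SELECT a.anc, f.child
    FROM ancestor a
    JOIN father f ON a.descendant = f.parent
)
SELECT * FROM ancestor;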
Temporal databases keep information about how things in the real world change over time. A database with built-in capabilities for managing time-related data is known as a temporal database; it maintains data about the past, present and future times of occurrences.
Current databases (not to be confused with currently available databases), in contrast to temporal databases, hold only those facts that are believed to be true at the present moment.
Temporal databases facilitate the management of and access to temporal data by including one or more of the following features:
A time period data type, including the ability to describe time periods that have no end date (infinity or forever).
The ability to define valid-time and transaction-time period attributes, as well as bitemporal relations.
Transaction time that is maintained by the system.
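The following is a minimal sketch of the valid-time idea only; transaction time and bitemporal support are omitted, and the relation and field names are hypothetical rather than taken from the text.

# Minimal valid-time sketch: each row records when a fact held in the real world.
# None as valid_to stands for the open-ended period ("until further notice").
from datetime import date

salary_history = [
    # (employee, salary, valid_from, valid_to)
    ("Asha", 50_000, date(2020, 1, 1), date(2022, 6, 30)),
    ("Asha", 60_000, date(2022, 7, 1), None),  # current fact, no end date
]

def salary_as_of(rows, employee, when):
    """Return the salary that was valid for `employee` on date `when`."""
    for emp, salary, start, end in rows:
        if emp == employee and start <= when and (end is None or when <= end):
            return salary
    return None

print(salary_as_of(salary_history, "Asha", date(2021, 3, 15)))  # 50000
print(salary_as_of(salary_history, "Asha", date(2023, 1, 1)))   # 60000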
A multimedia database contains the multimedia data itself as well as information about it. The details are provided as follows:
Media data: Refers to the multimedia data itself, which includes images, video, audio, animation, etc.
Media format data: The media format data includes structuring details for the media data, including
sampling rate, refresh rate and encoding technique, among others.
Media keyword data: This includes keyword information relevant to the database’s media content. The keyword data for an image, for example, may include the date and place it was captured and a short summary; for a video, it may also include the duration.
Media feature data: The media feature data provide a description of the multimedia data’s
characteristics. The colours and patterns in an image are examples of feature data for that image.
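As a rough illustration of how these four kinds of information can sit alongside one another, here is a sketch using a Python dataclass; the field names are illustrative only and are not drawn from the text.

# Sketch of a multimedia database record holding the four kinds of information
# described above. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ImageRecord:
    media_data: bytes                                    # the raw media itself
    media_format: dict = field(default_factory=dict)     # e.g. encoding, resolution
    media_keywords: list = field(default_factory=list)   # e.g. date, place, summary
    media_features: dict = field(default_factory=dict)   # e.g. dominant colours, patterns

record = ImageRecord(
    media_data=b"\x89PNG...",  # raw bytes (truncated here)
    media_format={"encoding": "PNG", "width": 1920, "height": 1080},
    media_keywords=["2023-05-01", "campus", "library entrance"],
    media_features={"dominant_colours": ["grey", "green"]},
)
print(record.media_format["encoding"])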
It is important to note that time is a central factor. Signal-processing routines may return output repeatedly over time, with each output carrying a timestamp. Furthermore, monitoring queries impose constraints on sensor data timestamps; for example, Query 3 in Example 1 implies that abnormal temperatures are identified either immediately or within a specific time range. Aggregates over time periods, such as Queries 4 and 5, make explicit use of time.
Given these constraints, we describe sensor data as a time series. Our representation of sensor time series is based on the sequence model of Seshadri et al. A sequence is described informally as a 3-tuple consisting of a set of records R, a countable totally ordered domain O (referred to as the ordering domain; the elements of the ordering domain are referred to as positions), and an ordering of R by O. An ordering of a set of records R by an ordering domain O is defined as a relation between O and R in which each record in R is associated with some position in O. Sequence operators are n-ary mappings on sequences: they operate on a certain number of input sequences to produce a single output sequence. All sequence operators are computable.
Sequence operators include select, project, compose (a natural join on the position) and aggregates over a set of positions. Due to space limitations, we advise learners to read a formal treatment of sensor time-series data elsewhere.
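As a small, informal sketch of the sequence model just described (not the formal definition the text refers to), the following treats a sensor reading as a record attached to a position in an ordered time domain and implements select, project and compose; the sample readings and field names are made up for illustration.

# Informal sketch of the sequence model: records ordered by positions (timestamps).
# Operators select, project and compose (natural join on the position) are shown.
temperature = {  # position -> record
    1: {"sensor": "s1", "temp": 21.5},
    2: {"sensor": "s1", "temp": 35.2},
    3: {"sensor": "s1", "temp": 22.1},
}
humidity = {
    1: {"sensor": "s1", "hum": 40},
    2: {"sensor": "s1", "hum": 38},
}

def select(seq, predicate):
    """Keep only the records at positions where the predicate holds."""
    return {p: r for p, r in seq.items() if predicate(r)}

def project(seq, *fields):
    """Keep only the named fields of each record."""
    return {p: {f: r[f] for f in fields} for p, r in seq.items()}

def compose(a, b):
    """Natural join on the position: merge records that share a position."""
    return {p: {**a[p], **b[p]} for p in a.keys() & b.keys()}

abnormal = select(temperature, lambda r: r["temp"] > 30)  # "abnormal temperature" style query
print(abnormal)                                           # {2: {'sensor': 's1', 'temp': 35.2}}
print(compose(project(temperature, "temp"), project(humidity, "hum")))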
Emerging trends are cutting-edge technologies that gain popularity and establish new patterns of use among consumers.
A data model is similar to the plan an engineer creates for a building.
Advanced modelling can be used by the application developer to include components that are not
built into the programme by default.
The relational model, in which all data is specified in terms of tables and columns, is the most widely
known and utilised data model today.
A network model exists beneath the relational model in an RDM database, where all data is specified
in terms of record types and fields.
Spatial data is linked to geographic places such as cities, villages and so on.
A deductive database contains not just facts but also general principles, which may be used to derive
new facts that are not explicitly reflected in the database; data can therefore be stored implicitly.
A temporal database contains information about certain points in time.
The term “multimedia database” refers to a set of associated multimedia content that may comprise
text, graphics, photos, animations, video, audio and other types of data from numerous sources.
A sensor database contains both stored data and sensor data. Stored data comprise the set of sensors that participate in the sensor database, as well as sensor attributes (e.g., location) and physical environment parameters.
15.9 GLOSSARY
Relational model: A data model in which all data is specified in terms of tables and columns; it is the most widely known and utilised data model today
Spatial database: A database designed to store and retrieve data that represent objects with a geometric definition
Deductive database: It contains not just facts but also general principles, which may be used to
derive new facts that are not explicitly reflected in the database; data can therefore be stored
implicitly
Transaction time: It refers to when a fact was recorded in the database.
Sensor database: It contains both stored and sensor data.
https://www.safe.com/blog/2021/11/7-spatial-databases-enterprise/
https://people.cs.pitt.edu/~chang/231/y08/y08sem/semkul/00621580.pdf
Discuss advanced data models with your friends and colleagues. Also, discuss the concepts of spatial, deductive, temporal and multimedia databases.