02 Normalization

Download as pdf or txt
Download as pdf or txt
You are on page 1of 82

Computer Science Faculty

Information Systems Department

Data warehousing & BI


Abdul Rahman Safi
Rafiullah Momand
With Materials taken from Dr. Marcela Charfuelan, Dr. Ahsan Abdullah, Dr. Michael
Mannino, Dr.Jahangir Karimi
Normalization
Objectives
• Purpose of normalization
• Problems associated with redundant data
• Identification of various types of update anomalies such as insertion,
deletion, and modification anomalies
• How to recognize appropriateness or quality of the design of relations
• De-normalization, its techniques, and Issues

Information Systems Department 3


Objectives
• How functional dependencies can be used to group attributes into
relations that are in a known normal form?
• How to undertake process of normalization?
• How to identify most commonly used normal forms, namely 1NF, 2NF,
3NF, and Boyce–Codd normal form (BCNF)?
• How to identify fourth (4NF) and fifth (5NF) normal forms?
• What is De-normalization?
• What are the well known De-normalization techniques?

Information Systems Department 4


Normalization
• Main objective in developing a logical data model for relational
database systems is to create an accurate representation of the data,
its relationships, and constraints.
• To achieve this objective, must identify a suitable set of relations.
• Four most commonly used normal forms are first (1NF), second (2NF)
and third (3NF) normal forms, and Boyce–Codd normal form (BCNF).
• Based on functional dependencies among the attributes of a relation.
• A relation can be normalized to a specific form to prevent possible
occurrence of update anomalies.

Information Systems Department 5


Data Redundancy
• Major aim of relational database design is to group attributes into
relations to minimize data redundancy and reduce file storage space
required by base relations.

• Problems associated with data redundancy are illustrated by


comparing the following Staff and Branch relations with the
StaffBranch relation.

Information Systems Department 6


Data Redundancy and anomalies

Information Systems Department 7


Data Redundancy
• StaffBranch relation has redundant data: details of a branch are
repeated for every member of staff.

• In contrast, branch information appears only once for each branch in


Branch relation and only branchNo is repeated in Staff relation, to
represent where each member of staff works.

Information Systems Department 8


Update Anomalies
• Relations that contain redundant information may potentially suffer
from update anomalies.
• Types of update anomalies include:
• Insertion
• Deletion
• Modification.

Information Systems Department 9


Lossless-join and Dependency Preservation Properties

• Two important properties of decomposition:


- Lossless-join property enables us to find any instance of original
relation from corresponding instances in the smaller relations.
- Dependency preservation property enables us to enforce a
constraint on original relation by enforcing some constraint on each
of the smaller relations.

Information Systems Department 10


Functional Dependency
• Main concept associated with normalization.
• Functional Dependency
• Describes relationship between attributes in a relation.
• If A and B are attributes of relation R, B is functionally dependent on A
(denoted A B), if each value of A in R is associated with exactly one value of
B in R.

Information Systems Department 11


Functional Dependency
• Property of the meaning (or semantics) of the attributes in a
relation.
• Diagrammatic representation:

 Determinant of a functional dependency refers to


attribute or group of attributes on left-hand side
of the arrow.

Information Systems Department 12


Example - Functional Dependency

Information Systems Department 13


Functional Dependency
• Main characteristics of functional dependencies used in
normalization:
• have a 1:1 relationship between attribute(s) on left and right-hand side of a
dependency;
• hold for all time;
• are nontrivial.

Information Systems Department 14


Functional Dependency
• Complete set of functional dependencies for a given relation can be
very large.
• Important to find an approach that can reduce set to a manageable
size.
• Need to identify set of functional dependencies (X) for a relation that
is smaller than complete set of functional dependencies (Y) for that
relation and has property that every functional dependency in Y is
implied by functional dependencies in X.

Information Systems Department 15


Functional Dependency
• Set of all functional dependencies implied by a given set of functional
dependencies X called closure of X (written X+).

• Set of inference rules, called Armstrong’s axioms, specifies how new


functional dependencies can be inferred from given ones.

Information Systems Department 16


Functional Dependency
• Let A, B, and C be subsets of the attributes of relation R. Armstrong’s
axioms are as follows:
(1) Reflexivity
If B is a subset of A, then A → B
(2) Augmentation
If A → B, then A,C → B,C
(3) Transitivity
If A → B and B → C, then A → C

Information Systems Department 17


Functional Dependency
(4) Self-determination:
A→A
(5) Decomposition:
If A → B,C, then A → B and A → C
(6) Union:
If A → B and A → C, then A → B,C
(7) Composition:
If A → B and C → D then A,C → B,D

Information Systems Department 18


The Process of Normalization
• Formal technique for analyzing a relation based on its primary key
and functional dependencies between its attributes.

• Often executed as a series of steps. Each step corresponds to a


specific normal form, which has known properties.

• As normalization proceeds, relations become progressively more


restricted (stronger) in format and also less vulnerable to update
anomalies.

Information Systems Department 19


Relationship Between Normal Forms

Information Systems Department 20


Un-normalized Form (UNF)
• A table that contains one or more repeating groups.

• To create an un-normalized table:


• transform data from information source (e.g. form) into table format with
columns and rows.

Information Systems Department 21


First Normal Form (1NF)
• A relation in which intersection of each row and column contains one
and only one value.

Information Systems Department 22


UNF to 1NF
• Nominate an attribute or group of attributes to act as the key for the
unnormalized table.

• Identify repeating group(s) in unnormalized table which repeats for


the key attribute(s).

Information Systems Department 23


UNF to 1NF
• Remove repeating group by:
• entering appropriate data into the empty columns of rows containing
repeating data (‘flattening’ the table).
Or by
• placing repeating data along with copy of the original key attribute(s) into a
separate relation.

Information Systems Department 24


Second Normal Form (2NF)
• Based on concept of full functional dependency:
• A and B are attributes of a relation,
• B is fully dependent on A if B is functionally dependent on A but not on any
proper subset of A.

• 2NF - A relation that is in 1NF and every non-primary-key attribute is


fully functionally dependent on the primary key.

Information Systems Department 25


1NF to 2NF
• Identify primary key for the 1NF relation.

• Identify functional dependencies in the relation.

• If partial dependencies exist on the primary key remove them by


placing them in a new relation along with copy of their determinant.

Information Systems Department 26


Third Normal Form (3NF)
• Based on concept of transitive dependency:
• A, B and C are attributes of a relation such that if A → B and B → C,
• then C is transitively dependent on A through B. (Provided that A is not
functionally dependent on B or C).
• 3NF - A relation that is in 1NF and 2NF and in which no non-primary-
key attribute is transitively dependent on the primary key.

Information Systems Department 27


2NF to 3NF
• Identify the primary key in the 2NF relation.

• Identify functional dependencies in the relation.

• If transitive dependencies exist on the primary key remove them by


placing them in a new relation along with copy of their determinant.

Information Systems Department 28


General Definitions of 2NF and 3NF
• Second normal form (2NF)
• A relation that is in 1NF and every non-primary-key attribute is fully
functionally dependent on any candidate key.

• Third normal form (3NF)


• A relation that is in 1NF and 2NF and in which no non-primary-key attribute is
transitively dependent on any candidate key.

Information Systems Department 29


Boyce–Codd Normal Form (BCNF)
• Based on functional dependencies that take into account all
candidate keys in a relation, however BCNF also has additional
constraints compared with general definition of 3NF.

• BCNF - A relation is in BCNF if and only if every determinant is a


candidate key.

Information Systems Department 30


Boyce–Codd normal form (BCNF)
• Difference between 3NF and BCNF is that for a functional dependency
A  B, 3NF allows this dependency in a relation if B is a primary-key
attribute and A is not a candidate key.

• Whereas, BCNF insists that for this dependency to remain in a


relation, A must be a candidate key.

• Every relation in BCNF is also in 3NF. However, relation in 3NF may


not be in BCNF.

Information Systems Department 31


Boyce–Codd normal form (BCNF)
• Violation of BCNF is quite rare.

• Potential to violate BCNF may occur in a relation that:


• contains two (or more) composite candidate keys;
• the candidate keys overlap (ie. have at least one attribute in common).

Information Systems Department 32


Review of Normalization (UNF to BCNF)

Information Systems Department 33


Review of Normalization (UNF to BCNF)

Information Systems Department 34


Review of Normalization (UNF to BCNF)

Information Systems Department 35


Review of Normalization (UNF to BCNF)

Information Systems Department 36


Fourth Normal Form (4NF)
• Although BCNF removes anomalies due to functional dependencies,
another type of dependency called a multi-valued dependency (MVD)
can also cause data redundancy.

• Possible existence of MVDs in a relation is due to 1NF and can result


in data redundancy.

Information Systems Department 37


Fourth Normal Form (4NF) - MVD
• Dependency between attributes (for example, A, B, and C) in a
relation, such that for each value of A there is a set of values for B and
a set of values for C. However, set of values for B and C are
independent of each other.

Information Systems Department 38


Fourth Normal Form (4NF)
• MVD between attributes A, B, and C in a relation using the following
notation:
A -->> B
A -->> C

Information Systems Department 39


Fourth Normal Form (4NF)
• MVD can be further defined as being trivial or nontrivial.
• MVD A -->> B in relation R is defined as being trivial if (a) B is a subset of A or (b)
A  B = R.
• MVD is defined as being nontrivial if neither (a) nor (b) are satisfied.
• Trivial MVD does not specify a constraint on a relation, while a nontrivial MVD
does specify a constraint.

Information Systems Department 40


Fourth Normal Form (4NF)
• Defined as a relation that is in BCNF and contains no nontrivial MVDs.
• 4NF Example:

Information Systems Department 41


Fifth Normal Form (5NF)
• A relation decomposed into two relations must have lossless-join
property, which ensures that no spurious tuples are generated when
relations are reunited through a natural join.

• However, there are requirements to decompose a relation into more


than two relations.

• Although rare, these cases are managed by join dependency and fifth
normal form (5NF).

Information Systems Department 42


Fifth Normal Form (5NF)
• A relation that has no join dependency.

Information Systems Department 43


5NF - Example

Information Systems Department 44


Other Normal Forms
• There are other NFs that are more of academia interest including:
• 6th NF
• EKNF
• RFNF
• SKNF
• DKNF

Information Systems Department 45


Normalization Exercise

Information Systems Department 46


De-Normalization
Striking Between “Good” and “Evil”
De-normalization Normalization
Too many tables
4+ Normal Forms

3rd Normal Form

2nd Normal Form

Data Cubes 1st Normal Form

Data Lists

Flat Table One big flat file

Information Systems Department 48


What is De-normalization?
• It is not chaos, more like a “controlled crash” with the aim of
performance enhancement without loss of information.

• Normalization is a rule of thumb in DBMS, but in DSS ease of use is


achieved by way of de-normalization.

• De-normalization comes in many flavors, such as combining tables,


splitting tables, adding data etc., but all done very carefully.

Information Systems Department 49


Why De-normalization In DSS?
• Bringing “close” dispersed but related data items.

• Query performance in DSS significantly dependent on physical data


model.

• Very early studies showed performance difference in orders of


magnitude for different number de-normalized tables and rows per
table.

• The level of de-normalization should be carefully considered.


Information Systems Department 50
How De-normalization improves
performance?
• De-normalization specifically improves performance by
either:
• Reducing the number of tables and hence the reliance on joins, which
consequently speeds up performance.
• Reducing the number of joins required during query execution, or
• Reducing the number of rows to be retrieved from the Primary Data Table.

Information Systems Department 51


4 Guidelines for De-normalization
• 1. Carefully do a cost-benefit analysis (frequency of use, additional
storage, join time).

• 2. Do a data requirement and storage analysis.

• 3. Weigh against the maintenance issue of the redundant data


(triggers used).

• 4. When in doubt, don’t de-normalize.

Information Systems Department 52


Areas for Applying De-Normalization
Techniques
• Dealing with the abundance of star schemas.
• Fast access of time series data for analysis.
• Fast aggregate (sum, average etc.) results and complicated calculations.
• Multidimensional analysis (e.g. geography) in a complex hierarchy.
• Dealing with few updates but many join queries.

De-normalization will ultimately affect the database


size and query performance.

Information Systems Department 53


Five principal De-normalization techniques
• Collapsing Tables
• Two entities with a One-to-One relationship
• Two entities with a Many-to-Many relationship
• Splitting Tables (Horizontal/Vertical Splitting)
• Pre-Joining
• Adding Redundant Columns (Reference Data)
• Derived Attributes (Summary, Total, Balance etc…)

Information Systems Department 54


Collapsing Tables
ColA ColB
denormalized

ColA ColB ColC


normalized

ColA ColC

 Reduced storage space.


 Reduced update time.
 Does not changes business view.
 Reduced foreign keys.
 Reduced indexing.

Information Systems Department 55


Splitting Tables
Table
Table_v1 Table_v2
ColA ColB ColC
ColA ColB ColA ColC

Vertical Split
Table_h1 Table_h2
ColA ColB ColC ColA ColB ColC

Horizontal split
Information Systems Department 56
Splitting Tables: Horizontal splitting
• Breaks a table into multiple tables based upon common column
values. Example: Campus specific queries.
• GOAL
• Spreading rows for exploiting parallelism.
• Grouping data to avoid unnecessary query load in WHERE clause.

Information Systems Department 57


Splitting Tables: Horizontal splitting
• Advantages
• Enhance security of data.
• Organizing tables differently for different queries.
• Graceful degradation of database in case of table damage.
• Fewer rows result in flatter B-trees and fast data retrieval.

Information Systems Department 58


Splitting Tables: Vertical Splitting
• Infrequently accessed columns become extra “baggage” thus
degrading performance.
• Very useful for rarely accessed large text columns with large headers.
• Header size is reduced, allowing more rows per block, thus reducing
I/O.
• Splitting and distributing into separate files with repeating primary
key.
• For an end user, the split appears as a single table through a view.

Information Systems Department 59


Pre-joining
• Identify frequent joins and append the tables together in the physical
data model.

• Generally used for 1:M such as master-detail. RI is assumed to exist.

• Additional space is required as the master information is repeated in


the new header table.

Information Systems Department 60


Pre-Joining Sale_ID Sale_date Sale_person

normalized
1 M
Tx_ID Sale_ID Item_ID Item_Qty Sale_Rs Detail

Tx_ID Sale_ID Sale_date Sale_person Item_ID Item_Qty Sale_Rs


denormalized

Information Systems Department 61


Adding Redundant Columns
Table_1 Table_1’
ColA ColB ColA ColB ColC

Table_2 Table_2
ColA ColC ColD … ColZ ColA ColC ColD … ColZ

Information Systems Department 63


Adding Redundant Columns
• Columns can also be moved, instead of making them redundant. Very
similar to pre-joining as discussed earlier.
• Example:
• Frequent referencing of code in one table and corresponding description in
another table.
• A join is required.
• To eliminate the join, a redundant attribute added in the target entity which
is functionally independent of the primary key.

Information Systems Department 64


Redundant Columns: Surprise
Note that:
• Actually increases in storage space, and increase in update overhead.
• Keeping the actual table intact and unchanged helps enforce RI
constraint.
• Age old debate of RI ON or OFF.

Information Systems Department 65


Derived Attributes: Example
DWH Data Model
Business Data Model
#SID #SID
DoB DoB
Degree Degree
Course Course
Grade Grade
Credits Credits Derived attributes
GP  Calculated once
DoB: Date of Birth Age  Used Frequently

• Age is also a derived attribute, calculated as Current_Date – DoB (calculated


periodically).
• GP (Grade Point) column in the data warehouse data model is included as a derived
value. The formula for calculating this field is Grade*Credits.
Information Systems Department 66
Issues of De-Normalization
Issues of De-normalization
• Storage
• Performance
• Ease-of-use
• Maintenance

Information Systems Department 68


Industry Characteristics
Master:Detail Ratios
• Health care 1:2 ratio
• Video Rental 1:3 ratio
• Retail 1:30 ratio

Information Systems Department 69


Storage Issues: Pre-joining Facts
• Assume 1:2 record count ratio between claim master and detail for
health-care application
• Assume 10 million members, 20 million records in claim detail
• Assume 10 byte member_ID
• Assume 40 byte header for master and 60 byte header for detail
tables

Information Systems Department 70


Storage Issues: Pre-joining (Calculations)
With Normalization:
Total space used = 10 x 40 + 20 x 60 = 1.6 GB

After De-normalization:
Total space used = (60 + 40 – 10) x 20 = 1.8 GB

Net result is 12.5% additional space required in raw data table size for
the database.

Information Systems Department 71


Performance Issues: Pre-joining
Consider the query “How many members were paid claims during
last year?”
With Normalization:
Simply count the number of records in the master table.
After De-normalization:
The member_ID would be repeated, hence need a count distinct.
This will cause sorting on a larger table and degraded performance.

Information Systems Department 72


Why Performance Issues: Pre-joining
Depending on the query, the performance actually deteriorates with
de-normalization! This is due to the following three reasons:

• Forcing a sort due to count distinct.


• Using a table with 1.5 times header size.
• Using a table which is 2 times larger.
• Resulting in 3 times degradation in performance.

Bottom Line: Other than 0.2 GB additional space, also keep the 0.4
GB master table.

Information Systems Department 73


Performance Issues: Adding redundant
columns
Continuing with the previous Health-Care example, assuming a 60
byte detail table and 10 byte Sale_Person.
• Copying the Sale_Person to the detail table results in all scans taking 16%
longer than previously.
• Justifiable only if significant portion of queries get benefit by accessing the
de-normalized detail table.
• Need to look at the cost-benefit trade-off for each denormalization decision.

Information Systems Department 74


Other Issues: Adding redundant columns
Other issues include, increase in table size, maintenance and loss
of information:
• The size of the (largest table i.e.) transaction table increases by the size of the
Sale_Person key.
• For the example being considered, the detail table size increases from 1.2
GB to 1.32 GB.
• If the Sale_Person key changes (e.g. new 12 digit NID), then updates to be
reflected all the way to transaction table.
• In the absence of 1:M relationship, column movement will actually result in
loss of data.

Information Systems Department 75


Ease of use Issues: Horizontal Splitting
Horizontal splitting is a Divide & Conquer technique that exploits
parallelism. The conquer part of the technique is about combining the
results.
Lets see how it works for hash based splitting/partitioning.
• Assuming uniform hashing, hash splitting supports even data
distribution across all partitions in a pre-defined manner.
• However, hash based splitting is not easily reversible to eliminate the
split.

Information Systems Department 76


Ease of use Issues: Horizontal Splitting

Information Systems Department 77


Ease of use Issues: Horizontal Splitting

• Round robin and random splitting:


• Guarantee good data distribution.
• Almost impossible to reverse (or undo).
• Not pre-defined.

Information Systems Department 78


Ease of use Issues: Horizontal Splitting

• Range and expression splitting:


• Can facilitate partition elimination with a smart
optimizer.
• Generally lead to "hot spots” (uneven distribution
of data).

Information Systems Department 79


Performance Issues: Horizontal Splitting

Dramatic cancellation of
airline reservations after
9/11, resulting in “hot
Processors spot”

P1 P2 P3 P4

1998 1999 2000 2001


Splitting based on year

Information Systems Department 80


Performance issues: Vertical Splitting Facts
Example: Consider a 100 byte header for the member table such that
20 bytes provide complete coverage for 90% of the queries.

Split the member table into two parts as follows:


1. Frequently accessed portion of table (20 bytes), and
2. Infrequently accessed portion of table (80+ bytes). Why 80+?
Note that primary key (member_id) must be present in both tables for
eliminating the split.

Information Systems Department 81


Performance issues: Vertical Splitting Good
vs. Bad
• Scanning the claim table for most frequently used queries will be
500% faster with vertical splitting
• Ironically, for the “infrequently” accessed queries the performance
will be inferior as compared to the un-split table because of the join
overhead.

Information Systems Department 82


Summary
• Normalization and its different forms
• De-normalization techniques
• Trade-off between different factors when considering Normalization
and de-normalization.

Information Systems Department 83

You might also like