02 Normalization

Computer Science Faculty

Information Systems Department

Data warehousing & BI

Abdul Rahman Safi
Rafiullah Momand
With materials taken from Dr. Marcela Charfuelan, Dr. Ahsan Abdullah, Dr. Michael
Mannino, and Dr. Jahangir Karimi

Normalization
Objectives
• Purpose of normalization
• Problems associated with redundant data
• Identification of various types of update anomalies such as insertion,
deletion, and modification anomalies
• How to recognize appropriateness or quality of the design of relations
• De-normalization, its techniques, and Issues

Objectives
• How functional dependencies can be used to group attributes into
relations that are in a known normal form
• How to undertake the process of normalization
• How to identify the most commonly used normal forms, namely 1NF, 2NF,
3NF, and Boyce–Codd normal form (BCNF)
• How to identify fourth (4NF) and fifth (5NF) normal forms
• What De-normalization is
• The well-known De-normalization techniques

Normalization
• Main objective in developing a logical data model for relational
database systems is to create an accurate representation of the data,
its relationships, and constraints.
• To achieve this objective, we must identify a suitable set of relations.
• The four most commonly used normal forms are first (1NF), second (2NF),
and third (3NF) normal forms, and Boyce–Codd normal form (BCNF).
• These normal forms are based on functional dependencies among the
attributes of a relation.
• A relation can be normalized to a specific form to prevent the possible
occurrence of update anomalies.

Data Redundancy
• Major aim of relational database design is to group attributes into
relations to minimize data redundancy and reduce file storage space
required by base relations.
• Problems associated with data redundancy are illustrated by
comparing the following Staff and Branch relations with the
StaffBranch relation.

Data Redundancy and anomalies

Data Redundancy
• StaffBranch relation has redundant data: details of a branch are
repeated for every member of staff.
• In contrast, branch information appears only once for each branch in
the Branch relation and only branchNo is repeated in the Staff relation, to
represent where each member of staff works.

Update Anomalies
• Relations that contain redundant information may potentially suffer
from update anomalies.
• Types of update anomalies include:
• Insertion
• Deletion
• Modification.

Lossless-join and Dependency Preservation Properties
• Two important properties of decomposition:

- Lossless-join property enables us to find any instance of original
relation from corresponding instances in the smaller relations.
- Dependency preservation property enables us to enforce a
constraint on original relation by enforcing some constraint on each
of the smaller relations.
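
The lossless-join property can be illustrated with a small sketch. The StaffBranch rows below are invented for illustration, and `project` and `natural_join` are hypothetical helpers, not part of any library. Splitting on branchNo gives back exactly the original rows, whereas splitting on sName (which is not a key of either part) produces spurious tuples.

```python
def project(rows, attrs):
    """Project a relation (list of dicts) onto attrs, dropping duplicate tuples."""
    return {tuple(sorted((a, row[a]) for a in attrs)) for row in rows}


def natural_join(left, right):
    """Natural join of two relations given as sets of sorted (attr, value) tuples."""
    joined = set()
    for lt in left:
        for rt in right:
            ld, rd = dict(lt), dict(rt)
            # Tuples combine only if they agree on every shared attribute.
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                joined.add(tuple(sorted({**ld, **rd}.items())))
    return joined


# Hypothetical StaffBranch instance (values invented for illustration).
staff_branch = [
    {"staffNo": "S1", "sName": "Ann", "branchNo": "B1", "bAddress": "Kabul"},
    {"staffNo": "S2", "sName": "Bob", "branchNo": "B1", "bAddress": "Kabul"},
    {"staffNo": "S3", "sName": "Ann", "branchNo": "B2", "bAddress": "Herat"},
]
original = project(staff_branch, list(staff_branch[0]))

# Decomposition sharing branchNo: the join reproduces the original relation.
staff = project(staff_branch, ["staffNo", "sName", "branchNo"])
branch = project(staff_branch, ["branchNo", "bAddress"])
lossless = natural_join(staff, branch) == original

# Decomposition sharing sName: the join contains extra (spurious) tuples.
part1 = project(staff_branch, ["staffNo", "sName"])
part2 = project(staff_branch, ["sName", "branchNo", "bAddress"])
lossy_join = natural_join(part1, part2)
```

In this sketch `lossless` is true for the first decomposition, while `lossy_join` is a strict superset of the original rows for the second.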

Functional Dependency
• Main concept associated with normalization.
• Functional Dependency
• Describes relationship between attributes in a relation.
• If A and B are attributes of relation R, B is functionally dependent on A
(denoted A → B) if each value of A in R is associated with exactly one value of
B in R.

• Property of the meaning (or semantics) of the attributes in a
relation.
• Diagrammatic representation:
• The determinant of a functional dependency refers to the
attribute or group of attributes on the left-hand side
of the arrow.
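
A minimal sketch of testing a functional dependency against sample rows. The Staff rows and the helper name `holds` are assumptions made for illustration. Because a dependency is a property of the meaning of the attributes, sample data can only refute a dependency; it cannot prove that it holds in general.

```python
def holds(rows, lhs, rhs):
    """Return True if every lhs value maps to exactly one rhs value in rows."""
    seen = {}
    for row in rows:
        determinant = tuple(row[a] for a in lhs)
        dependent = tuple(row[a] for a in rhs)
        # setdefault stores the first dependent value seen for this determinant.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True


# Hypothetical Staff instance (values invented for illustration).
staff = [
    {"staffNo": "SL21", "position": "Manager", "branchNo": "B005"},
    {"staffNo": "SG37", "position": "Assistant", "branchNo": "B003"},
    {"staffNo": "SG14", "position": "Supervisor", "branchNo": "B003"},
    {"staffNo": "SA9", "position": "Assistant", "branchNo": "B007"},
]

staffno_determines_position = holds(staff, ["staffNo"], ["position"])
position_determines_staffno = holds(staff, ["position"], ["staffNo"])
```

Here staffNo → position is consistent with the sample, while position → staffNo is refuted because two staff members share the position Assistant.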

Example - Functional Dependency

• Main characteristics of functional dependencies used in
normalization:
• have a 1:1 relationship between attribute(s) on left and right-hand side of a
dependency;
• hold for all time;
• are nontrivial.

• Complete set of functional dependencies for a given relation can be
very large.
• Important to find an approach that can reduce set to a manageable
size.
• Need to identify set of functional dependencies (X) for a relation that
is smaller than complete set of functional dependencies (Y) for that
relation and has property that every functional dependency in Y is
implied by functional dependencies in X.

• The set of all functional dependencies implied by a given set of functional
dependencies X is called the closure of X (written X+).
• A set of inference rules, called Armstrong’s axioms, specifies how new
functional dependencies can be inferred from given ones.

• Let A, B, and C be subsets of the attributes of relation R. Armstrong’s
axioms are as follows:
(1) Reflexivity
If B is a subset of A, then A → B
(2) Augmentation
If A → B, then A,C → B,C
(3) Transitivity
If A → B and B → C, then A → C

Further rules can be derived from the three axioms above:
(4) Self-determination:
A → A
(5) Decomposition:
If A → B,C, then A → B and A → C
(6) Union:
If A → B and A → C, then A → B,C
(7) Composition:
If A → B and C → D then A,C → B,D
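
In practice, whether a dependency is implied by a given set is usually tested with the attribute closure rather than by enumerating X+. The sketch below computes the set of attributes determined by a starting set; the dependencies over A to E are invented for illustration, and `closure` and `implies` are hypothetical helper names.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (pairs of lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the determinant is already covered, its right-hand side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds."""
    return set(rhs) <= closure(lhs, fds)


# Hypothetical dependencies: A -> B, B -> C, and C,D -> E.
fds = [(("A",), ("B",)), (("B",), ("C",)), (("C", "D"), ("E",))]

a_closure = closure({"A"}, fds)
ad_closure = closure({"A", "D"}, fds)
a_implies_c = implies(fds, ("A",), ("C",))
```

With these dependencies, A → C follows by transitivity, and {A, D} determines every attribute, so it is a superkey of a relation over A to E.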

The Process of Normalization
• Formal technique for analyzing a relation based on its primary key
and functional dependencies between its attributes.
• Often executed as a series of steps. Each step corresponds to a
specific normal form, which has known properties.
• As normalization proceeds, relations become progressively more
restricted (stronger) in format and also less vulnerable to update
anomalies.

Relationship Between Normal Forms

Un-normalized Form (UNF)
• A table that contains one or more repeating groups.
• To create an un-normalized table:
• transform data from the information source (e.g. a form) into table format with
columns and rows.

First Normal Form (1NF)
• A relation in which intersection of each row and column contains one
and only one value.

UNF to 1NF
• Nominate an attribute or group of attributes to act as the key for the
unnormalized table.
• Identify repeating group(s) in the unnormalized table that repeat for
the key attribute(s).

UNF to 1NF
• Remove repeating group by:
• entering appropriate data into the empty columns of rows containing
repeating data (‘flattening’ the table).
Or by
• placing repeating data along with copy of the original key attribute(s) into a
separate relation.
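
Both ways of removing a repeating group can be sketched on a small unnormalized table. The client and rental data below are invented for illustration, with the repeating group held as a nested list.

```python
# Hypothetical unnormalized table: each client row carries a repeating group.
unf = [
    {"clientNo": "C1", "cName": "Ann",
     "rentals": [{"propertyNo": "P1", "rent": 350},
                 {"propertyNo": "P2", "rent": 450}]},
    {"clientNo": "C2", "cName": "Bob",
     "rentals": [{"propertyNo": "P1", "rent": 350}]},
]

# Option 1: flatten the table by repeating the client data on every rental row.
flat = [
    {"clientNo": c["clientNo"], "cName": c["cName"], **r}
    for c in unf
    for r in c["rentals"]
]

# Option 2: move the repeating group to its own relation with a copy of the key.
client = [{"clientNo": c["clientNo"], "cName": c["cName"]} for c in unf]
rental = [{"clientNo": c["clientNo"], **r} for c in unf for r in c["rentals"]]
```

The first option keeps one relation but introduces redundancy (cName repeats per rental); the second keeps the client data once and links rentals through clientNo.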

Second Normal Form (2NF)
• Based on concept of full functional dependency:
• A and B are attributes of a relation,
• B is fully dependent on A if B is functionally dependent on A but not on any
proper subset of A.
• 2NF - A relation that is in 1NF and in which every non-primary-key attribute is
fully functionally dependent on the primary key.

1NF to 2NF
• Identify primary key for the 1NF relation.
• Identify functional dependencies in the relation.
• If partial dependencies exist on the primary key, remove them by
placing them in a new relation along with a copy of their determinant.
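
The steps above can be sketched on attribute names alone. The relation, its primary key, and its dependencies are assumptions chosen for illustration, and the sketch assumes each determinant appears in only one dependency.

```python
def partial_dependencies(primary_key, fds):
    """Dependencies whose determinant is a proper subset of the primary key."""
    key = set(primary_key)
    return [(lhs, rhs) for lhs, rhs in fds if set(lhs) < key and set(rhs) - key]


def to_2nf(attributes, primary_key, fds):
    """Move partially dependent attributes out, with a copy of their determinant."""
    remaining = list(attributes)
    new_relations = []
    for lhs, rhs in partial_dependencies(primary_key, fds):
        moved = [a for a in rhs if a not in primary_key]
        new_relations.append(list(lhs) + moved)
        remaining = [a for a in remaining if a not in moved]
    return remaining, new_relations


# Hypothetical 1NF relation with composite primary key (clientNo, propertyNo).
attributes = ["clientNo", "propertyNo", "cName", "pAddress", "rent", "rentStart"]
primary_key = ("clientNo", "propertyNo")
fds = [
    (("clientNo", "propertyNo"), ("rentStart",)),  # full dependency
    (("clientNo",), ("cName",)),                   # partial dependency
    (("propertyNo",), ("pAddress", "rent")),       # partial dependency
]

remaining, new_relations = to_2nf(attributes, primary_key, fds)
```

The original relation keeps the key and the fully dependent attribute, and each partial dependency becomes its own relation keyed by its determinant.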

Third Normal Form (3NF)
• Based on concept of transitive dependency:
• A, B and C are attributes of a relation such that if A → B and B → C,
• then C is transitively dependent on A through B (provided that A is not
functionally dependent on B or C).
• 3NF - A relation that is in 1NF and 2NF and in which no non-primary-key
attribute is transitively dependent on the primary key.

2NF to 3NF
• Identify the primary key in the 2NF relation.
• Identify functional dependencies in the relation.
• If transitive dependencies exist on the primary key, remove them by
placing them in a new relation along with a copy of their determinant.
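
A minimal sketch of removing a transitive dependency by projection. The PropertyOwner rows are invented for illustration, and the sketch assumes propertyNo → ownerNo and ownerNo → oName, so that oName depends on the key only through ownerNo.

```python
def project(rows, attrs):
    """Project rows onto attrs, dropping duplicates and keeping first-seen order."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            out.append(dict(zip(attrs, values)))
    return out


# Hypothetical 2NF relation with primary key propertyNo.
property_owner = [
    {"propertyNo": "P1", "rent": 350, "ownerNo": "O1", "oName": "Tina"},
    {"propertyNo": "P2", "rent": 450, "ownerNo": "O1", "oName": "Tina"},
    {"propertyNo": "P3", "rent": 375, "ownerNo": "O2", "oName": "Carol"},
]

# oName moves to a new relation together with a copy of its determinant ownerNo.
property_rel = project(property_owner, ["propertyNo", "rent", "ownerNo"])
owner_rel = project(property_owner, ["ownerNo", "oName"])
```

After the split, each owner name is stored once, so changing it no longer requires touching every property row.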

General Definitions of 2NF and 3NF
• Second normal form (2NF)
• A relation that is in 1NF and in which every non-candidate-key attribute is fully
functionally dependent on any candidate key.
• Third normal form (3NF)
• A relation that is in 1NF and 2NF and in which no non-candidate-key attribute is
transitively dependent on any candidate key.

Boyce–Codd Normal Form (BCNF)
• Based on functional dependencies that take into account all
candidate keys in a relation; however, BCNF also has additional
constraints compared with the general definition of 3NF.
• BCNF - A relation is in BCNF if and only if every determinant is a
candidate key.

Boyce–Codd normal form (BCNF)
• The difference between 3NF and BCNF is that for a functional dependency
A → B, 3NF allows this dependency in a relation if B is a primary-key
attribute and A is not a candidate key.
• Whereas BCNF insists that for this dependency to remain in a
relation, A must be a candidate key.
• Every relation in BCNF is also in 3NF. However, a relation in 3NF may
not be in BCNF.
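
A minimal sketch of a BCNF check. It flags every nontrivial dependency whose determinant does not determine all attributes of the relation, that is, whose determinant is not a superkey; for dependencies with minimal determinants this matches the candidate-key wording above. The interview relation and its dependencies are assumptions chosen for illustration.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (pairs of lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Nontrivial dependencies whose determinant is not a superkey."""
    everything = set(attributes)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != everything
    ]


# Hypothetical relation with overlapping composite candidate keys.
attributes = ["clientNo", "interviewDate", "interviewTime", "staffNo", "roomNo"]
fds = [
    (("clientNo", "interviewDate"), ("interviewTime", "staffNo", "roomNo")),
    (("staffNo", "interviewDate", "interviewTime"), ("clientNo",)),
    (("roomNo", "interviewDate", "interviewTime"), ("staffNo", "clientNo")),
    (("staffNo", "interviewDate"), ("roomNo",)),
]

violations = bcnf_violations(attributes, fds)
```

Only the last dependency is reported: (staffNo, interviewDate) determines roomNo but is not a key, so the relation can be in 3NF and still fail BCNF.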

Boyce–Codd normal form (BCNF)
• Violation of BCNF is quite rare.
• Potential to violate BCNF may occur in a relation that:
• contains two (or more) composite candidate keys; and
• has candidate keys that overlap (i.e. have at least one attribute in common).

Review of Normalization (UNF to BCNF)




Fourth Normal Form (4NF)
• Although BCNF removes anomalies due to functional dependencies,
another type of dependency called a multi-valued dependency (MVD)
can also cause data redundancy.
• Possible existence of MVDs in a relation is due to 1NF and can result
in data redundancy.

Fourth Normal Form (4NF) - MVD
• Dependency between attributes (for example, A, B, and C) in a
relation, such that for each value of A there is a set of values for B and
a set of values for C. However, set of values for B and C are
independent of each other.

• An MVD between attributes A, B, and C in a relation is represented using
the following notation:
A -->> B
A -->> C

• MVD can be further defined as being trivial or nontrivial.
• MVD A -->> B in relation R is defined as being trivial if (a) B is a subset of A or (b)
A ∪ B = R.
• MVD is defined as being nontrivial if neither (a) nor (b) are satisfied.
• Trivial MVD does not specify a constraint on a relation, while a nontrivial MVD
does specify a constraint.
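
A minimal sketch of checking an MVD on sample rows. For each value of A, the rows must contain every combination of the B values and the remaining attribute values seen with that A. The branch, staff, and owner data and the helper name `mvd_holds` are assumptions made for illustration.

```python
from itertools import product


def mvd_holds(rows, a, b):
    """Check A -->> B on a non-empty list of dict rows sharing the same keys."""
    rest = [x for x in rows[0] if x not in a and x not in b]
    groups = {}
    for row in rows:
        key = tuple(row[x] for x in a)
        pair = (tuple(row[x] for x in b), tuple(row[x] for x in rest))
        groups.setdefault(key, set()).add(pair)
    for pairs in groups.values():
        b_values = {p[0] for p in pairs}
        rest_values = {p[1] for p in pairs}
        # Independence means all combinations of B and the rest must be present.
        if set(product(b_values, rest_values)) != pairs:
            return False
    return True


# Hypothetical relation: staff and owners of a branch are independent of each other.
rows_ok = [
    {"branchNo": "B003", "sName": "Ann", "oName": "Carol"},
    {"branchNo": "B003", "sName": "Ann", "oName": "Tina"},
    {"branchNo": "B003", "sName": "David", "oName": "Carol"},
    {"branchNo": "B003", "sName": "David", "oName": "Tina"},
]
rows_missing_one = rows_ok[:3]

mvd_ok = mvd_holds(rows_ok, ["branchNo"], ["sName"])
mvd_broken = mvd_holds(rows_missing_one, ["branchNo"], ["sName"])
```

The four-row instance satisfies branchNo -->> sName, at the cost of storing every staff and owner combination; dropping any one row breaks the dependency.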

• 4NF - A relation that is in BCNF and contains no nontrivial MVDs.
• 4NF Example:

Fifth Normal Form (5NF)
• A relation decomposed into two relations must have lossless-join
property, which ensures that no spurious tuples are generated when
relations are reunited through a natural join.
• However, there are requirements to decompose a relation into more
than two relations.
• Although rare, these cases are managed by join dependency and fifth
normal form (5NF).

Fifth Normal Form (5NF)
• A relation that has no join dependency.

5NF - Example

Other Normal Forms
• There are other NFs that are mainly of academic interest, including:
• 6th NF
• EKNF
• RFNF
• SKNF
• DKNF

Normalization Exercise

De-Normalization
Striking a Balance Between “Good” and “Evil”
Figure: a spectrum running from de-normalization to normalization:
one big flat file (flat table), data lists, data cubes, 1st Normal Form,
2nd Normal Form, 3rd Normal Form, 4+ Normal Forms (too many tables).

What is De-normalization?
• It is not chaos, more like a “controlled crash” with the aim of
performance enhancement without loss of information.
• Normalization is a rule of thumb in DBMS, but in DSS ease of use is
achieved by way of de-normalization.
• De-normalization comes in many flavors, such as combining tables,
splitting tables, adding data, etc., but all done very carefully.

Why De-normalization In DSS?
• Bringing “close” dispersed but related data items.
• Query performance in DSS is significantly dependent on the physical data
model.
• Very early studies showed performance differences of orders of
magnitude for different numbers of de-normalized tables and rows per
table.
• The level of de-normalization should be carefully considered.

How does De-normalization improve performance?
• De-normalization specifically improves performance by
either:
• Reducing the number of tables and hence the reliance on joins, which
consequently speeds up performance,
• Reducing the number of joins required during query execution, or
• Reducing the number of rows to be retrieved from the Primary Data Table.

4 Guidelines for De-normalization
• 1. Carefully do a cost-benefit analysis (frequency of use, additional
storage, join time).
• 2. Do a data requirement and storage analysis.
• 3. Weigh against the maintenance issue of the redundant data
(triggers used).
• 4. When in doubt, don’t de-normalize.

Areas for Applying De-Normalization Techniques
• Dealing with the abundance of star schemas.
• Fast access of time series data for analysis.
• Fast aggregate (sum, average etc.) results and complicated calculations.
• Multidimensional analysis (e.g. geography) in a complex hierarchy.
• Dealing with few updates but many join queries.
De-normalization will ultimately affect the database
size and query performance.

Five principal De-normalization techniques
• Collapsing Tables
• Two entities with a One-to-One relationship
• Two entities with a Many-to-Many relationship
• Splitting Tables (Horizontal/Vertical Splitting)
• Pre-Joining
• Adding Redundant Columns (Reference Data)
• Derived Attributes (Summary, Total, Balance etc…)

Collapsing Tables
Figure: two normalized tables, (ColA, ColB) and (ColA, ColC), are collapsed
into one denormalized table (ColA, ColB, ColC).
• Reduced storage space.
• Reduced update time.
• Does not change the business view.
• Reduced foreign keys.
• Reduced indexing.

Splitting Tables
Figure: a table (ColA, ColB, ColC) split in two ways.
Vertical split: Table_v1 (ColA, ColB) and Table_v2 (ColA, ColC).
Horizontal split: Table_h1 (ColA, ColB, ColC) and Table_h2 (ColA, ColB, ColC).

Splitting Tables: Horizontal splitting
• Breaks a table into multiple tables based upon common column
values. Example: Campus specific queries.
• GOAL
• Spreading rows for exploiting parallelism.
• Grouping data to avoid unnecessary query load in WHERE clause.

Splitting Tables: Horizontal splitting
• Advantages
• Enhance security of data.
• Organizing tables differently for different queries.
• Graceful degradation of database in case of table damage.
• Fewer rows result in flatter B-trees and fast data retrieval.

Splitting Tables: Vertical Splitting
• Infrequently accessed columns become extra “baggage” thus
degrading performance.
• Very useful for rarely accessed large text columns with large headers.
• Header size is reduced, allowing more rows per block, thus reducing
I/O.
• Splitting and distributing into separate files with repeating primary
key.
• For an end user, the split appears as a single table through a view.

Pre-joining
• Identify frequent joins and append the tables together in the physical
data model.
• Generally used for 1:M relationships such as master-detail. Referential integrity (RI) is assumed to exist.
• Additional space is required as the master information is repeated in
the new header table.

Pre-Joining
Figure: normalized, a master table (Sale_ID, Sale_date, Sale_person) has a
1:M relationship with a detail table (Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs);
denormalized, a single table (Tx_ID, Sale_ID, Sale_date, Sale_person, Item_ID,
Item_Qty, Sale_Rs).
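
A minimal sketch of pre-joining with the column names used above. The rows are invented for illustration, and referential integrity is assumed, so every detail row has a matching master row.

```python
# Hypothetical master (sale) and detail (transaction) rows.
sale = [
    {"Sale_ID": 1, "Sale_date": "2024-01-05", "Sale_person": "SP1"},
    {"Sale_ID": 2, "Sale_date": "2024-01-06", "Sale_person": "SP2"},
]
detail = [
    {"Tx_ID": 10, "Sale_ID": 1, "Item_ID": "I1", "Item_Qty": 2, "Sale_Rs": 200},
    {"Tx_ID": 11, "Sale_ID": 1, "Item_ID": "I2", "Item_Qty": 1, "Sale_Rs": 150},
    {"Tx_ID": 12, "Sale_ID": 2, "Item_ID": "I1", "Item_Qty": 5, "Sale_Rs": 500},
]

# Index the master rows by key, then repeat them on every detail row.
master_by_id = {m["Sale_ID"]: m for m in sale}
prejoined = [{**master_by_id[d["Sale_ID"]], **d} for d in detail]
```

The pre-joined table has one row per detail row, and Sale_date and Sale_person are repeated for every item of the same sale, which is where the additional space goes.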

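Adding Redundant Columns
Figure: Table_1 (ColA, ColB) and Table_2 (ColA, ColC, ColD … ColZ); after
de-normalization, Table_1’ (ColA, ColB, ColC) carries a redundant copy of ColC,
while Table_2 (ColA, ColC, ColD … ColZ) is unchanged.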

Adding Redundant Columns
• Columns can also be moved, instead of making them redundant. Very
similar to pre-joining as discussed earlier.
• Example:
• Frequent referencing of code in one table and corresponding description in
another table.
• A join is required.
• To eliminate the join, a redundant attribute added in the target entity which
is functionally independent of the primary key.

Redundant Columns: Surprise
Note that:
• Actually increases storage space and update overhead.
• Keeping the actual table intact and unchanged helps enforce RI
constraint.
• Age old debate of RI ON or OFF.

Derived Attributes: Example
Figure: the business data model holds #SID, DoB (Date of Birth), Degree, Course,
Grade, and Credits; the DWH data model holds the same columns plus the derived
attributes GP and Age, which are calculated once and used frequently.
• Age is also a derived attribute, calculated as Current_Date – DoB (calculated
periodically).
• GP (Grade Point) column in the data warehouse data model is included as a derived
value. The formula for calculating this field is Grade*Credits.
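
A minimal sketch of computing the two derived attributes at load time. The student row is invented for illustration, Grade is assumed to be stored as a numeric grade value, and Age is assumed to be counted in whole years.

```python
from datetime import date


def age_in_years(dob, as_of):
    """Whole years between dob and as_of (one less if the birthday has not yet passed)."""
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))


def add_derived(row, as_of):
    """Return a copy of row with the derived GP and Age columns added."""
    return {
        **row,
        "GP": row["Grade"] * row["Credits"],    # Grade*Credits, calculated once
        "Age": age_in_years(row["DoB"], as_of),  # refreshed periodically
    }


# Hypothetical business-model row.
student = {"SID": 1, "DoB": date(2000, 6, 15), "Degree": "BSc",
           "Course": "DWH", "Grade": 3.5, "Credits": 3}

enriched = add_derived(student, as_of=date(2024, 6, 14))
```

Passing the reference date explicitly keeps the periodic recalculation of Age repeatable; GP never changes once Grade and Credits are fixed.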

Issues of De-normalization
• Storage
• Performance
• Ease-of-use
• Maintenance

Industry Characteristics
Master:Detail Ratios
• Health care 1:2 ratio
• Video Rental 1:3 ratio
• Retail 1:30 ratio

Storage Issues: Pre-joining Facts
• Assume 1:2 record count ratio between claim master and detail for
health-care application
• Assume 10 million members, 20 million records in claim detail
• Assume 10 byte member_ID
• Assume 40 byte header for master and 60 byte header for detail
tables

Storage Issues: Pre-joining (Calculations)
With Normalization:
Total space used = 10 million x 40 bytes + 20 million x 60 bytes = 0.4 GB + 1.2 GB = 1.6 GB
After De-normalization:
Total space used = (60 + 40 – 10) bytes x 20 million = 1.8 GB
Net result is 12.5% additional space required in raw data table size for
the database.
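
The arithmetic can be reproduced with a short sketch that uses the figures assumed above and takes 1 GB as 10^9 bytes; the variable names are illustrative.

```python
MILLION = 10**6
GB = 10**9  # decimal gigabyte, as in the slide figures

master_rows, detail_rows = 10 * MILLION, 20 * MILLION
master_bytes, detail_bytes, key_bytes = 40, 60, 10

# Normalized: master and detail tables are stored separately.
normalized = master_rows * master_bytes + detail_rows * detail_bytes

# Pre-joined: every detail row also carries the master columns;
# the shared member_ID is stored only once per row.
denormalized = detail_rows * (detail_bytes + master_bytes - key_bytes)

extra_fraction = (denormalized - normalized) / normalized

print(normalized / GB, denormalized / GB, extra_fraction)
```

Changing the master:detail ratio in this sketch shows how the overhead grows for industries with more detail rows per master row.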

Performance Issues: Pre-joining
Consider the query “How many members were paid claims during
last year?”
With Normalization:
Simply count the number of records in the master table.
After De-normalization:
The member_ID would be repeated, hence need a count distinct.
This will cause sorting on a larger table and degraded performance.
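
A minimal sketch of the two query shapes, using Python's built-in sqlite3 module. The table names, columns, and rows are hypothetical, and the date filter for "last year" is left out to keep the example short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE claim_master (member_ID TEXT PRIMARY KEY, member_name TEXT);
    CREATE TABLE claim_detail (claim_ID INTEGER PRIMARY KEY, member_ID TEXT, amount REAL);

    INSERT INTO claim_master VALUES ('M1', 'Ann'), ('M2', 'Bob');
    INSERT INTO claim_detail VALUES (1, 'M1', 100.0), (2, 'M1', 50.0),
                                    (3, 'M2', 75.0), (4, 'M2', 20.0);

    -- Pre-joined (de-normalized) table: master columns repeated per detail row.
    CREATE TABLE claim_prejoined AS
        SELECT d.claim_ID, d.member_ID, m.member_name, d.amount
        FROM claim_detail d
        JOIN claim_master m ON m.member_ID = d.member_ID;
""")

# Normalized: one row per member, so a plain count is enough.
normalized_count = con.execute("SELECT COUNT(*) FROM claim_master").fetchone()[0]

# De-normalized: member_ID repeats, so a plain count overstates the answer.
naive_count = con.execute("SELECT COUNT(*) FROM claim_prejoined").fetchone()[0]
distinct_count = con.execute(
    "SELECT COUNT(DISTINCT member_ID) FROM claim_prejoined"
).fetchone()[0]
```

The distinct count gives the same answer as the master table, but it has to de-duplicate member_ID over the larger pre-joined table.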

Why Performance Issues: Pre-joining
Depending on the query, the performance actually deteriorates with
de-normalization! This is due to the following three reasons:
• Forcing a sort due to count distinct.
• Using a table with 1.5 times the header size.
• Using a table which is 2 times larger.
Together, the last two factors (1.5 x 2) result in a 3 times degradation in performance.
Bottom Line: Other than the 0.2 GB of additional space, also keep the 0.4
GB master table.

Performance Issues: Adding redundant columns
Continuing with the previous Health-Care example, assuming a 60
byte detail table and a 10 byte Sale_Person.
• Copying the Sale_Person to the detail table results in all scans taking about 16%
longer than previously (10 extra bytes on each 60 byte row).
• Justifiable only if a significant portion of queries benefit from accessing the
de-normalized detail table.
• Need to look at the cost-benefit trade-off for each de-normalization decision.

Other Issues: Adding redundant columns
Other issues include an increase in table size, maintenance, and loss
of information:
• The size of the (largest table i.e.) transaction table increases by the size of the
Sale_Person key.
• For the example being considered, the detail table size increases from 1.2
GB to 1.4 GB (20 million rows x 70 bytes).
• If the Sale_Person key changes (e.g. new 12 digit NID), then the update must be
propagated all the way to the transaction table.
• In the absence of a 1:M relationship, column movement will actually result in
loss of data.

Ease of use Issues: Horizontal Splitting
Horizontal splitting is a Divide & Conquer technique that exploits
parallelism. The conquer part of the technique is about combining the
results.
Let's see how it works for hash based splitting/partitioning.
• Assuming uniform hashing, hash splitting supports even data
distribution across all partitions in a pre-defined manner.
• However, hash based splitting is not easily reversible to eliminate the
split.


• Round robin and random splitting:
• Guarantee good data distribution.
• Almost impossible to reverse (or undo).
• Not pre-defined.

• Range and expression splitting:
• Can facilitate partition elimination with a smart
optimizer.
• Generally lead to “hot spots” (uneven distribution
of data).
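
A minimal sketch contrasting hash and range splitting on skewed data. The booking rows are invented for illustration, and a simple modulo on booking_id stands in for a real hash function.

```python
from collections import Counter

# Hypothetical bookings: most rows fall in the last year (skewed data).
years = [1998] * 10 + [1999] * 10 + [2000] * 10 + [2001] * 70
rows = [{"booking_id": i, "year": y} for i, y in enumerate(years)]

PARTITIONS = 4


def hash_partition(row):
    """Hash splitting: modulo on the key, used here as a stand-in hash function."""
    return row["booking_id"] % PARTITIONS


def range_partition(row):
    """Range splitting: one partition per year, 1998 to 2001."""
    return row["year"] - 1998


hash_load = Counter(hash_partition(r) for r in rows)
range_load = Counter(range_partition(r) for r in rows)

# Partitions that a query for year 2000 has to read under each scheme.
hash_touched = {hash_partition(r) for r in rows if r["year"] == 2000}
range_touched = {range_partition(r) for r in rows if r["year"] == 2000}
```

Hash splitting spreads the rows evenly but a query on one year touches every partition; range splitting lets that query read a single partition, at the price of a hot spot where the data is dense.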

Performance Issues: Horizontal Splitting
Figure: a table split by year (1998, 1999, 2000, 2001) across processors
P1, P2, P3, and P4. The dramatic cancellation of airline reservations after
9/11 results in a “hot spot”.

Performance issues: Vertical Splitting Facts
Example: Consider a 100 byte header for the member table such that
20 bytes provide complete coverage for 90% of the queries.
Split the member table into two parts as follows:

1. Frequently accessed portion of table (20 bytes), and
2. Infrequently accessed portion of table (80+ bytes). Why 80+?
Note that primary key (member_id) must be present in both tables for
eliminating the split.

Performance issues: Vertical Splitting Good vs. Bad
• Scanning the member table for the most frequently used queries will be
about five times faster with vertical splitting (20 bytes per row instead of 100).
• Ironically, for the “infrequently” accessed queries the performance
will be inferior as compared to the un-split table because of the join
overhead.

Summary
• Normalization and its different forms
• De-normalization techniques
• Trade-off between different factors when considering Normalization
and de-normalization.
