DBMS Notes
Database: A database is a collection of inter-related data which helps in efficient retrieval, insertion and deletion of data.
DDL is short for Data Definition Language, which deals with database schemas and
descriptions of how the data should reside in the database.
DML is short for Data Manipulation Language, which deals with data manipulation
and includes the most common SQL statements such as SELECT, INSERT, UPDATE
and DELETE; it is used to store, modify, retrieve and delete data in a database.
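As a rough sketch of the difference (the STUDENT table and its columns here are only for illustration):

-- DDL: defines the schema
CREATE TABLE STUDENT (
    Roll_No INT PRIMARY KEY,
    Name    VARCHAR(50),
    Age     INT
);

-- DML: stores, retrieves, modifies and deletes data in that schema
INSERT INTO STUDENT (Roll_No, Name, Age) VALUES (1, 'Asha', 20);
SELECT Name, Age FROM STUDENT WHERE Roll_No = 1;
UPDATE STUDENT SET Age = 21 WHERE Roll_No = 1;
DELETE FROM STUDENT WHERE Roll_No = 1;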
Advantages of DBMS
1. Minimized redundancy and data inconsistency: Data is normalized in DBMS to
minimize the redundancy which helps in keeping data consistent.
2. Simplified Data Access: A user needs only the name of the relation, not its exact
location, to access data, so the process is very simple.
3. Multiple data views: Different views of the same data can be created to cater to the
needs of different users. For example, faculty salary information can be hidden from
the student view of the data but shown in the admin view.
4. Data Security: Only authorized users are allowed to access the data in DBMS. Also,
data can be encrypted by DBMS which makes it secure.
5. Concurrent access to data: Data can be accessed by different users at the same
time in a DBMS.
6. Backup and Recovery mechanism: DBMS backup and recovery mechanism helps to
avoid data loss and data inconsistency in case of catastrophic failures.
Levels in DBMS
1. Physical Level: At the physical level, information about the location of database
objects in the data store is kept.
2. Conceptual Level/logical: At conceptual level, data is represented in the form of
various database tables. For Example, STUDENT database may contain STUDENT
and COURSE tables which will be visible to users but users are unaware of their
storage.
3. External Level/views: An external level specifies a view of the data in terms of
conceptual level tables. For example, a professor might only want to see the marks of
the students and may not be interested in their other details (see the sketch after this list).
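A minimal sketch of an external-level view (the FACULTY table and its columns are assumed for illustration): the conceptual level holds the full table, while a view exposes only what a class of users should see.

-- Conceptual level: full FACULTY table, including salary
CREATE TABLE FACULTY (
    Fac_Id INT PRIMARY KEY,
    Name   VARCHAR(50),
    Dept   VARCHAR(30),
    Salary DECIMAL(10,2)
);

-- External level: student-facing view that hides the salary column
CREATE VIEW FACULTY_PUBLIC AS
SELECT Fac_Id, Name, Dept FROM FACULTY;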
Data independence means a change of data at one level should not affect another
level. Two types of data independence are present in this architecture:
1. Physical Data Independence: Any change in the physical location of tables and
indexes should not affect the conceptual level or external view of data.
2. Conceptual Data Independence: The data at conceptual level schema and
external level schema must be independent. This means a change in conceptual
schema should not affect external schema. e.g.; Adding or deleting attributes of
a table should not affect the user’s view of the table.
DBMS Architecture 2-Level, 3-Level
Two tier architecture:
Two tier architecture is similar to a basic client-server model. The application at the
client end directly communicates with the database at the server side. APIs like
ODBC and JDBC are used for this interaction. Advantages of this type are that
maintenance and understanding are easier and it is compatible with existing systems.
However, this model gives poor performance when there is a large number of users.
4. Sophisticated Users :
Sophisticated users can be engineers, scientists or business analysts who are familiar with the
database. They interact with the database by writing SQL queries directly through the query
processor.
5. Data Base Designers :
Database designers are the users who design the structure of the database, which includes tables,
indexes, views, constraints, triggers and stored procedures.
6. Application Programmers :
Application programmers are the back-end programmers who write the code for the application
programs.
7. Casual Users / Temporary Users :
Casual users are the users who occasionally access the database, but each time they
access it they require new information; for example, middle or higher level
managers.
Disadvantages of DBMS
1. Increased Cost:
There are different types of costs:
1. Cost of Hardware and Software –
2. Cost of Staff Training –
ER MODEL
The ER Model is used to model the logical view of the system from the data perspective
and consists of these components:
Entity, Entity Type, Entity Set –
An entity may be an object with a physical existence – a particular person, car,
house, or employee.
An entity is an instance of an entity type, and the set of all entities of a type is called an entity set.
Attributes are the properties which define an entity type. For example, Roll_No,
Name, DOB, Age, Address and Mobile_No are the attributes which define the entity type
Student. In an ER diagram, an attribute is represented by an oval.
1. Key Attribute –
The attribute which uniquely identifies each entity in the entity set is called the key
attribute. For example, Roll_No will be unique for each student. In an ER diagram, a
key attribute is represented by an oval with its name underlined.
2. Composite Attribute –
An attribute composed of many other attributes is called a composite attribute,
e.g., Address (composed of Street, City, PIN).
3. Multivalued Attribute –
An attribute consisting of more than one value for a given entity, e.g., Mobile_No.
4. Derived Attribute –
An attribute which can be derived from other attributes of the entity type is
known as a derived attribute, e.g., Age (can be derived from DOB); see the sketch below.
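A derived attribute is usually computed at query time rather than stored; a sketch assuming the STUDENT table has a DOB column of type DATE (exact date arithmetic is DBMS-specific):

-- Age is not stored; it is derived from DOB when queried (approximate age)
SELECT Roll_No,
       Name,
       EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM DOB) AS Age
FROM STUDENT;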
2. Binary Relationship –
When TWO entity sets participate in a relationship, the relationship is called a
binary relationship. For example, Student is enrolled in Course.
3. n-ary Relationship –
When n entity sets participate in a relationship, the relationship is called an n-ary relationship.
Cardinality:
The number of times an entity of an entity set participates in a relationship set is
known as cardinality. Cardinality can be of different types:
1. One to one – When each entity in each entity set can take part only once in the
relationship, the cardinality is one to one. Let us assume that a male can marry
only one female and a female can marry only one male. Then the relationship will be
one to one.
2. Many to one – When entities in one entity set can take part only once in the
relationship set and entities in other entity set can take part more than once
in the relationship set, cardinality is many to one. Let us assume that a student
can take only one course but one course can be taken by many students. So the
cardinality will be n to 1. It means that for one course there can be n students
but for one student, there will be only one course.
3. Many to many – When entities in all entity sets can take part more than once
in the relationship, the cardinality is many to many. Let us assume that a student
can take more than one course and one course can be taken by many students
(see the sketch below).
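A many-to-many cardinality is usually realized with a separate relationship table; a minimal sketch (table and column names assumed):

CREATE TABLE STUDENT (Stud_No INT PRIMARY KEY, Name VARCHAR(50));
CREATE TABLE COURSE (Course_Id INT PRIMARY KEY, Title VARCHAR(50));

-- ENROLLED relates students and courses: one student may appear with many
-- courses and one course with many students (many to many)
CREATE TABLE ENROLLED (
    Stud_No   INT REFERENCES STUDENT (Stud_No),
    Course_Id INT REFERENCES COURSE (Course_Id),
    PRIMARY KEY (Stud_No, Course_Id)
);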
Participation Constraint:
Participation Constraint is applied on the entity participating in the relationship set.
1. Total Participation – Each entity in the entity set must participate in the
relationship. If each student must enroll in a course, the participation of student
will be total. Total participation is shown by double line in ER diagram.
2. Partial Participation – The entity in the entity set may or may NOT participate
in the relationship.
Enhanced ER Model
As the complexity of data increases, it becomes more and more difficult to use
the traditional ER model for database modeling. To reduce this complexity of
modeling, enhancements were made to the existing ER model so that it can handle
complex applications in a better way.
Relational Model
Relational Model represents how data is stored in Relational Databases. A
relational database stores data in the form of relations (tables).
SUPER KEYS:
Any set of attributes that allows us to identify unique rows (tuples) in a given
relation is known as a super key. Out of these super keys we can always choose the
minimal ones, known as candidate keys, one of which is used as the primary key.
If a combination of two or more attributes is used as the primary key, we call it a
composite key.
Candidate Key: The minimal set of attributes which can uniquely identify a tuple is
known as a candidate key. For example, STUD_NO in the STUDENT relation.
• The value of Candidate Key is unique and non-null for every tuple.
• There can be more than one candidate key in a relation. For Example, STUD_NO
is candidate key for relation STUDENT.
Primary Key: There can be more than one candidate key in relation out of which
one can be chosen as the primary key.
Alternate Key: The candidate key other than the primary key is called an alternate
key.
Foreign Key: If an attribute can only take values which are present as values of
some other attribute, it is a foreign key to the attribute it refers to. The
relation being referenced is called the referenced relation and the corresponding
attribute the referenced attribute; the relation which refers to the referenced
relation is called the referencing relation and the corresponding attribute the
referencing attribute. The referenced attribute of the referenced relation should be
its primary key. For example, STUD_NO in STUDENT_COURSE is a foreign key
referring to STUD_NO in the STUDENT relation.
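A sketch of how these keys might be declared for the STUDENT and STUDENT_COURSE relations mentioned above (the extra columns are assumed for illustration):

CREATE TABLE STUDENT (
    STUD_NO    INT PRIMARY KEY,       -- candidate key chosen as the primary key
    STUD_EMAIL VARCHAR(100) UNIQUE,   -- another candidate key, here an alternate key
    STUD_NAME  VARCHAR(50)
);

CREATE TABLE STUDENT_COURSE (
    STUD_NO   INT,
    COURSE_NO INT,
    PRIMARY KEY (STUD_NO, COURSE_NO),                   -- composite key
    FOREIGN KEY (STUD_NO) REFERENCES STUDENT (STUD_NO)  -- referencing attribute
);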
ANOMALIES
An anomaly is an irregularity, or something which deviates from the expected or
normal state. When designing databases, we identify three types of
anomalies: Insert, Update and Delete.
Insertion Anomaly in Referencing Relation:
We cannot insert a row into the REFERENCING RELATION if the referencing attribute's
value is not present among the values of the referenced attribute.
Deletion/Updation Anomaly in Referenced Relation:
We cannot delete or update a row of the REFERENCED RELATION if the value of the
REFERENCED ATTRIBUTE is used as a value of the REFERENCING ATTRIBUTE.
ON DELETE CASCADE: It will delete the tuples from the REFERENCING RELATION
if the value used by the REFERENCING ATTRIBUTE is deleted from the REFERENCED RELATION.
ON UPDATE CASCADE: It will update the REFERENCING ATTRIBUTE in the REFERENCING
RELATION if the attribute value used by the REFERENCING ATTRIBUTE is updated in the
REFERENCED RELATION.
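As a sketch, these cascade actions are declared on the foreign key of the referencing relation (assuming the STUDENT table sketched earlier):

CREATE TABLE STUDENT_COURSE (
    STUD_NO   INT,
    COURSE_NO INT,
    FOREIGN KEY (STUD_NO) REFERENCES STUDENT (STUD_NO)
        ON DELETE CASCADE   -- deleting a student also deletes their STUDENT_COURSE rows
        ON UPDATE CASCADE   -- changing STUD_NO in STUDENT propagates here
);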
SQL JOIN
An SQL JOIN is used to combine data from two or more tables, based on a common field
between them.
Note: the INNER keyword is optional; a plain JOIN is treated as an INNER JOIN.
What is the difference between inner join and outer join?
Outer Join is of 3 types
1) Left outer join
2) Right outer join
3) Full Join
1) Left outer join returns all rows of the table on the left side of the join. For rows that
have no matching row on the right side, the result contains NULL in the right-side columns.
2) Right Outer Join is similar to Left Outer Join (Right replaces Left everywhere)
3) Full Outer Join Contains results of both Left and Right outer joins.
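A rough sketch using two assumed tables, STUDENT(STUD_NO, NAME) and STUDENT_COURSE(STUD_NO, COURSE_NO):

-- INNER JOIN (a plain JOIN means the same): only matching rows
SELECT s.NAME, sc.COURSE_NO
FROM STUDENT s
INNER JOIN STUDENT_COURSE sc ON s.STUD_NO = sc.STUD_NO;

-- LEFT OUTER JOIN: every student; NULL course columns when there is no match
SELECT s.NAME, sc.COURSE_NO
FROM STUDENT s
LEFT OUTER JOIN STUDENT_COURSE sc ON s.STUD_NO = sc.STUD_NO;

-- FULL OUTER JOIN: union of the left and right outer join results
-- (not supported by every DBMS, e.g. MySQL lacks FULL OUTER JOIN)
SELECT s.NAME, sc.COURSE_NO
FROM STUDENT s
FULL OUTER JOIN STUDENT_COURSE sc ON s.STUD_NO = sc.STUD_NO;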
An attribute that is not part of any candidate key is known as non-prime attribute.
An attribute that is a part of one of the candidate keys is known as prime attribute.
A relation is in first normal form if every attribute in that relation is a single-valued
attribute.
Transitive dependency – If A->B and B->C are two FDs then A->C is called transitive
dependency.
A relation R is in BCNF if R is in Third Normal Form and for every FD, LHS is super
key. A relation is in BCNF iff in every non-trivial functional dependency X –> Y, X is a
super key.
1. BCNF is free from redundancy.
2. If a relation is in BCNF, then 3NF is also satisfied.
3. If all attributes of relation are prime attribute, then the relation is always in
3NF.
In a lossless decomposition we select the common attribute, and the criterion for
selecting the common attribute is that it must be a candidate key or super key in
either of the relations R1/R2 or in both.
Lossy decomposition: The decompositions R1, R2, ..., Rn of a relation schema R are
said to be lossy if their natural join results in the addition of extraneous (spurious) tuples
with respect to the original relation R.
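A small worked sketch (relation and FDs assumed): take R(STUD_NO, STUD_NAME, COURSE_NO) with the FD STUD_NO -> STUD_NAME. Decomposing on the common attribute STUD_NO, which is a key of R1, is lossless: joining R1 and R2 back on STUD_NO reproduces exactly the original tuples of R.

-- R1 keeps the facts determined by STUD_NO
CREATE TABLE R1 (STUD_NO INT PRIMARY KEY, STUD_NAME VARCHAR(50));
-- R2 keeps the remaining attribute together with the common attribute
CREATE TABLE R2 (STUD_NO INT, COURSE_NO INT);

-- Natural join on the common attribute recovers the original relation R
SELECT R2.STUD_NO, R1.STUD_NAME, R2.COURSE_NO
FROM R1 JOIN R2 ON R1.STUD_NO = R2.STUD_NO;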
A transaction is a single logical unit of work which accesses and possibly modifies the
contents of a database. Transactions access data using read and write operations.
Properties of transactions:
Atomicity
By this, we mean that either the entire transaction takes place at once or doesn’t
happen at all. There is no midway i.e. transactions do not occur partially. Each
transaction is considered as one unit and either runs to completion or is not executed
at all. It involves the following two operations.
—Abort: If a transaction aborts, changes made to database are not visible.
—Commit: If a transaction commits, changes made are visible.
Consistency
This means that integrity constraints must be maintained so that the database is
consistent before and after the transaction. It refers to the correctness of a database.
Isolation
This property ensures that multiple transactions can occur concurrently without
leading to the inconsistency of database state. Transactions occur independently
without interference. Changes occurring in a particular transaction will not be visible
to any other transaction until that particular change in that transaction is written to
memory or has been committed. This property also ensures that executing
transactions concurrently results in a state that is equivalent to a state achieved
had they been executed serially in some order.
Durability:
This property ensures that once the transaction has completed execution, the updates
and modifications to the database are stored in and written to disk and they persist
even if a system failure occurs. These updates now become permanent and are stored
in non-volatile memory.
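A minimal sketch of a transaction (the ACCOUNT table is illustrative; transaction-control syntax varies slightly between DBMSs):

-- Transfer 100 from account 1 to account 2 as one logical unit of work
BEGIN;  -- START TRANSACTION in some DBMSs

UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACC_NO = 1;
UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ACC_NO = 2;

COMMIT;      -- makes both updates durable and visible
-- ROLLBACK; -- issued instead of COMMIT, it undoes both updates (atomicity)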
Log-Based Recovery in DBMS
Because all database modifications must be preceded by the creation of a log record, the
system has available both the old value prior to the modification of the data item and the
new value that is to be written for the data item.
1. Undo: using a log record, sets the data item specified in the log record to its old value.
2. Redo: using a log record, sets the data item specified in the log record to its new value.
The database can be modified using two approaches –
1. Deferred Modification Technique: If the transaction does not modify the
database until it has partially committed, it is said to use deferred modification
technique.
2. Immediate Modification Technique: If database modifications occur while the
transaction is still active, it is said to use the immediate modification technique.
Use of Checkpoints –
When a system crash occurs, the log must be consulted to determine which transactions
need to be redone and which need to be undone. In principle, the entire log would have to
be searched to determine this information. There are two major difficulties with this
approach:
1. The search process is time-consuming.
2. Most of the transactions that, according to our algorithm, need to be redone
have already written their updates into the database. Although redoing them
will cause no harm, it will cause recovery to take longer.
What is a Checkpoint ?
The checkpoint is used to declare a point before which the DBMS was in the
consistent state, and all transactions were committed. During transaction execution,
such checkpoints are traced. After execution, transaction log files will be created.
Dirty Reads –
A dirty read occurs when a transaction is allowed to read a row that has been modified by
another transaction which has not yet committed. It mainly occurs because multiple
transactions run at the same time, and changes that may never be committed become visible.
1. Conflict Serializable:
A schedule is called conflict serializable if it can be transformed into a serial
schedule by swapping non-conflicting operations. Two operations are said to be
conflicting if all of the following conditions are satisfied:
• They belong to different transactions
• They operate on the same data item
• At Least one of them is a write operation
2. View Serializable:
A schedule is called view serializable if it is view equivalent to a serial schedule (one
with no overlapping transactions), i.e. equivalent to some serial arrangement of the
transactions. Every conflict-serializable schedule is also view serializable. A schedule
that is view serializable but not conflict serializable must contain blind writes; if there
are no blind writes and the schedule is not conflict serializable, then it is not view
serializable either.
View equivalence conditions:
Let S1 and S2 be schedules, and let T1, Ti, Tj be transactions in those schedules.
1) Initial Read
If a transaction T1 reads data item A from the database in S1, then in S2 T1 should
also read A from the database.
2) Updated Read
If Ti reads A which was updated by Tj in S1, then in S2 Ti should also read A as
updated by Tj.
3) Final Write operation
If a transaction T1 performs the final write on A in S1, then in S2 T1 should also
perform the final write operation on A.
Cascading schedule : When there is a failure in one transaction and this leads to the
rolling back or aborting other dependent transactions, then such scheduling is
referred to as Cascading rollback or cascading abort
Cascadeless Schedule:
Schedules in which transactions read values only after all transactions whose changes
they are going to read have committed are called cascadeless schedules. This avoids a
single transaction abort leading to a series of transaction rollbacks.
Transaction Isolation Levels in DBMS
1. Read Uncommitted – Read Uncommitted is the lowest isolation level. In this
level, one transaction may read not yet committed changes made by other
transaction, thereby allowing dirty reads. In this level, transactions are not
isolated from each other.
2. Read Committed – This isolation level guarantees that any data read was
committed at the moment it is read. Thus it does not allow dirty reads.
3. Repeatable Read – This is a more restrictive isolation level. The transaction
holds read locks on all rows it references and write locks on all rows it inserts,
updates, or deletes. Since other transactions cannot read, update or delete these
rows, it avoids non-repeatable reads.
4. Serializable – This is the highest isolation level. A serializable execution is
guaranteed to be equivalent to some serial execution of the transactions.
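The isolation level is typically chosen per transaction; a sketch in standard SQL syntax (the statement and the levels actually supported vary by DBMS):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;  -- dirty reads possible
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;    -- no dirty reads
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;   -- no non-repeatable reads
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;      -- equivalent to some serial order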
Database recovery:
• From Backup
• From checkpoint
• From the logs
• Undo the transactions
• Caching/Buffering
Starvation or Livelock is the situation when a transaction has to wait for an indefinite
period of time to acquire a lock.
Reasons for Starvation –
• The waiting scheme for locked items is unfair (e.g., a priority queue).
Solutions:
• Increasing the priority of waiting transactions
• Using a different algorithm, such as FCFS or another non-priority-based scheme.
SQL vs NoSQL
SQL: These databases are not suited for hierarchical data storage but are best suited for
complex queries.
NoSQL: These databases are best suited for hierarchical data storage but are not so good
for complex queries.
INDEXING
Indexing is a data structure technique which allows you to quickly retrieve
records from a database file. An index is a small table having only two columns.
The first column comprises a copy of the primary or candidate key of a table. The
second column contains a set of pointers holding the addresses of the disk blocks
where that specific key value is stored.
Indexing helps to reduce the total number of I/O operations needed to retrieve
data: rows are located through the index structure instead of by scanning the table.
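Apart from the index the DBMS builds for the primary key, indexes are created explicitly; a sketch (table and column names assumed):

-- Speeds up lookups such as: SELECT * FROM STUDENT WHERE STUD_NAME = 'Asha';
CREATE INDEX idx_student_name ON STUDENT (STUD_NAME);

-- A composite index over more than one column
CREATE INDEX idx_enrollment ON STUDENT_COURSE (COURSE_NO, STUD_NO);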
Primary Index
A primary index is an ordered file of fixed-length records with two fields. The first
field is the same as the primary key and the second field points to that specific data
block. In the primary index, there is always a one-to-one relationship between the
entries in the index table and the records in the data file.
Primary indexing in DBMS is further divided into two types.
• Dense Index : an index record is created for every search key value in the
database. This helps you to search faster but needs more space to store
index records.
• Sparse Index : an index record appears for only some of the values in the
file. In this indexing technique, a pointer points to a key in a block, and
using that block address the other keys in the block can be obtained.
Clustering Index
A clustering index is used when the data file is ordered on a non-key field, i.e. a field
with repetitive entries. Records with similar characteristics are grouped together and
indexes are created for these groups. A sparse index table is used.
Secondary Index
Here the actual data (like the information on each page of a book) is not ordered, but we
have an ordered reference (like a contents page) to where the data actually lies.
Only dense indexing can be used, as the data is not ordered, so it requires more time
than other indexing methods.
Multilevel indexing in a database is used when a primary index does not fit in
memory. The index is divided into several smaller indexes, and a sparse index is
maintained on top to refer to them.
B-Tree is a self-balancing search tree. In most of the other self-balancing search trees
(like AVL and Red-Black Trees), it is assumed that everything is in main memory. To
understand the use of B-Trees, we must think of the huge amount of data that cannot
fit in main memory. When the number of keys is high, the data is read from disk in
the form of blocks.
The main idea of using B-Trees is to reduce the number of disk accesses. Most of the
tree operations (search, insert, delete, max, min, ..etc ) require O(h) disk accesses
where h is the height of the tree.
A B-Tree is a fat tree. The height of a B-Tree is kept low by putting the maximum
possible number of keys in each B-Tree node.
Generally, the B-Tree node size is kept equal to the disk block size.
WHY B+ Trees
A B-Tree stores the data pointer (a pointer to the disk file block containing the key
value), corresponding to a particular key value, along with that key value in each node
of the tree. This technique greatly reduces the number of entries that can be packed
into a node of a B-tree. A B+ tree eliminates this drawback by storing data
pointers only at the leaf nodes of the tree.