Normalization -
Normalization is a process to reduce redundancy, ensure data integrity using functional dependency.
Databases which are not normalized frequently creates problem like Insertion anomaly, Deletion anomaly and modification anomaly of data and hence creates inconsistency.
In simple words, a single table should store single idea/information.
To understand normalization, it is important to understand the concept of Functional dependency and keys.
Functional Dependency -
Lets assume there is a relation Alpha -> Beta in a relational table.
If the value of alpha is known, using that value you can search the value of beta. This is functional dependency.
For the same value of Alpha, if you get different values of Beta, then it is not functional dependency.
Closure set of attributes -
This means what all values can be searched for the given value.
Keys -
Super key - super keys or keys a set of attributes used to uniquely identify a row among a set of rows.Candidate key - Minimal super key is candidate key.
steps to find a candidate key
1) If it is a super key
2) If any proper subset is a super key, if yes then it cannot be a candidate key, if no then it is candidate key.
In other words, if proper subset of a super key if not a super key is a candidate key.
Primary key - candidate key chosen to identify a row uniquely is called Primary key.
A simple example. Consider the relation below
a -> bcd
ab -> cd
abc -> d
bd -> ab
super key - a, ab, abc, bd
candidate key - a, bd
There are 4 kinds of Normalized forms.
1 NF -
A table is said to be in 1NF if every column contains only single attribute. 1NF does not allow multi value attribute in any column.1NF increases data redundancy as the number of rows with same column increases. Each row as a whole is unique.
2NF -
R (ab cd) lets say ab is candidate key
c is dependent on b
d is dependent on ab
prime attribute : if ab is candidate key, both a and b are prime attributes
non-prime attributes : attributes which are not a part of candidate key, c and d are non prime attributes.
c is dependent on a part of candidate key. This is called partial dependency.
Non prime attribute depends on a part of candidate key.
A table is said to be in 2NF , if it is in 1NF and there exists no partial dependency.
Issue with 2NF : primary key cant be null . so ab in together cannot be null, but b can be null with a having a valid value. This creates a problem, you cant find the value of c if b is null and a has a valid value.
3 NF -
Transitive dependency -
consider below relation
R(ab cd)
ab -> c
c-> d
A non prime attribute depends on another non prime attribute. This is transitive dependency.
A table is said to be in 3NF if it is in 2NF and there exists no transitive dependency.
That means there should not be any partial and transitive dependency.
Issue with 3NF : If c becomes null, you cant find the value of d.
BCNF -
BCNF stands for Boyce-codd normal formwhen a non prime attribute is derived from a prime/non prime attribute. consider below example
R (abc)
ab -> c
c-> b
for every dependency a -> b
a should be super key, then b can be ignored.
More details are here