Database Normalization

Database normalization is the process, in relational design, of organizing data to minimize redundancy. It is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics (insertion, update, and deletion anomalies) that could lead to a loss of data integrity. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.

Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the first normal form (1NF) in 1970. Codd went on to define the second normal form (2NF) and third normal form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd normal form (BCNF) in 1974. Higher normal forms were defined by other theorists in subsequent years, the most recent being the sixth normal form (6NF), introduced by Chris Date, Hugh Darwen, and Nikos Lorentzos in 2002.

Functional dependency

In a given table, an attribute Y is said to have a functional dependency on a set of attributes X (written X → Y) if and only if each X value is associated with precisely one Y value. For example, in an "Employee" table that includes the attributes "Employee ID" and "Employee Date of Birth", the functional dependency {Employee ID} → {Employee Date of Birth} would hold.
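The definition can be tested against sample rows. The following is a minimal sketch, not a definitive implementation: the function name `fd_holds`, the column names, and the sample data are all hypothetical. Note that a check on sample rows can only refute a functional dependency; the dependency itself is a constraint on every possible state of the table.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x appears and returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical sample data for an "Employee" table
employees = [
    {"employee_id": 1, "date_of_birth": "1980-02-01", "skill": "typing"},
    {"employee_id": 1, "date_of_birth": "1980-02-01", "skill": "filing"},
    {"employee_id": 2, "date_of_birth": "1975-07-19", "skill": "typing"},
]

print(fd_holds(employees, ["employee_id"], ["date_of_birth"]))  # True
print(fd_holds(employees, ["employee_id"], ["skill"]))          # False
```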

Trivial functional dependency

A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.

Full functional dependency

An attribute is fully functionally dependent on a set of attributes X if it is

• functionally dependent on X, and
• not functionally dependent on any proper subset of X.

{Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.
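The two conditions can be checked mechanically from a set of declared dependencies. This is a sketch under stated assumptions: the dependencies are given as (left side, right side) pairs of sets, the helper names `closure` and `fully_dependent` are hypothetical, and the single declared dependency {Employee ID} → {Employee Address} is taken from the example above.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes determined by attrs under the declared dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fully_dependent(attr, x, fds):
    """True if x -> attr holds and no proper subset of x determines attr."""
    x = frozenset(x)
    if attr not in closure(x, fds):
        return False
    for size in range(len(x)):  # sizes 0 .. len(x)-1 give the proper subsets
        for subset in combinations(x, size):
            if attr in closure(subset, fds):
                return False
    return True


# Assumed dependency from the example: {Employee ID} -> {Employee Address}
fds = [({"employee_id"}, {"employee_address"})]

print(fully_dependent("employee_address", {"employee_id", "skill"}, fds))  # False
print(fully_dependent("employee_address", {"employee_id"}, fds))           # True
```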

Transitive dependency

A transitive dependency is an indirect functional dependency, one in which X → Z only by virtue of X → Y and Y → Z.

Multivalued dependency

A multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows.

Join dependency

A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.

Superkey

A superkey is a combination of attributes that can be used to uniquely identify a database record. A table might have many superkeys.

Candidate key

A candidate key is a minimal superkey: a superkey that contains no extraneous attributes, so that removing any attribute from it would destroy its ability to identify records uniquely.

Examples: Imagine a table with the fields , , and . This table has many possible superkeys. Three of these are , and . Of those listed, only is a candidate key, as the others contain information not necessary to uniquely identify records
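The distinction between superkeys and candidate keys can be illustrated on sample data. The sketch below uses a hypothetical table (`badge_no`, `name`, `desk`) that is not taken from the text, and the helper names are also assumptions. It tests uniqueness only within the sample rows, whereas a real key is a constraint on all possible rows.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two sample rows share the same values for attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)


def candidate_keys(rows, all_attrs):
    """Minimal superkeys of the sample, found smallest-first."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            # Skip any attribute set that contains an already-found key
            if any(set(k) <= set(attrs) for k in keys):
                continue
            if is_superkey(rows, attrs):
                keys.append(attrs)
    return keys


# Hypothetical sample data
staff = [
    {"badge_no": 1, "name": "Ann", "desk": "A1"},
    {"badge_no": 2, "name": "Bob", "desk": "A1"},
    {"badge_no": 3, "name": "Ann", "desk": "B2"},
    {"badge_no": 4, "name": "Ann", "desk": "A1"},
]

print(is_superkey(staff, ["badge_no", "name"]))              # True, but not minimal
print(candidate_keys(staff, ["badge_no", "name", "desk"]))   # [('badge_no',)]
```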

Non-prime attribute

A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.

Primary key

Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.

1NF

A relation R is in first normal form (1NF) if and only if all underlying domains contain atomic values only.
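As an illustration of atomic values, the sketch below flattens a repeating group into one row per value. The record layout, the function name `to_1nf`, and the sample data are hypothetical; the column names follow the supplier example used in these notes.

```python
# Hypothetical unnormalized records: "parts" holds a repeating group,
# so the column does not contain a single atomic value.
unnormalized = [
    {"supplier_no": "S1", "parts": [("P1", 300), ("P2", 200)]},
    {"supplier_no": "S2", "parts": [("P1", 300)]},
]


def to_1nf(records):
    """Emit one row per (supplier, part) so every column holds one value."""
    return [
        {"supplier_no": rec["supplier_no"], "part_no": part_no, "quantity": qty}
        for rec in records
        for part_no, qty in rec["parts"]
    ]


rows = to_1nf(unnormalized)
print(len(rows))  # 3
```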

Example: 1NF but not 2NF

FIRST (supplier_no, status, city, part_no, quantity)

Functional Dependencies:

(supplier_no, part_no) → quantity

(supplier_no) → status

(supplier_no) → city

city → status (supplier's status is determined by location)

Comments:

Non-key attributes are not mutually independent (city → status). Non-key attributes are not fully functionally dependent on the primary key (i.e., status and city are dependent on just part of the key, namely supplier_no).

Anomalies:

INSERT: We cannot enter the fact that a given supplier is located in a given city until that supplier supplies at least one part (otherwise, we would have to enter a null value for a column participating in the primary key, which violates the definition of a relation).

DELETE: If we delete the last (only) row for a given supplier, we lose the information that the supplier is located in a particular city.

UPDATE: The city value appears many times for the same supplier. This can lead to inconsistency or the need to change many values of city if a supplier moves.

Decomposition (into 2NF):

SECOND (supplier_no, status, city)

SUPPLIER_PART (supplier_no, part_no, quantity)
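A decomposition is only useful if the original relation can be recovered by joining the parts. The following sketch checks this on sample data by projecting FIRST onto the two new schemas and taking their natural join. The helper names and the sample rows are hypothetical, and a check on one sample does not prove the property for all possible data.

```python
def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate rows."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(left, right):
    """Combine every pair of rows that agree on their shared columns."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in set(l) & set(r)):
                out.append({**l, **r})
    return out


def as_set(rows):
    """Order-independent view of a list of rows, for comparison."""
    return {frozenset(row.items()) for row in rows}


# Hypothetical contents of FIRST
first = [
    {"supplier_no": "S1", "status": 20, "city": "London", "part_no": "P1", "quantity": 300},
    {"supplier_no": "S1", "status": 20, "city": "London", "part_no": "P2", "quantity": 200},
    {"supplier_no": "S2", "status": 10, "city": "Paris", "part_no": "P1", "quantity": 300},
]

second = project(first, ["supplier_no", "status", "city"])
supplier_part = project(first, ["supplier_no", "part_no", "quantity"])

print(len(second))  # 2: each supplier's city and status is now stored once
print(as_set(natural_join(second, supplier_part)) == as_set(first))  # True
```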

2NF

A relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key.

Example (2NF but not 3NF):

SECOND (supplier_no, status, city)

Functional Dependencies:

supplier_no → status

supplier_no → city

city → status

Comments:

Lacks mutual independence among non-key attributes.

Mutual dependence is reflected in the transitive dependencies: supplier_no → city, city → status.

Anomalies:

INSERT: We cannot record that a particular city has a particular status until we have a supplier in that city.

DELETE: If we delete a supplier which happens to be the last row for a given city value, we lose the fact that the city has the given status.

UPDATE: The status for a given city occurs many times, so a change requires multiple updates and risks loss of consistency.

Decomposition (into 3NF):

SUPPLIER_CITY (supplier_no, city)

CITY_STATUS (city, status)
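The effect of this decomposition on the INSERT and UPDATE anomalies can be tried out with Python's standard `sqlite3` module. This is a sketch under assumptions: the column types, the foreign key from supplier_city to city_status, and the sample values are not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE city_status (city TEXT PRIMARY KEY, status INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE supplier_city ("
    " supplier_no TEXT PRIMARY KEY,"
    " city TEXT NOT NULL REFERENCES city_status(city))"  # assumed foreign key
)

# INSERT: a city's status can be recorded before any supplier is located there.
conn.execute("INSERT INTO city_status VALUES ('Athens', 30)")
conn.execute("INSERT INTO city_status VALUES ('London', 20)")
conn.executemany(
    "INSERT INTO supplier_city VALUES (?, ?)",
    [("S1", "London"), ("S4", "London")],
)

# UPDATE: changing a city's status touches exactly one row.
cur = conn.execute("UPDATE city_status SET status = 25 WHERE city = 'London'")
updated_rows = cur.rowcount

rows = conn.execute(
    "SELECT s.supplier_no, c.status"
    " FROM supplier_city s JOIN city_status c ON c.city = s.city"
    " ORDER BY s.supplier_no"
).fetchall()

print(updated_rows)  # 1
print(rows)          # [('S1', 25), ('S4', 25)]
```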

3NF

A relation R is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. An attribute C is transitively dependent on attribute A if there exists an attribute B such that A → B and B → C. Note that 3NF is concerned with transitive dependencies which do not involve candidate keys. A 3NF relation with more than one candidate key will clearly have transitive dependencies of the form primary_key → other_candidate_key → any_non-key_column.

An alternative (and equivalent) definition for relations with just one candidate key is:

A relation R having just one candidate key is in third normal form (3NF) if and only if the non-key attributes of R (if any) are: 1) mutually independent, and 2) fully dependent on the primary key of R. A non-key attribute is any column which is not part of the primary key. Two or more attributes are mutually independent if none of the attributes is functionally dependent on any of the others. Attribute Y is fully functionally dependent on attribute X if X → Y, but Y is not functionally dependent on any proper subset of the (possibly composite) attribute X.

For relations with just one candidate key, this is equivalent to the simpler:

A relation R having just one candidate key is in third normal form (3NF) if and only if no non-key column (or group of columns) determines another non-key column (or group of columns).
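The last formulation suggests a mechanical check. The sketch below flags every declared dependency whose left side is not a superkey and whose right side contains a non-key attribute. It uses the common textbook formulation of 3NF rather than any one of the wordings above, so it also reports the partial dependencies that violate 2NF. The function names are hypothetical, and the candidate keys are supplied by the caller rather than computed.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the declared dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def third_nf_violations(all_attrs, fds, candidate_keys):
    """Return (determinant, offending attributes) pairs that break 3NF."""
    prime = set().union(*candidate_keys)  # attributes appearing in some key
    violations = []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= set(all_attrs):
            continue  # the determinant is a superkey, so nothing to report
        bad = set(rhs) - set(lhs) - prime  # non-trivial, non-key targets
        if bad:
            violations.append((tuple(sorted(lhs)), tuple(sorted(bad))))
    return violations


# SECOND (supplier_no, status, city) with its three dependencies
second_fds = [
    ({"supplier_no"}, {"status"}),
    ({"supplier_no"}, {"city"}),
    ({"city"}, {"status"}),
]
print(third_nf_violations(
    ["supplier_no", "status", "city"], second_fds, [{"supplier_no"}]
))  # [(('city',), ('status',))]

# The decomposed relations report no violations
print(third_nf_violations(
    ["supplier_no", "city"], [({"supplier_no"}, {"city"})], [{"supplier_no"}]
))  # []
print(third_nf_violations(
    ["city", "status"], [({"city"}, {"status"})], [{"city"}]
))  # []
```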

Example (3NF but not BCNF):

SUPPLIER_PART (supplier_no, supplier_name, part_no, quantity)

Functional Dependencies:

We assume that each supplier_name is unique to one supplier. Thus we have two candidate keys:

(supplier_no, part_no) and (supplier_name, part_no)

Thus we have the following dependencies:

(supplier_no, part_no) → quantity

(supplier_no, part_no) → supplier_name

(supplier_name, part_no) → quantity

(supplier_name, part_no) → supplier_no

supplier_name → supplier_no

supplier_no → supplier_name

Comments:

Although supplier_name → supplier_no (and vice versa), supplier_no is not a non-key column; it is part of the primary key. Hence this relation technically satisfies the definition(s) of 3NF (and likewise 2NF, again because supplier_no is not a non-key column).

Anomalies:

INSERT: We cannot record the name of a supplier until that supplier supplies at least one part.

DELETE: If a supplier temporarily stops supplying and we delete the last row for that supplier, we lose the supplier's name.

UPDATE: If a supplier changes name, that change will have to be made to multiple rows (wasting resources and risking loss of consistency).

Decomposition (into BCNF):

SUPPLIER_ID (supplier_no, supplier_name)

SUPPLIER_PARTS (supplier_no, part_no, quantity)

BCNF

A relation R is in Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.

The definition of BCNF addresses certain (rather unlikely) situations which 3NF does not handle. The characteristics of a relation which distinguish 3NF from BCNF are given below. Since it is so unlikely that a relation would have these characteristics, in practical real-life design it is usually the case that relations in 3NF are also in BCNF. Thus many authors make a "fuzzy" distinction between 3NF and BCNF when it comes to giving advice on "how far" to normalize a design. Since relations in 3NF but not in BCNF are slightly unusual, it is a bit more difficult to come up with meaningful examples. To be precise, the definition of 3NF does not deal with a relation that:

1. has multiple candidate keys, where
2. those candidate keys are composite, and
3. the candidate keys overlap (i.e., have at least one common attribute)
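The BCNF condition can be tested from a list of declared dependencies. The sketch below uses the common equivalent formulation that the left side of every non-trivial dependency must be a superkey; the function names are hypothetical, and the dependencies are those of the SUPPLIER_PART relation discussed under 3NF.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the declared dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Return the determinants of non-trivial dependencies that are not superkeys."""
    violations = []
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):
            continue  # trivial dependency
        if not closure(lhs, fds) >= set(all_attrs):
            violations.append(tuple(sorted(lhs)))
    return violations


# SUPPLIER_PART (supplier_no, supplier_name, part_no, quantity)
supplier_part_fds = [
    ({"supplier_no", "part_no"}, {"quantity"}),
    ({"supplier_no"}, {"supplier_name"}),
    ({"supplier_name"}, {"supplier_no"}),
]
print(bcnf_violations(
    ["supplier_no", "supplier_name", "part_no", "quantity"], supplier_part_fds
))  # [('supplier_no',), ('supplier_name',)]

# After decomposition, both relations pass the check
supplier_id_fds = [
    ({"supplier_no"}, {"supplier_name"}),
    ({"supplier_name"}, {"supplier_no"}),
]
print(bcnf_violations(["supplier_no", "supplier_name"], supplier_id_fds))  # []
print(bcnf_violations(
    ["supplier_no", "part_no", "quantity"],
    [({"supplier_no", "part_no"}, {"quantity"})],
))  # []
```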

Example:

An example of a relation in 3NF but not in BCNF (and exhibiting the three properties listed) was given above in the discussion of 3NF. The following relation is in BCNF (and also in 3NF):

SUPPLIERS (supplier_no, supplier_name, city, zip)

We assume that each supplier has a unique supplier_name, so that supplier_no and supplier_name are both candidate keys.

Functional Dependencies:

supplier_no → city

supplier_no → zip

supplier_no → supplier_name

supplier_name → city

supplier_name → zip

supplier_name → supplier_no

Comments:

The relation is in BCNF since both determinants (supplier_no and supplier_name) are unique (i.e., are candidate keys).

The relation is also in 3NF since even though the non-primary-key column supplier_name determines the non-key columns city and zip, supplier_name is a candidate key. Transitive dependencies involving a second (or third, fourth, etc.) candidate key in addition to the primary key do not violate 3NF.

Note that even relations in BCNF can have anomalies.

Anomalies:

INSERT: We cannot record the city for a supplier_no without also knowing the supplier_name.

DELETE: If we delete the row for a given supplier_name, we lose the information that the supplier_no is associated with a given city.

UPDATE: Since supplier_name is a candidate key (unique), there are none.

Decomposition:

SUPPLIER_INFO (supplier_no, city, zip)

SUPPLIER_NAME (supplier_no, supplier_name)

Larry Newcomer (Updated January 06, 2000)