Database normalization
In relational database theory, normalization is the process of restructuring the logical data model of a database to eliminate redundancy, organize data efficiently, and reduce the potential for anomalies during data operations. Data normalization may also improve data consistency and simplify future extension of the logical data model. The formal classifications used for describing a relational database's level of normalization are called normal forms (abbreviated NF).
Description
A non-normalized database can suffer from data anomalies:
- A non-normalized database may store data representing a particular referent in multiple locations. An update to such data in some but not all of those locations results in an update anomaly, yielding inconsistent data. A normalized database prevents such an anomaly by storing such data (i.e. data other than primary keys) in only one location.
- A non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.
- Similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such databases may require deleting data from the inappropriate dependency. A normalized database prevents such deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.
Normalized databases have a design that reflects the true dependencies between tracked quantities, allowing quick updates to data with little risk of introducing inconsistencies. Instead of attempting to lump all information into one table, data is spread out logically into many tables. Normalizing the data means decomposing a single relation into a set of smaller relations that satisfy the constraints of the original relation. Decomposition removes redundancy, but it also introduces new problems of its own (for example, queries may require more joins), so normalization is a conscious decision to avoid redundancy made with these pros and cons in mind.
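To make the decomposition idea concrete, the following Python sketch (with hypothetical table and column names, not taken from any particular source) contrasts a single redundant relation with a decomposition into two smaller relations, and shows how the redundant version invites an update anomaly:

```python
# Hypothetical single-table design: the customer's city is repeated on every
# order row, so updating it in one row but not another yields inconsistent data.
orders_flat = [
    # (order_id, customer, customer_city, item)
    (1, "Alice", "Springfield", "widget"),
    (2, "Alice", "Springfield", "gadget"),
]

# Decomposed (normalized) design: each fact is stored exactly once.
customers = {"Alice": "Springfield"}                       # customer -> city
orders = [(1, "Alice", "widget"), (2, "Alice", "gadget")]  # order_id, customer, item

# An address change now touches a single location, so no update anomaly can arise.
customers["Alice"] = "Shelbyville"

# The original wide relation is still recoverable by joining on the customer.
reconstructed = [(oid, cust, customers[cust], item) for oid, cust, item in orders]
print(reconstructed)
```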
One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an "idealized schema". (The mathematical support came back into play in proofs regarding the process of transforming from one form to another.)
The transformation of a conceptual model into a computer-representable format is also known as normalization.
History
Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form:
“There is, in fact, a very simple elimination* procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by "domains whose elements are atomic (nondecomposable) values."”
—Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks[1]
- * His term eliminate is misleading, as nothing is "lost" in normalization. He probably used eliminate in a mathematical sense to mean elimination of complexity.
In his paper, Codd used the term "nonsimple" domains to describe a heterogeneous data structure, but later researchers would refer to such a structure as an abstract data type.
Normal forms
Edgar F. Codd originally defined the first three normal forms. The first normal form requires that tables be made up of a primary key and a number of atomic fields, and the second and third deal with the relationship of non-key fields to the primary key. These have been summarised as requiring that all non-key fields be dependent on "the key, the whole key and nothing but the key". In practice, tables in 3NF are regarded as fully normalized for most applications. However, research has identified potential update anomalies in 3NF databases. BCNF is a further refinement of 3NF that attempts to eliminate such anomalies.
The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-many relationships. Sixth normal form (6NF) applies only to temporal databases.
First normal form
- First normal form (1NF) lays the groundwork for an organised database design (see the example following this list):
- Ensure that each table has a primary key: minimal set of attributes which can uniquely identify a record.
- Eliminate repeating groups (categories of data which would seem to be required a different number of times on different records) by defining keyed and non-keyed attributes appropriately.
- Atomicity: Each attribute must contain a single value, not a set of values.
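As an illustration, the sketch below (hypothetical names and data) shows a table whose phone column holds a non-atomic list of values, and its 1NF equivalent with one atomic value per row:

```python
# Hypothetical non-1NF design: the phone column is a comma-separated list,
# i.e. a repeating group stored as a non-atomic value.
people_non_1nf = [
    ("Bob", "555-1234, 555-9876"),
    ("Jane", "555-2468"),
]

# 1NF design: one atomic phone number per row, keyed by (name, phone).
people_1nf = [
    (name, phone.strip())
    for name, phones in people_non_1nf
    for phone in phones.split(",")
]
print(people_1nf)  # [('Bob', '555-1234'), ('Bob', '555-9876'), ('Jane', '555-2468')]
```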
Second normal form
- Second normal form (2NF) requires that, if a table has a composite key, all attributes be related to the whole key (see the example following this list):
- The database must meet all the requirements of the first normal form.
- Data which depends on only part of the composite key, and is therefore redundantly duplicated across multiple rows of the table, is moved out to a separate table.
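For example, the following sketch (hypothetical order/product data) shows a composite-key table in which product_name depends on only part of the key, and its 2NF decomposition:

```python
# Hypothetical table with composite key (order_id, product_id): product_name
# depends on product_id alone, i.e. on only part of the key, violating 2NF.
order_items_non_2nf = [
    # (order_id, product_id, product_name, quantity)
    (1, "P1", "widget", 3),
    (1, "P2", "gadget", 1),
    (2, "P1", "widget", 5),
]

# 2NF decomposition: the partial dependency product_id -> product_name is
# moved into its own table.
products = {pid: name for _, pid, name, _ in order_items_non_2nf}
order_items = [(oid, pid, qty) for oid, pid, _, qty in order_items_non_2nf]
print(products)     # {'P1': 'widget', 'P2': 'gadget'}
print(order_items)  # [(1, 'P1', 3), (1, 'P2', 1), (2, 'P1', 5)]
```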
Third normal form
- Third normal form (3NF) requires that data stored in a table be dependent only on the primary key, and not on any other field in the table (see the example following this list):
- The database must meet all the requirements of the second normal form.
- Any field which depends on another non-key field, and thus only transitively on the primary key, is moved out to a separate table.
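For example, the sketch below (hypothetical employee/department data) shows a transitive dependency on a non-key field and its 3NF decomposition:

```python
# Hypothetical table keyed by emp_id: dept_name depends on dept_id rather than
# on the key itself (a transitive dependency), violating 3NF.
employees_non_3nf = [
    # (emp_id, name, dept_id, dept_name)
    (101, "Ann", "D1", "Sales"),
    (102, "Raj", "D1", "Sales"),
    (103, "Mei", "D2", "Support"),
]

# 3NF decomposition: the dependency dept_id -> dept_name moves to its own table.
departments = {dept_id: dept_name for _, _, dept_id, dept_name in employees_non_3nf}
employees = [(eid, name, dept_id) for eid, name, dept_id, _ in employees_non_3nf]
print(departments)  # {'D1': 'Sales', 'D2': 'Support'}
print(employees)    # [(101, 'Ann', 'D1'), (102, 'Raj', 'D1'), (103, 'Mei', 'D2')]
```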
Boyce-Codd normal form
- Boyce-Codd normal form (or BCNF) requires that there be no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey).
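As a hedged illustration using the common student/course/tutor textbook scenario (the data here is invented), a table can satisfy 3NF yet violate BCNF when a non-superkey attribute determines another attribute; decomposing on that dependency restores BCNF:

```python
# Hypothetical 3NF-but-not-BCNF table: each tutor teaches exactly one course
# (tutor -> course), yet tutor is not a superkey of the relation.
enrolment = [
    # (student, course, tutor)
    ("Sam", "Maths", "Dr. A"),
    ("Sam", "Physics", "Dr. B"),
    ("Lee", "Maths", "Dr. C"),
]

# BCNF decomposition: the offending dependency tutor -> course gets its own table.
tutor_course = {tutor: course for _, course, tutor in enrolment}
student_tutor = [(student, tutor) for student, _, tutor in enrolment]
print(tutor_course)   # {'Dr. A': 'Maths', 'Dr. B': 'Physics', 'Dr. C': 'Maths'}
print(student_tutor)  # [('Sam', 'Dr. A'), ('Sam', 'Dr. B'), ('Lee', 'Dr. C')]
```

Note that, as is well known, this kind of BCNF decomposition is not always dependency-preserving: the constraint that a student takes a given course with only one tutor can no longer be enforced by keys alone in the decomposed tables.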
Fourth normal form
- Fourth normal form (or 4NF) requires that there be no non-trivial multivalued dependencies of attribute sets on anything other than a superset of a candidate key. A table is in 4NF if and only if it is in BCNF and its only non-trivial multivalued dependencies are functional dependencies. 4NF thus removes the unwanted redundancy caused by independent multivalued facts.
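For example (hypothetical data), storing two independent multivalued facts about a person in one table forces every combination of values to be recorded; 4NF splits them apart:

```python
# Hypothetical table mixing two independent multivalued facts about a person
# (phones and hobbies): every combination must be stored, which is redundant.
person_facts_non_4nf = [
    ("Bob", "555-1234", "chess"),
    ("Bob", "555-1234", "hiking"),
    ("Bob", "555-9876", "chess"),
    ("Bob", "555-9876", "hiking"),
]

# 4NF decomposition: one table per independent multivalued dependency.
person_phones = sorted({(p, phone) for p, phone, _ in person_facts_non_4nf})
person_hobbies = sorted({(p, hobby) for p, _, hobby in person_facts_non_4nf})
print(person_phones)   # [('Bob', '555-1234'), ('Bob', '555-9876')]
print(person_hobbies)  # [('Bob', 'chess'), ('Bob', 'hiking')]
```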
Fifth normal form
- Fifth normal form (5NF, also PJ/NF) requires that there be no non-trivial join dependencies that do not follow from the key constraints. A table is in 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.
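The following sketch uses invented supplier/part/project data that happens to satisfy a cyclic join dependency, so the relation can be losslessly reconstructed from its three binary projections; under 5NF the projections, rather than the three-column table, would be stored:

```python
# Invented supplier/part/project rows that satisfy the cyclic join dependency
# *(SP, PJ, JS): the relation is exactly the join of its three binary projections.
spj = {("s1", "p1", "j2"), ("s1", "p2", "j1"),
       ("s2", "p1", "j1"), ("s1", "p1", "j1")}

sp = {(s, p) for s, p, _ in spj}  # supplier-part projection
pj = {(p, j) for _, p, j in spj}  # part-project projection
js = {(j, s) for s, _, j in spj}  # project-supplier projection

# The natural join of the three projections reconstructs exactly the original rows.
rejoined = {(s, p, j)
            for s, p in sp
            for p2, j in pj if p == p2 and (j, s) in js}
assert rejoined == spj
print(sorted(rejoined))
```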
Domain/key normal form
- Domain/key normal form (or DKNF) requires that the database contains no constraints other than domain constraints and key constraints.
Sixth normal form
As of 2005 this normal form had only recently been proposed: the sixth normal form (6NF) was defined only when the relational model was extended to take the temporal dimension into account. Unfortunately, most SQL technologies as of 2005 do not take this work into account, and most temporal extensions to SQL are not relational. See work by Date, Darwen and Lorentzos[2] for a relational temporal extension, or see TSQL2 for a different approach.
Implementing tree-structured (hierarchical) data within the relational model also needs to be considered, in addition to normalization, for a useful database design.
Denormalization
Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online Analytical Processing (OLAP). OLTP applications are characterized by a high volume of small transactions such as updating a sales record at a supermarket checkout counter. The expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read only" databases. OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate Business Intelligence applications. Specifically, dimension tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema.
Denormalization is also used to improve performance on smaller computers, such as computerized cash registers. Since these use the data for look-up only (e.g. price lookups), no changes need to be made to the data, and a swift response is crucial.
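As a rough sketch (table and column names are hypothetical), a star-schema dimension table deliberately repeats attributes that a snowflake (normalized) layout would factor out:

```python
# Hypothetical star-schema dimension table: category attributes are repeated on
# every product row so analytical queries need no extra joins.
dim_product_star = [
    # (product_id, product_name, category_id, category_name)
    ("P1", "widget", "C1", "Hardware"),
    ("P2", "gadget", "C1", "Hardware"),
]

# Equivalent snowflake (normalized) layout: the category is factored out.
dim_product_snowflake = [("P1", "widget", "C1"), ("P2", "gadget", "C1")]
dim_category = {"C1": "Hardware"}
```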
Non-first normal form (NF²)
In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension introduces hierarchies in relations.
Consider the following table:
Person | Favorite Colors
-------|--------------------
Bob    | blue, red
Jane   | green, yellow, red
Assume a person has several favorite colors. The favorite colors then form a set of colors, which the table above models as a set-valued attribute.
To transform this NF² table into 1NF, an "unnest" operator is required which extends the relational algebra of the higher normal forms. The reverse operator is called "nest". Nest is not always the mathematical inverse of unnest, although unnest is the mathematical inverse of nest. A further constraint is needed for the operators to be bijective, which is covered by the Partitioned Normal Form (PNF).
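A minimal Python sketch of unnest and nest applied to the table above, assuming set-valued attributes are modeled as Python sets:

```python
# NF² table from above, with the set-valued attribute modeled as a Python set.
nf2_table = {"Bob": {"blue", "red"}, "Jane": {"green", "yellow", "red"}}

# "unnest": flatten the set-valued attribute into one row per (person, color).
flat_1nf = sorted((person, color)
                  for person, colors in nf2_table.items()
                  for color in colors)

# "nest": group the flat rows back into a set-valued attribute. Here nest undoes
# unnest, but in general it need not (for example, a row whose nested set is
# empty has no flat representation and would be lost).
renested = {}
for person, color in flat_1nf:
    renested.setdefault(person, set()).add(color)

assert renested == nf2_table
print(flat_1nf)
```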
Further reading
- Litt's Tips: Normalization
- Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
- Date, C. J. (2003). An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
- Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM, vol. 26, pp. 120-125
- Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings
- H.-J. Schek, P. Pistor: Data Structures for an Integrated Data Base Management and Information Retrieval System
References
- ^ Codd, Edgar F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377-387.
- ^ DBDebunk.
External links
- Database Normalization Basics by Mike Chapple (About.com)
- Database Normalization Intro
- An Introduction to Database Normalization by Mike Hillyer.
- Normalization by ITS, University of Texas.
- Rules of Data Normalization by Data Model.org
- A tutorial on the first 3 normal forms by Fred Coulson
- Free PDF poster available by Marc Rettig
- Description of the database normalization basics by Microsoft
See also
- Aspect (functional perspectives orthogonal to core concern)
- Core Concern (the intrinsic and inherent core functionality of a class)
- Cross-cutting Concern (common functionality orthogonal to core concern)
- Inheritance semantics
- Functional normalization (object-oriented programming)
- Orthogonalization (mathematical normalization)
- Refactoring (restructuring software)