Talk:Database normalization/Archive 1

From Wikipedia, the free encyclopedia

Archive This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Contents

Checking Page Content

I think the definition of 1NF has a problem. Non atomic entries in a row are supposed to lead to creation of additional rows, which will cause repetitions in the other fields. New tables are not created until 2NF. So I am being taught. Brent Gulanowski 22:28, 26 Oct 2003 (UTC)

Okay, how about some references then? These seem to support the article as written:


I agree that 1NF leads to new rows in a new table. I suggest the following sources, in order:
  1. http://publib.boulder.ibm.com/infocenter/ids9help/index.jsp?topic=/com.ibm.ddi.doc/ddi53.htm
  2. http://www.anaesthetist.com/mnm/sql/normal.htm
  3. http://databases.about.com/od/specificproducts/a/normalization.htm
Each of these three sources, the first 3 I reviewed, all are consistent with my own understanding. Which is:
The 1st normal form relates to the circumstance of multiple values for a single attribute. EG what are your favorite colors? Bob's favorites are blue and red. Jane's favorite colors are green, and blue.
The usual solution for relational database is to create two tables. One table containing Bob and Jane in their respective rows. A second table is in the form:
Bob Blue
Bob Red
Jane Green
Jane Blue
As I gain confidence in this process, I would like to contribute more to these pages
Bschmidt 21:41, 29 August 2005 (UTC)

Strictly speaking the definition of a normal form only tells you the properties of the form and not how you get there. If you look in Codd's seminal paper (http://www.acm.org/classics/nov95/s1p4.html#fig3a) you will see that he does immediately split off separate tables to get to 1NF, but this is in fact not always the proper way to proceed. Consider the following example: R1(a,R2), R2(b). If you immediately split off you get R1(a), R2(a,b) vs. simply R1(a,b) if you simply flatten. You will notice that the second form is clearly more correct. So from a theoretical point flattening is preferred (simple procedure, no exceptions, always correct) but in practice an immediate split gives you the end result more quickly although you have to pay attention to the exceptions (but almost nowhere is this explained properly). ... I really wish I had more time to work on this page. :-( -- Jan Hidders 09:56, 27 Oct 2003 (UTC)

While (in your example) R1(a,b) is "always correct" (meaning it cannot be incorrect), it may not be complete. I define "complete" as "meeting the requirements (the information-needs)". The absence of R1(a) lends itself to a non-deterministic representation of the obejctive reality, but the requirements may be for (in example) a control system. If the requirements (the information-need) calls for a list of valid values that are maintained by an end-user (assuming also that no data that lies outside that domain within R2(a,b) can be considered valid in the external reality), then the absence of a R1(a) tuple would be "incorrect" in that it is incomplete, in the ability of the database to meet the requirements.
Personally (since I deal almost exclusively with representational information systems, as opposed to control systems) I like the non-deterministic approach (flattening, as opposed to splitting). But Data modeling isn't about what we like, its about what meets the needs as understood by the requirements. So, it isn't about being "theoreticly correct", nor is it about "getting the end result more quickly", its about meeting the needs of the "customer" (the end-users and application programs that will use the resulting database). KeyStroke 11:44, 2004 Sep 22 (UTC)

Sixth normal form?

There seems to exist a sixth normal form according to the quote from Chris Date:

I'm guilty myself of recently introducing a new sixth normal form (6NF)! However, I can of course defend my use of that term; it goes beyond what I called "classical normalization theory" above, and extends that theory in a way that really is "academically respectable." See the book TEMPORAL DATA AND THE RELATIONAL MODEL, by myself, Hugh Darwen, and Nikos A. Lorentzos, published last month by Morgan Kaufmann.

http://www.pgro.uk7.net/fnf_0309.htm

Definition of "tuple"

The definition of a tuple is more complicated than it needs to be and not consistent with that used in other branches of maths

It would be better to define a tuple simply as a vector. Also a relation is not a tuple as defined in the article. - Plastic rat

I want to encourage you to improve the definition. However, please keep in mind that the purpose of a definition is to describe and explain the concept in terms that are simpler, and more familiar, with the reader. While you may be quite familiar with why it makes sense to describe a tuple as a vector, that won't make any sense to people who are not as familiar with math concepts as you are. We have to constantly remind ourselves that we aren't contributing so that we can demonstrate that we know more, or are smarter, or better educated than others. We contribute so that others may know more, and become smarter, and better educated than they are before reading what we contribute. This means that (to be effective to the purpose) we have to "dumb-down" the definitions, as opposed to (from the readers perspective) making them more obscure. KeyStroke 11:27, 2004 Sep 22 (UTC)

I want to support the above point: "...THAT WE AREN'T CONTRIBUTING SO THAT WE CAN DEMONSTRATE THAT WE KNOW MORE, OR ARE SMARTER, OR BETTER EDUCATED THAN OTHERS. WE CONTRIBUTE SO THAT OTHERS MAY KNOW MORE, AND BECOME SMARTER, AND BETTER EDUCATED THAN THEY ARE BEFORE READING WHAT WE CONTRIBUTE..." Is this copied from the editorial guide? If it is not canon then it should be. [irony intended :)] Can somebody please promote this quote to the Wikipedia masthead?

Sorry about editing your post, I wanted to remove the shouting. Well, off to research tuples. — Ambush Commander(Talk) 19:20, August 24, 2005 (UTC)

Dumbing it down -or- "Wikipedia for Dummies"

I visited this page to get an informed overview of the topic of database normalization. I was rather dismayed to find this overview limited to a single sentence: "Database normalization is a series of steps followed to obtain a database design that allows for consistent storage and efficient access of data in a relational database. These steps reduce data redundancy and the risk of data becoming inconsistent."

While I cannot possibly find fault with this opening statement, the remainder of the article was rather beyond me, and, I suspect, many others. I concede the point that someone searching Wikipedia looking for the topic of database normalization is likely to have at least a moderately informed understanding of database theory. However, this article, as currently written, does not, in my view, adequately explain the basic concept. I believe a brief discussion of the steps mentioned in the opening paragraph be added to this topic's definition.

In essence, I wanted to add my support of KeyStroke's plea for a more generally-understandable article, at least in the opening definition.

Seconded again. I used to think I understood relational databases, and then I read this thing. I think the problem is that it tires too hard to sound like an encyclopedia. While I realize that Wikipedia articles are supposed to be written in an encyclopedic tone, something as (apparently) complicated as this doesn't lend itself to high-flung language. I need examples or I'll have no idea what's going on; if we have to sacrifice some of the tone to make an understandable article, then so be it. Starwiz 20:33, Feb 5, 2005 (UTC)
I decided to mark the article for technical cleanup; feel free to discuss that decision here. Starwiz 22:41, Feb 18, 2005 (UTC)
What's the practicality for the use of this template? Isn't the evaluation of what's too technical going to vary by each individual reader? For example, if I (as a non-principal author) can understand the content of an article, is that grounds enough for removing it, or do we have to put it to a vote every time we want to remove such a tag? Over a month has gone by since the tag was added. Examples have been added. Can the template be removed? - Bevo 15:17, 22 Mar 2005 (UTC)
Wow, no fooling. This thing is unreadable. I don't think one needs a crash course in set theory to understand database normalization. I've moved the math-y content over to relational model, though perhaps those issues deserve their own article? Metaeducation 12:44, 30 May 2005 (UTC)
A suggestion by me is to take a UNF database example, and normalise it down to 3NF. -x42bn6 Talk 03:20, 12 October 2005 (UTC)

Addition: The About.com's article on Database Normalization could serve as a reference for this problem.

Some things to do?

The below was inlined into the article, and from what I could tell it was describing tasks to elaborate upon in-line in the article. I think it is preferable to keep that to-do list here. Metaeducation 12:44, 30 May 2005 (UTC)

Concentrate on the constraints that hold for individual relations, i.e., the relation constraints. The purpose of these constraints is to describe the relation universe, i.e., the set of all relations that are allowed to be associated with a certain relation name.

1NF

  • A relation is in 1NF if and only if all underlying domains contain scalar values only.
  • define NFNF relations
  • how to transform NFNF relations (also called UNF relations) to 1NF relations
    • how to transform the key constraints of nested relations
    • how to transform the functional dependencies of nested relations

3NF

  • example: Given this relation (#A, B, C) in 2NF where #A->B, #A->C and B->C (first two are fixed by 2NF). This relation isn't in 3NF because C is functionally dependent from B. This relation should be descomposed into (#A, B),(#B, C) or (#A, B), (#A, C) to be in 3NF.
  • (rissanen's rule ? )
  • how to transform from 2NF to 3NF
  • def: dependency preserving
  • can always be reached while staying dependency preserving

BCNF

  • example
  • how to transform from 3NF to BCNF
  • can not always be reached in a dependency preserving way.

4NF

  • example
  • how to transform from BCNF to 4NF

Multi-valued and join dependencies

  • def Given a relation (A,B,C) there is a multi-valued dependency between A and B noted by A->>B if and only if the set of values of B, given a fixed value (a: A, c: C), only depends on a and is independent of c.
  • example
    • trivial multi-value dependency (X->>Y is trivial if X+Y contains all attributes or Y is a subset of X)
  • reasoning rules for MVDs (it only can happen if the relation have at least tree attributes, all functional dependency implies multi-valued dependeny but not in the other direction)
  • def join dependency
  • example
  • reasoning rules for JDs
  • when is join dependency implied by key constraints?
  • Fagin's theorem: A relation (A,B,C) can be decomposed lossless in (A,B) and (A,C) if and only if A->> B | C is true in (A,B,C).
  • relationship between JDs and MVDs

5NF

  • example
  • from 4NF to 5NF
  • explain that ultimate normal form that can be reached with projections

Domain-key normal form

  • Add a section for domain-key normal form, if anyone knows what this even is. Metaeducation 12:44, 30 May 2005 (UTC)
  • Domain-key normal form (or DK/NF) requires that all constraints follow from the domain and the key constraints.
  • 1NF, 2NF, 3NF, BCNF, 4NF, 5NF are all subsets of the general const domains and keys.
  • has no modificational anomalies.

Other

  • embedded dependencies.
  • dependencies as statements in first-order logic.


Extra entities in 3NF?

Right now it says:

At 3NF it may become necessary to create additional entities in order to normalize relations beyond 3NF. From an implementation standpoint, this is not always desirable due to possible performance issues, and sheer complexity. See denormalizaiton.

Can anyone explain what this refers to? What entities are we talking about here? Relations? Columns? And why are they necessary? -- Jan Hidders 17:39, 17 Nov 2004 (UTC)

  • Existing tables are split into multiple tables which must then be re-joined each time a query is issued. Joining the tables requires more work by the server than just spitting out rows. It's also more complicated, but it saves a lot of headaches when it comes to data integrity. --Teknic

Symbols

I removed this text regarding technical cleanup from the top of the article. It was added by 83.248.150.251:

Cleanup addition: Also, where do you find definitions of symbols like "->" and "->>"? The content cannot really be understood without having those defined

In response I would say that these symbols should be explained in Relational model or Relational database if they are not already. I'll leave that up to somebody that knows the subject better. --Teknic

1NF

Elmasri / Navathes "fundamentals of database systems" seems to disagree on 1NF (see page 486.)

It states, regarding a relation with the attribute "DEPARTMENTLOCATIONS" (which holds non-atomic values such as {Bellaire, Sugarland, Houston)): "If a maximum number of values is known for the attribute-for example if it is known that at most three locations can exist for the department-replace the DLOCATIONS attribute by three atomic attributes: DLOCATION1, DLOCATION2 and DLOCATION3. This solution has the disadvantage of introducing null values if most departments have fewer than three locations."

This would seem to contradict with the article which reads: "ITEM_1_NAME ITEM_1_PRICE ITEM_1_QUANTITY ITEM_2_NAME ITEM_2_PRICE ITEM_2_QUANTITY ... ITEM_N_NAME ITEM_N_PRICE ITEM_N_QUANTITY The attributes for holding information about each Item on the Order are repeated for the number of different Items ordered."

Obviously the solution given by EN isn't one you would want to see in a real database, and they themselves make this clear, but it does contradict with the articles definition of 1NF. Is EN wrong? is there a canonical definition of 1NF somewhere? Can someone enlighten me?  :-? GeorgeBills 13:22, 31 May 2005 (UTC)

The term normal form suggests that for a given level of knowledge about the relationships in a database, there is only one schema that fits the definition. (Or so I would read it.) The allowance of null values, since they do not "count", runs counter to the idea. (It sounds a bit like saying that 10 is a canonical representation of "ten", but so is 10.0) I'm only just starting to grasp the idea of schema = f(normal form type, set theory relationships of quantities involved in database); that sounds cool in my head, but is it true? But if it's true, I'd think that only one schema can come back for two parameters. This means null values are out. Right? Metaeducation 08:35, 2 Jun 2005 (UTC)

Problem with 4NF example

I'm not convinced by the example given of fourth normal form. It seems to me that the job category and location fields are not independent - i.e. an employee fills a job category at a particular location (and might fill a different job category at a different location). Consider an example where there are two employees, two job categories and two locations. The data might be set up in the original table as follows:

EMPLOYEE_ID JOB_CATEGORY_ID LOCATION_ID
Anne Supervisor Monaco
Anne Consultant Berlin
Bill Consultant Monaco
Bill Supervisor Berlin

Or it might be thus:

EMPLOYEE_ID JOB_CATEGORY_ID LOCATION_ID
Anne Consultant Monaco
Anne Supervisor Berlin
Bill Consultant Monaco
Bill Supervisor Berlin

Either way, the proposed changes to the schema would end up with the following three tables:

EMPLOYEE_ID JOB_CATEGORY_ID
Anne Consultant
Anne Supervisor
Bill Consultant
Bill Supervisor
EMPLOYEE_ID LOCATION_ID
Anne Monaco
Anne Berlin
Bill Monaco
Bill Berlin
JOB_CATEGORY_ID LOCATION_ID
Consultant Paris
Consultant Berlin
Supervisor Paris
Supervisor Berlin

The information about which job is done by which employee where has been lost. All we know now is that both employees work in both locations, both employees perform both job roles, and that both job roles are done at both locations.

There is a similar example given by [William Kent http://www.bkent.net/Doc/simple5.htm#label4], except there the point is clearly made that the two properties of the employee have to be independent in order to require a change to put the database in 4NF. I propose to change the example to one where the properties of the employees are independent.

Go ahead. I recently did a lot of formatting and "dumbifying" of the text, so some parts might have gotten lost in the translation. — Ambush Commander(Talk) 16:42, August 10, 2005 (UTC)

NFs are just common sense, with rigor

This was a real "oh-HO" moment that I got from Date's classic text. I think this might help provide an overview explanation that's more accesible. The core of it is that databases can have structural problems that may lead to errors - losing data or creating false apparent data. The normal forms are just formalizied versions of pretty simple ideas. The expression becomes less simple exactly because they are formalized - they either become longer (to cover all the edge cases) or use jargon (which may do that more briefly, but only if you know what all the terms mean). The reason NFs are important is that they more or less mechanize the process of identifying and fixing the structural flaws.


1NF Simplified

The first normal form is now in real layman terms. It offers plenty of examples, and is easily understood. Let's go for this sort of clarity with 2NF and 3NF. Good work guys! — Ambush Commander(Talk) 22:02, 2 October 2005 (UTC)

It does however have the problem that what is now written under that header has nothing to do with being in 1NF. Is somebody going to fix that? -- Jan Hidders 21:30, 24 January 2006 (UTC)
You're right. I think we've gone too far in simplifying it. We should first offer a concise definition (both formal and informal), then give the example. Hmm... — Ambush Commander(Talk) 21:47, 24 January 2006 (UTC)