We have Normality

A Simple Guide to Database Normalization

Created by Kevin Feasel

Who Am I? What Am I Doing Here?

Catallaxy Services
@feaselkl
Curated SQL
We Speak Linux

What Is Normalization?

Technical Definition

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.”

Source: Wikipedia

What Is Normalization?

In other words...

Normalization is:

  • Consolidating "redundant" (duplicate) attributes
  • Ensuring that an entity's attributes truly belong

Why Normalize?

There are several reasons to normalize your SQL Server databases. For example, normalization will help you with...

Data Integrity

Each normal form is designed to prevent some kind of data synchronization problem. The better-normalized your tables are, the less likely you are to have data consistency issues.

Eliminate Duplicate Data

Normalization reduces the chance of update anomalies: data being correct in one row, column, or table but being incorrect in another.

In addition, eliminating duplicate data reduces database disk space requirements and ensures that a higher percentage of our database pages can be stored in memory. Pages in memory are significantly faster than pages on disk, even with SSDs.

Avoid Unnecessary Coding

You won't need to update several tables to get all of the columns which may contain an attribute's values. This also leads to fewer "the app can't handle this" ad hoc scripts to write.

Improve Performance

Normalize until it hurts; denormalize until it works

This quotation might have made sense in the 1980s, but is mostly invalid with modern hardware and software. Remember that SQL Server is designed around normalized table structures, meaning that highly abnormal structures (e.g., regularly shredding XML within T-SQL) can actually cause significant performance problems.

Denormalization as a performance enhancement only works after you actually have a normalized data set. Otherwise, your "optimization" will likely be suboptimal.

Denormalization also adds complexity to using and maintaining your data.

Normal Forms

Six normal forms are relatively common. There are additional normal forms (6NF and up) but these remain academic.

The First Normal Forms

“Non-key attributes must provide a detail about the key, the whole key, and nothing but the key.”

This statement describes 1NF (the key), 2NF (the whole key), 3NF (nothing but the key), and BCNF. When you follow all of the rules of the above statement, you will be in (at least) Boyce-Codd Normal Form.

We are discussing only single-valued facts--that is, 1:1 or 1:M relationships. We forbid M:N relationships as a rule. To implement an M:N relationship, we need three tables: a 1:M table, a 1:N table, and a junction table to join the two together.

1st Normal Form

1st Normal Form has four basic requirements:

  1. Records have a consistent shape
  2. All instances of an entity must be unique
  3. Attributes are atomic in nature
  4. No repeating groups of attributes exist

Each row has the same number and structure of columns.

1st Normal Form

1st Normal Form has four basic requirements:

  1. Records have a consistent shape
  2. All instances of an entity must be unique
  3. Attributes are atomic in nature
  4. No repeating groups of attributes exist


We should use primary keys and unique constraints (on non-primary candidate keys) to guarantee uniqueness.

Simply having a surrogate key does not make a record unique! We should have a unique natural key.

1st Normal Form

1st Normal Form has four basic requirements:

  1. Records have a consistent shape
  2. All instances of an entity must be unique
  3. Attributes are atomic in nature
  4. No repeating groups of attributes exist

There is a single value in every single attribute of every single instance of an entity.

More On Atomicity

Atomicity focuses on the lowest level of analysis for a particular attribute.


Thought experiment: is "(614) 555-1234" atomic?

  • Do you determine geographical region based on area code and exchange?
  • Do you intend to analyze the parts of phone numbers as opposed to the whole?
  • Do you accept different types of phone numbers? Many foreign cell phones have 8-digit numbers, plus country and area code.

More On Atomicity

Based on our thought experiment, we can see that an attribute may be atomic for one system but not for another. Here is another example: a mailing label might be okay with the name "Bob Smith," but our form letter should be addressed to "Dr. Smith." If this is true, we should have more than just a Name field!


A common mistake is to assume that the database structure should mimic the user interface (or vice versa). Correct database design focuses not only on how people enter the data, but also how people will later use the data and what the data actually represent.

1st Normal Form

1st Normal Form has four basic requirements:

  1. Records have a consistent shape
  2. All instances of an entity must be unique
  3. Attributes are atomic in nature
  4. No repeating groups of attributes exist


An example of a repeating group of attributes is multiple payment records: Payment1, Payment2, etc.

In this case, we have the same attribute (Payment) show up in multiple guises.

Thought experiment: would HomeTelephoneNumber, WorkTelephoneNumber, and CellPhoneNumber be a repating group of attributes?

1NF No-Nos

Here are some common 1NF problems and how to fix them.

Customer table
Name
Payment1
Payment2

Payment1 and Payment2 form a repeating group.

How to fix:

Create a CustomerPayment table with the Customer primary key, plus a single Payment

1NF No-Nos

Here are some common 1NF problems and how to fix them.

Customer 1 Customer 2
NameName
AddressAddress
Hair ColorPhone Number

Two instances of the customer entity have differing attributes.

This violates the principle of shape: each instance of our entity (row in our table) should have the same number and structure of columns.

How to fix:

Not a problem in SQL Server!

It may be OK to have both attributes on the Customer entity, or you might create a sub-class relationship.

1NF No-Nos

Here are some common 1NF problems and how to fix them.

Name Phone Number
Bob Smith(614) 555-BOBB
Bob Smith(614) 555-BOBB

Each instance of an entity should be unique. Adding a surrogate key does not fix this!

How to fix:

Find a unique set of attributes for your natural key. Add attributes if necessary.

Sometimes, there is no natural key:

  • Mass-produced goods may differ only by a surrogate product key
  • Multiple, unique log entry events may fire simultaneously

1NF No-Nos

Here are some common 1NF problems and how to fix them.

Name XMLBlob
Bob Smith<Employee Man#="123" Birthdate="Nov.22">

XML is okay if you treat it as an atomic value. Once you start shredding it inside SQL Server as part of regular application logic, you violate 1NF.

How to fix:

  • Create appropriate entities and attributes
  • Shred in the business or application layer

Summary: Why Use 1NF?

  • Eliminate certain types of update anomalies

Given columns Payment1 and Payment2, what happens when Payment1 is removed? What if there's a third payment? What if Payment2 has a value but Payment1 doesn't?

Summary: Why Use 1NF?

  • Eliminate certain types of update anomalies
  • Handle multiple attribute values more efficiently

Given a Person table with an e-mail address like like bob@smith.com;bobsmith@hotmail.com, suppose Bob moves to GMail. Code to make this update:


UPDATE dbo.Person
SET EmailAddress = REPLACE(EmailAddress, 'bobsmith@hotmail.com', 
			'bobsmith@gmail.com') 
WHERE ';' + EmailAddress + ';' LIKE '%;bobsmith@hotmail.com;%'
						

This code is harder to read, more difficult to maintain, and impossible to index.

Not normalizing quickly leads to ugly SQL.

Summary: Why Use 1NF?

  • Eliminate certain types of update anomalies
  • Handle multiple attribute values more efficiently
  • Reduce awkwardness

Violations of 1NF make it harder to store, retrieve, and manipulate data. This leads to more complicated, worse-performing code.

Hints That You May Not Be In 1NF

  • Delimiters included in records (e.g., comma, semi-colon, pipe)

This is not always the case: think of narrative text. Also, sometimes, a list is a single element from the system's perspective. For example, we may not need to split out credentials after a person's name if our system never analyzes based on individual credentials.

Hints That You May Not Be In 1NF

  • Delimiters included in records (e.g., comma, semi-colon, pipe)
  • Attribute name with a number at the end (e.g., Phone1, Phone2)

Question for discussion: what about address lines 1 and 2? What if our address table has a line 3?

Hints That You May Not Be In 1NF

  • Delimiters included in records (e.g., comma, semi-colon, pipe)
  • Attribute name with a number at the end (e.g., Phone1, Phone2)
  • Your queries do a lot of LIKE '%*%' parsing

Parsing indicates that you're interested in part of the contents of an attribute, which is a strong indicator that you may want to split this out into multiple attributes or entities.

2nd Normal Form

We are in 2nd Normal Form if we are in 1st Normal Form and each non-key attribute depends upon the whole candidate key.

Quick reminder: we focus on natural keys, not surrogate keys. Even if we have a surrogate key on a table, we normalize against natural candidate keys.

2NF Example

Not In 2NF

IT Specialist Project Task Assigned Login Department
Bob Funding Re-write app 2013-07-18 BobS Finance
Ken R Install 2013-08-15 KenH Research
Ken Funding Create server 2013-07-18 KenH Finance

IT Specialist + Project + Task is a valid candidate key for this project task table. Not all attributes describe IT Specialist + Project + Task, however: Login is an attribute only for an IT Specialist, and projects are per-Department (in our example).

2NF Example

In 2NF

IT Specialist Login
Bob BobS
Ken KenH
Ann AnnE
Project Department
Funding Finance
R Research
IT Specialist (FK) Project (FK) Task Assigned
Bob Funding Re-write app 2013-07-18
Ken R Install 2013-08-15
Ken Funding Create server 2013-07-18

In this case, we define a Task as being assigned to one person. If multiple people may be assigned to the same task, the structure changes by adding a Task table and adding another foreign key.

Why Use 2NF?

  • Interview with an anomaly

Suppose we have an application which lets you edit a record. Suppose further that Ken's login is wrong and should be KenS instead of KenH. We update the R task and hit Save. Everything's good, right?

Why Use 2NF?

  • Interview with an anomaly
  • Discom-bob-ulation

Bob leaves, but we still need to re-write the Funding app and we aren't sure who will do the re-write at the time he leaves. If we delete the record, we no longer know in the application that we need to re-write Funding.

Why Use 2NF?

  • Interview with an anomaly
  • Discom-bob-ulation
  • Duplication

Now we have two people assigned to a Task. Bob and Ken are assigned to the same task on the same day. They then leave to start a synchronized swimming team. Ann takes over the task by herself. If we update the records to show that Ann has this task, we will end up with duplicate records (violating 1NF).

Why Use 2NF?

  • Interview with an anomaly
  • Discom-bob-ulation
  • Duplication
  • Disappearance

Speaking of Ann, where was she in our first example? She had no tasks assigned, and so could not exist according to our system.

Hints That You May Not Be In 2NF

  • Repeating part of a key prefix in other attributes

In your Book table, you have AuthorFirstName, AuthorLastName, AuthorSocialSecurityNumber, and BookISBN. First name, last name, and SSN belong to the author, not to the book.

Hints That You May Not Be In 2NF

  • Repeating part of a key prefix in other attributes
  • Repeating combinations of attributes

Whenever you see one value in attribute X, you see a particular value in attribute Y. Attribute Y has a functional dependency on attribute X if the value of attribute X determines the value of attribute Y.

In our 2NF example, Bob as IT Specialist always has BobS as the Login. Login is functionally dependent upon IT Specialist because you create a login for an IT specialist.

Hints That You May Not Be In 2NF

  • Repeating part of a key prefix in other attributes
  • Repeating combinations of attributes
  • Composite keys without foreign keys

In your Book table, you have AuthorSocialSecurityNumber, but no Author table.

This may lead to the occasional false positive. Think of a phone number: its primary key has CountryCode, AreaCode, Exchange, Number, and Extension, but we don't need tables going back to five separate tables. If we added an AreaCodeGeographicLocation, however, we would see a 2NF violation because it depends upon part of the primary key.

Hints That You May Not Be In 2NF

  • Repeating part of a key prefix in other attributes
  • Repeating combinations of attributes
  • Composite keys without foreign keys
  • You write code designed to fix 2NF problems

Examples of 2NF-fixing code:

  • Saving one row involves potentially updating multiple rows
  • Dummy values exist to make sure that some attribute values don't disappear

3rd Normal Form

An entity is in 3rd Normal Form if that entity is in 2nd Normal Form and there are no transitively dependent non-prime attributes.

In other words, an entity is in violation of 3NF if there is a non-key attribute which is a fact about another non-key attribute.

3NF Example

Employee
EmployeeNumber(PK)
Name
Department
DepartmentLocation

In the example above, the DepartmentLocation attribute describes Department, which is a non-key attribute of the Employee entity.

To fix this problem, we need to create a Department entity and move DepartmentLocation out to it.

Why Use 3NF?

  • Relocation

If the department moves, you need to update a number of records. If the entity is not in 3NF, you may miss some records on accident.

Why Use 3NF?

  • Relocation
  • Representation

In the previous example, you cannot represent departments or locations without employees. What about a new department which doesn't have any employees yet? Do you create a dummy employee record? Note that this is very similar to a 2NF problem, except this time the problem is that our non-key attributes need dummy data to exist.

Hints That You May Not Be In 3NF

  • Multiple attributes with the same prefix

For example, PublisherName and PublisherCity on the Book table.

Warning: the prefix may not be there, or might be different. For example, a Book table could have attributes named LocationPub, PubName, PblCity, PublicationState, and ZIPCode. All of these really relate to a publisher, even though they don't have the same prefix.

Hints That You May Not Be In 3NF

  • Multiple attributes with the same prefix
  • Repeating groups of data

Each Book has PublisherName and PublisherLocation. Whenever PublisherName is "Red Gate," PublisherLocation is "London." PublisherLocation has a functional dependency upon PublisherName, and this is a 3NF violation.

Warning: if, for a given PublisherName, some PublisherLocation values are "123 Main St., New York" and others are "123 Main Street, NYC" we have logically equivalent results, but technically no functional dependency--the locations are technically "different." This is still a violation of 3NF, even without a true functional dependency.

Boyce-Codd Normal Form

Boyce-Codd Normal Form (BCNF) generalizes 2NF and 3NF. In order to have an entity be in BCNF...

  • The entity is already in 1NF

Each normal form builds upon prior normal forms. BCNF does not build on 2NF or 3NF, but instead builds on 1NF.

Boyce-Codd Normal Form

Boyce-Codd Normal Form (BCNF) generalizes 2NF and 3NF. In order to have an entity be in BCNF...

  • The entity is already in 1NF
  • All attributes are fully dependent upon a key

The key differentiator between BCNF and 2NF/3NF is that BCNF focuses on multiple candidate keys. When there is only one candidate key, 2NF + 3NF = BCNF, but with multiple candidate keys, BCNF handles scenarios that 3NF would fail at.

Boyce-Codd Normal Form

Boyce-Codd Normal Form (BCNF) generalizes 2NF and 3NF. In order to have an entity be in BCNF...

  • The entity is already in 1NF
  • All attributes are fully dependent upon a key
  • Every determinant is a key

A determinant is: Any attribute or combination of attributes on which any other attribute or combination of attributes is functionally dependent

In other words, if attribute Z depends upon attributes X and Y, attributes X and Y form a determinant. If the combination of X and Y are not a candidate key, we are not in BCNF.

BCNF Example

Student
SSN (PK)
StudentName
Address
HighSchoolCode (PK)
HighSchoolName
HighSchoolCity
GPA
Priority

In this scenario, we have a Student entity with several attributes. Our only key is { SSN, HighSchoolCode }. We have three functional dependencies:
1) SSN --> StudentName, Address, GPA
2) GPA --> Priority
3) HighSchoolCode --> HighSchoolName, HighSchoolCity

Are we in BCNF? If so, every left-hand side of a functional dependency is a key. Answer: No. None of our functional dependencies describe our candidate key!

BCNF Example

HighSchool
HighSchoolCode (PK)
HighSchoolName
HighSchoolCity
LeftoverStudent
SSN (PK)
StudentName
Address
HighSchoolCode (PK)
GPA
Priority

To get to BCNF, we pick a functional dependency and create two relations, one with the functional dependency, plus one with the rest + LHS.
HighSchool has a key of HighSchoolCode and a functional dependency of HighSchoolCode --> HighSchoolName, HighSchoolCity. HighSchool is in BCNF.
LeftoverStudent still needs some work.

BCNF Example

HighSchool
HighSchoolCode (PK)
HighSchoolName
HighSchoolCity
GPA
GPA (PK)
Priority
LeftoverStudent
SSN (PK)
StudentName
Address
HighSchoolCode (PK)
GPA

To get to BCNF, we pick a functional dependency and create two relations, one with the functional dependency, plus one with the rest + LHS.
GPA has a key of GPA and a functional dependency of GPA --> Priority. GPA is in BCNF.
LeftoverStudent still needs some work.

BCNF Example

HighSchool
HighSchoolCode (PK)
HighSchoolName
HighSchoolCity
GPA
GPA (PK)
Priority
Student
SSN (PK)
StudentName
Address
GPA
StudentSchool
SSN (PK)
HighSchoolCode (PK)

Student has a key of SSN and a functional dependency of SSN --> StudentName, Address, GPA.
StudentSchool has a key of SSN, HighSchoolCode.
At this point our model is in BCNF.

BCNF Example #2

BugID Tag TagType
1234 Crash Impact
3456 Printing Subsystem
3456 Crash Impact
5678 Report Subsystem
5678 Crash Impact
5678 Data Fix
Candidate Keys
{ BugID, Tag }
{ BugID, TagType }

Functional Dependencies
{ BugID, Tag } --> TagType
{ BugID, TagType } --> Tag

BCNF is "stricter" than 3NF. 3NF tells us to strip out columns not functionally dependent upon the primary key. BCNF tells us that all non-key attributes depend upon the whole key, not just part of it.

In our example above, we choose { BugID, Tag } as the PK.

BCNF Example #2

BugID Tag TagType
1234 Crash Impact
3456 Printing Subsystem
3456 Crash Impact
5678 Report Subsystem
5678 Crash Impact
5678 Data Fix
BugID Tag TagType
1234 Crash Impact
3456 Printing Subsystem
3456 Crash Impact
5678 Report Subsystem
5678 Crash Subsystem
5678 Data Fix

Our 3NF table fails to protect us against an update anamoly--remember that BugID and TagType should be unique!

BCNF Example #2

BugID Tag
1234 Crash
3456 Printing
3456 Crash
5678 Report
5678 Crash
5678 Data
Tag TagType
Crash Impact
Printing Subsystem
Report Subsystem
Data Fix
Functional Dependencies
{ Tag } --> TagType

To make this BCNF-friendly, we need to split the entity into two entities. Now we can protect against that type of update anomaly. Notice the new functional dependencies!

BCNF Example #3

Number Member Advisor
101 Johnson Brown
101 Michaels Brown
101 Stewart Graham
492 Johnson Thomas
492 Altman Thomas
492 Michaels Andrews
Candidate Keys
{ Number, Member }
{ Member, Advisor }

Functional Dependencies
{ Number, Member } --> Advisor
{ Member, Advisor } --> Number

This is a valid 3NF design: all functional dependencies include prime columns on the left-hand side, regardless of which primary key we choose.

BCNF Example #3

Number Member Advisor
101 Johnson Browning
101 Michaels Browning
101 Stewart Graham
492 Johnson Thomas
492 Altman Thomas
492 Michaels Andrews
Candidate Keys
{ Number, Member }
{ Member, Advisor }

Functional Dependencies
{ Number, Member } --> Advisor
{ Member, Advisor } --> Number

Think about data problems we could have:

  1. Brown marries and becomes Browning --> Need to update 2 rows.
  2. Stewart resigns from Project 101; how can we know Graham is still an advisor on this project?

BCNF Example #3

Advisor (PK) Project
Brown 101
Graham 101
Thomas 492
Andrews 492
Advisor (PK) Member (PK)
Brown Johnson
Brown Michaels
Graham Stewart
Thomas Johnson
Thomas Altman
Andrews Michaels
Member (PK)
Johnson
Michaels
Stewart
Altman

We fixed the data problems, but introduced a new issue: we had a business rule saying that, on a given project, one member works only with one advisor. Now we have no easy way of maintaining this rule.

The Final Normal Forms

4NF and 5NF deal with multi-valued facts (i.e., dependencies in entities) and composite keys. The driving principle behind 4NF and 5NF is to minimize the number of fields involved in a composite key.

4th Normal Form

To be in 4th Normal Form, an entity should be in BCNF and a record type should not contain two or more independent, multi-valued facts about an entity. In other words, no more than one attribute can store multiple values that relate to a key in any entity. Don't repeat single-valued attributes for each multi-valued attribute.

You know you are in 4NF when any ternary relationship decomposition is lossy when broken down into two binary relationships.

4NF Example

ClassAssignment
Class Teacher Student
101 Janet Mark
101 Janet Edgar
101 Janet Marie
201 Fred Jake
201 Fred Wilhelm
201 Fred Elaine
ClassAssignment
Class Teacher Student
101 Janet Mark
101 Janet Edgar
101 Janet Marie
201 Janet Jake
201 Fred Wilhelm
201 Fred Elaine

The problem is that if we want to change a particular class's teacher, we need to update more than one record. This can lead to an update anomaly, with some students having the correct teacher and some not.

Fred went on sabbatical, so Janet is covering the 201 course. Unfortunately, we missed some records due to an update anomaly.

4NF Example

ClassTeacher
Class
Teacher
ClassStudent
Class
Student

Now we can update class information, class teachers, and class students easily (i.e., we update one record). This eliminates that class of update anomaly.

This table structure also helps us if we ever have two-teacher classes.

Indicators of 4NF Violation

  • Ternary Relationships

RoomSessionPresenter
Room
Session
Presenter
Time

Duplicate data can occur if two people do the same presentation session in the same room, or if one person does the same session in two rooms.

Breaking this down into tables (e.g., RoomPresenter and RoomSession) can be a little tricky. If you break down the tables wrong, you might get lossy decomposition. You need to be able to re-build your pre-4NF results from your 4NF tables, and based upon how you define your junction tables, this may not be possible.

4th Normal Form

Indicators of 4NF Violation

  • Ternary Relationships
  • Lurking multi-valued attributes

PersonAddress
PersonID
LocationAddressID
MailingAddressID
BillingAddressID

A Person may have several addresses or telephone numbers. Suppose we have a table like on the left. There are several potential problems here:

  • Can a person have two location addresses?
  • Can a person have no mailing address?
  • What if we have a new type of address we need to add?

4th Normal Form

Indicators of 4NF Violation

  • Ternary Relationships
  • Lurking multi-valued attributes

PersonAddress
PersonID
AddressID
Type

We get rid of the multi-valued dependency. We can now have many addresses independent from one another, and can add new types without re-defining the entity structure.

Retrieving data from this structure is a little bit harder, however--we need to join three tables together rather than two.

4th Normal Form

Indicators of 4NF Violation

  • Ternary Relationships
  • Lurking multi-valued attributes
  • Status & other attributes for which we need to know previous values

Order
OrderID
OrderStatusTypeID
TakenDate
VerifiedDate
SentToShippingDate
FulfilledDate
BilledDate
PaymentReceivedDate

This table complies with BCNF, but there can be some problems:
- How do we note that verification failed once and was corrected?
- What values do we enter if an order was cancelled?
- What if payment is received before the order is fulfilled? What is the status then?

4th Normal Form

Indicators of 4NF Violation

  • Ternary Relationships
  • Lurking multi-valued attributes
  • Status & other attributes for which we need to know previous values

Order
OrderID (PK)
OrderStatusType
OrderStatusTypeID (PK)
Name (AK1)
OrderStatus
OrderStatusID (PK)
OrderStatusTypeID (FK) (AK1)
OrderID (FK) (AK1)
EffectiveDate (AK1)

This table structure is now 4NF-compliant. We can satisfactorily answer our previous questions.

4NF Example

Bad Solution

Mr. Smith can cook and type, and is fluent in German, French, and Greek. These are independent, multi-valued attributes. How do we best represent this?

Employee Skill Language
Smith Cook
Smith Type
Smith French
Smith German
Smith Greek

The disjoint format leads to several questions.
- What do the blanks actually mean?
- What if a person doesn't know any foreign languages?

4NF Example

Bad Solution

Mr. Smith can cook and type, and is fluent in German, French, and Greek. These are independent, multi-valued attributes. How do we best represent this?

Employee Skill Language
Smith Cook French
Smith Type German
Smith Greek

The Minimal (with NULL) format has its own problems.
- Does this mean that Smith can cook French cuisine and type German?
- What about Greek?
- What if he forgets German?

Remember that these are independent attributes! This table structure obscures reality and confuses the reader.

4NF Example

Bad Solution

Mr. Smith can cook and type, and is fluent in German, French, and Greek. These are independent, multi-valued attributes. How do we best represent this?

Employee Skill Language
Smith Cook French
Smith Type German
Smith Type Greek

The Minimal (with repetition) format has its own problems.
- Does this mean that Smith can cook French cuisine and type in German and Greek?
- What if he forgets German?
- What if he forgets how to type?
- What if he forgets French?

Remember that these are independent attributes! This table structure obscures reality and confuses the reader.

4NF Example

Bad Solution

Mr. Smith can cook and type, and is fluent in German, French, and Greek. These are independent, multi-valued attributes. How do we best represent this?

Employee Skill Language
Smith Cook French
Smith Type
Smith German
Smith Type Greek

The unrestricted format has major problems. Think about how you would even begin to retrieve this data. We have NULL values, repetition, and confusing combinations.

4NF Example

Good Solution

Mr. Smith can cook and type, and is fluent in German, French, and Greek. These are independent, multi-valued attributes. How do we best represent this?

Employee Skill
Smith Cook
Smith Type
Employee Language
Smith French
Smith German
Smith Greek

Even though skill and language are both dependent upon an employee (and thus OK for 3NF and BCNF), we want to separate these two multi-valued facts into separate tables. This minimizes the number of records necessary and prevents the above confusion.

Why Use 4NF?

  1. If there are repetitions, updates need to be done in multiple records.
  2. Inserting a new skill may involve looking for a blank skill record, inserting a new record (with a blank language?), or even inserting multiple records pairing the new skill with some or all languages.
  3. Deleting a skill means checking languages, deleting one or more records, or blanking out the skill if this is the last mention of a particular language.

This is painful! Why bring this upon yourself when there's an easy, correct way to do it?

Hints That You May Not Be In 4NF

  • Multiple "similar but not the same" fields

Home and Work phone number would violate 4NF even if we know that a person would never have more than one home/work number. This is OK for 3NF if we assume that these are independent attributes rather than flattening out a separate relationship.

Hints That You May Not Be In 4NF

  • Multiple "similar but not the same" fields
  • Tables with independent, multi-valued attributes


We went through two separate examples of this:

  1. An Employee knows zero or more Languages and has zero or more Skills.
  2. A Class has at least one Teacher and zero or more Students.

5th Normal Form

To be in 5th Normal Form, an entity should be in 4th Normal Form and all decomposable entities are, in fact, decomposed.

The idea is that ternary relationships are safer if you can break them down into binary relationships. The problem is that not all ternary relationships are, in fact, decomposable.

In practice, 5NF differs from 4NF only when there exists a symmetric constraint.

5th Normal Form

In practice, 5NF differs from 4NF only when there exists a symmetric constraint.

For example, suppose we have an Agent, a Company, and a Product.

A symmetric constraint means:

  • IF an Agent (A) sells a Product (P)...
  • AND (A) represents a Company (C)...
  • AND (C) offers (P)...
  • THEN (A) sells (P) for (C)

5NF Example

Agent Company Product
Smith Ford Car
Smith GM Truck

In this setup, if all three fields are necessary to determine which combinations are valid and which are not, then we are already in 5NF.

5NF Example

Agent Company Product
Smith Ford Car
Smith Ford Truck
Smith GM Car
Jones Ford Car

With this data set, we see two agents (Smith and Jones) who sell various products for Ford and GM. Ford manufactures cars and trucks, whereas GM only manufactures cars. If we have a rule in which company salesmen sell all company products, then we can decompose this entity further...

5NF Example

Agent Company
Smith Ford
Smith GM
Jones Ford
Agent Product
Smith Car
Smith Truck
Jones Car
Company Product
Ford Car
Ford Truck
GM Car
GM Truck

As long as our symmetric constraint holds, everything is fine. But what happens if GM starts to sell trucks?

If our symmetric constraint holds, this means that Smith must now sell GM trucks.

If a symmetric constraint does not hold, recomposing the three-table set may give us unexpected results.

This is why 5NF is hard: you really need to understand entity relationships well to get this right.

5NF Failure

Teacher Student Class
Bob Louis Normalization
Bob Fred T-SQL
Larry Fred Normalization

Can we decompose this entity into three separate relations? Let's give it a try!

5NF Failure

Teacher Student
Bob Louis
Bob Fred
Larry Fred
Student Class
Louis Normalization
Fred T-SQL
Fred Normalization
Teacher Class
Bob Normalization
Bob T-SQL
Larry Normalization

We broke this ternary relationship down into three separate binary relationships. Mission accomplished, right?

5NF Failure

Now let's retrieve the data... First, the original data set:


SELECT
	Teacher,
	Student,
	Class
FROM
	Course;	
						
Teacher Student Class
Bob Louis Normalization
Bob Fred T-SQL
Larry Fred Normalization

5NF Failure

Now, the "normalized" data set:


SELECT
	ts.Teacher,
	ts.Student,
	sc.Class
FROM
	#TeacherStudent ts
	INNER JOIN #StudentClass sc ON ts.Student = sc.Student
	INNER JOIN #TeacherClass tc ON ts.Teacher = tc.Teacher 
		AND sc.Class = tc.Class;	
						
Teacher Student Class
Bob Louis Normalization
Bob Fred T-SQL
Larry Fred Normalization
Bob Fred Normalization

We have a phantom row when we try to re-compose this data. According to our table,
- Bob taught Fred
- Bob taught Normalization
- Fred took Normalization
Therefore, Bob must have taught Fred about Normalization. Remember symmetric constraints!

5NF Success

Agent Company Product
Smith Ford Car
Smith Ford Truck
Smith GM Car
Smith GM Truck
Jones Ford Car
Jones Ford Truck
Brown Ford Car
Brown GM Car
Brown Toyota Car
Brown Toyota Bus

At this point, you may be thinking that 5th Normal Form isn't worth the hassle. There are some cases in which it can be quite helpful, however.

This particular data set is in 4NF because you cannot decompose it into 2 tables. It is not in 5NF, however, because you can decompose it into 3 tables.

5NF Success

Agent Company
Smith Ford
Smith GM
Jones Ford
Brown Ford
Brown GM
Brown Toyota
Company Product
Ford Car
Ford Truck
GM Car
GM Truck
Toyota Car
Toyota Bus
Agent Product
Smith Car
Smith Truck
Jones Car
Jones Truck
Brown Car
Brown Bus

As long as we have a symmetric constraint in place, we can break this relationship down into three tables, ensuring that we do not have to worry about fulfilling our symmetric constraint.

Hints That You May Not Be In 5NF

  • Business rules limiting associations in ternary entity relationships exist

If we have rules which mandate a symmetric constraint (like the salesman problem) and you have business logic to add multiple rows to your ternary relationship, you probably have a 5NF violation.

If only one person may teach a specific course, our ternary relationship can successfully break down into three binary relationships.

"Overnormalization"

Overnormalization isn't really a thing. There are two scenarios, however, which may look like overnormalization.

  • Legitimate business flexibility

Suppose we have a Product with a Cost attribute. Suppose we also have an InvoiceLineItem, which lists a Product as well as a ProductCost. We see this and want to get rid of ProductCost as a violation of 3NF.

This might be correct, if ProductCost really is functionally dependent upon Product.

But let's say a manager has the flexibility to change product costs on particular invoices. In this case, we need to store the product cost for an invoice line item. This is no longer a functional dependency, but instead ProductCost is an attribute of the invoice line item itself.

"Overnormalization"

Overnormalization isn't really a thing. There are two scenarios, however, which may look like overnormalization.

  • Legitimate business flexibility
  • Lossy decomposition and phantom rows

Trying to get beyond Boyce-Codd Normal Form may lead to lossy decomposition (where you are unable to re-create the old data from your "normalized" tables) or phantom rows (where you have additional rows upon re-creation). If you cannot get back to your initial, pre-"normalized" data set, then your candidate table structure is not "over-normalized;" it is broken.

When Am I Done?

  • Columns: one column has one value

This should be relatively easy: you should not store multiple values within a single attribute. If you need to parse, split, or shred a column's value within the database layer as part of normal business operation, you haven't done this right.

Concept adapted from a Louis Davidson blog post

When Am I Done?

  • Columns: one column has one value
  • Tables and rows are unique

Each table should have an independent meaning, and each row should be distinct from every other row.

Concept adapted from a Louis Davidson blog post

When Am I Done?

  • Columns: one column has one value
  • Tables and rows are unique
  • Proper relationships between columns

A column is a key, part of a key, or describes something unique about the row identified by the key. Otherwise, that column doesn't belong.

Concept adapted from a Louis Davidson blog post

When Am I Done?

  • Columns: one column has one value
  • Tables and rows are unique
  • Proper relationships between columns
  • All dependencies scrutinized

All relationships are valid. Focus particularly on ternary relationships. Reduce all relationships to binary form whenever possible.

Concept adapted from a Louis Davidson blog post

Resources