Perfectly Normal

An Advanced Primer on Normalization

Kevin Feasel (@feaselkl)
http://CSmore.info/on/dbdesign

Who Am I? What Am I Doing Here?

Is This Talk Right for Me?

Have you ever believed any of the following?

  • "Normalization isn't relevant in the business world."
  • "Normalize until it hurts, denormalize until it works."
  • "The main purpose of normalization is to minimize disk utilization by removing duplicate data."
  • "Database design doesn't matter--add more hardware and that will fix the problem."
  • "Getting to Third Normal Form is all you need to know about normalization."
  • "Third Normal Form is important."

Motivation

My goals in this talk:

  • Explain the basics behind database normalization.
  • Dive into the two most important normal forms.
  • Briefly cover other interesting normal forms.
  • Cover critiques of normalization.

A Brief Warning

Normalization is a part of relational theory, meaning that there will be math and there will be discussions of formal logic.

I will do my best to minimize the learning curve, but we're ultimately solving math problems using logical analysis.

Also, some of these normal forms are not well-understood outside of academic circles and so your interpretation of the forms may not match what I'm going to describe.

The Source

The source of most of today's talk is C.J. Date's Database Design and Relational Theory, 2nd Edition (Apress, 2019).

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

What is Normalization?

Normalization is a set of logical rules and processes to follow.

Normalization works off of a logical data model. From there, we can (theoretically) adapt to a physical data model for implementation.

Quick Primer on Terminology

We're used to thinking in terms of tables, rows, and columns. Let's take a step back and talk about the data modeling terms.

Goals of Normalization

  • Remove redundant data
  • Fix incorrect designs
  • Achieve a design which is a good representation of the real world
  • Simplify the statement and enforcement of certain integrity constraints

The Normal Forms

From a talk I used to deliver:

The Normal Forms

The Normal Forms We Will Care About

Normal Forms Ranked by Importance

  • Most important: BCNF, 5NF
  • Sometimes important: 6NF
  • Overrated: 1NF, 4NF
  • Rarely relevant: EKNF, ETNF, RFNF, SKNF, Overstrong PJ/NF
  • Never relevant: 2NF, 3NF

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

What 1NF Is

First normal form (1NF) is all about key rules and the shape of tuples.

Given relvar R with heading H containing attributes A1...An of types T1...Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

In Other Words

UserName (CHAR) Password (BINARY) IsActive (BIT)
Bob 0xABC123 1
Jane 0xCDA331 1
Marcus 0xFFF048 0

Adding a new tuple t(Jim, 0xBBA492, 1) requires that the new tuple's heading, with attributes and types, matches the existing relvar's heading.

The Rules on 1NF

First normal form (1NF) has a few rules:

  • New tuples match the existing relvar's heading.
  • All tuples must have values for all attributes. This means NULL violates 1NF!
  • No duplicate tuples are allowed
  • Attribute order does not matter for tuples
  • Tuple order does not matter for relations
  • All attributes are regular: they have a name, a type, are not hidden, etc.

Repeating Groups

Quite commonly, depictions of 1NF will discuss the concept of "repeating groups."

An example of a relvar with a repeating group is:

R = (Name, Phone1, Phone2, Phone3)

Is this in 1NF?

From a talk I used to deliver:

What We Know That Just Isn't So

R = (Name, Phone1, Phone2, Phone3)

The answer is YES! This relvar is in 1NF!

"Repeating groups" is a term Date came up with and has subsequently recanted. The term itself was confusing and the above interpretation is something he did not intend. His latest advice is to ignore any reference to repeating groups.

Are SQL Tables in 1NF?

SQL tables can be in 1NF.

Also, SQL tables are trivially not in 1NF because they can't otherwise be.

  • New tuples match the existing relvar's heading.
  • All tuples must have values for all attributes. NULL violates 1NF!

  • No duplicate tuples are allowed

  • Attribute order does not matter for tuples

  • Tuple order does not matter for relations
  • All attributes are regular: they have a name, a type, are not hidden, etc.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

The Importance of BCNF

Boyce-Codd Normal Form is one of the two most important normal forms. It prevents a variety of data anomalies:

  • Update anomalies
  • Redundant data (some types)
  • Missing data

We Have Data

Specialist Project Task Login
Julia ETL Write scripts JuliaH
Lindsay Analysis Build spark cluster LindsayM
Lindsay ETL Write scripts LindsayM

Update Anomalies

Lindsay gets married and changes her last name. But we miss it in one project.

Specialist Project Task Login
Julia ETL Write scripts JuliaH
Lindsay Analysis Build spark cluster LindsayJ
Lindsay ETL Write scripts

LindsayM

Redundant Data

Ann comes off the bench to do all the ETL work. Let's update the database.

Specialist Project Task Login
Ann ETL Write scripts AnnB
Lindsay Analysis Build spark cluster LindsayJ
Ann ETL Write scripts AnnB

Missing Data

Where was Ann? How do we make her visible when she's on the bench?

Specialist Project Task Login
Julia ETL Write scripts JuliaH
Lindsay Analysis Build spark cluster LindsayJ
Lindsay ETL Write scripts LindsayJ
Ann NONE Placeholder AnnB

Boyce-Codd Normal Form Defined

We can define BCNF as:

All functional dependencies have superkey determinants.

Superkey: a unique combination of attributes, but not necessarily the most parsimonious unique combination of attributes. If (A1, A2) is a key, (A1) is a subkey and (A1, A2, A3) is a superkey.

Functional Dependencies

Suppose we have some attribute Z.

The value of Z depends on inputs X and Y:

X Y Z
1 4 7
3 5 2
1 4 7
6 8 7

Functional Dependencies

When we see X=1 and Y=4, we know Z=7.

In other words, X and Y determine Z, or (X, Y) is the determinant for Z.

In mathematical terms, we can write this as (X, Y) -> Z.

Boyce-Codd Normal Form Reviewed

All functional dependencies have superkey determinants.
  • Superkey: key and maybe more.
  • Functional dependency: strict relationship between attributes.
  • Determinant: "input(s)" for the functional dependency.

Relvars in BCNF are guaranteed to suffer from none of the anomalies we just described!

Finding Determinants

"Deductive" techniques:

  • Look for common prefixes, such as UserID and UserName
  • Look at what "belongs" to an object, e.g., a password on a User relvar versus a password on a UserLoginAttempt relvar
  • Look for equivalences and mathematical formulas, e.g., SubTotal, Tax, Total
  • Discuss with product owners and stakeholders what things belong together

Finding Determinants

"Inductive" techniques:

  • Look for common patterns in the data
  • Count of distinct values for attribute combinations
  • Build a correlation matrix for combinations of attributes

The Solution, Step 1

Start with a relvar whose heading has all attributes. Figure out the candidate key(s) and functional dependencies for this relvar.

CampaignID CName, COName, AgencyID, MaxBudget, MinBid
AgencyID AName, TakeRate
AudienceTargetID TAgeRng, TGenPref
CampaignID, AudienceTargetID AudienceBidModifier

The Solution, Step 2

Next, ask this question: are all determinants candidate keys? The answer is obviously no. Only the final determinant is.

Determinant Dependency
CampaignID CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid
AgencyID AgencyName, TakeRate
AudienceTargetID TargetAgeRange, TargetGenrePreference
CampaignID, AudienceTargetID AudienceBidModifier

Intermission: Heath's Theorem

Let's introduce Heath's theorem here:

Given a relvar R with components { X, Y, Z }, as well as functional dependency X --> Y, we can break R into two relvars, containing { X, Y } and { X, Z }, without loss.

The Solution, Step 3

If the answer to Step 2 was no, take a non-candidate key determinant and break it out into its own relvar, following Heath's theorem.

The Solution, Again

Repeat steps 2 and 3 as long as there are more functional dependencies which are not candidate keys. Let's take CampaignID next.

The Solution, Again Again

Our remaining functional dependencies are:

Determinant Dependency
AudienceTargetID TargetAgeRange, TargetGenrePreference
CampaignID, AudienceTargetID AudienceBidModifier

This is not good enough! AudienceTargetID is a subkey of the candidate key, not a superkey! Therefore, break it out.

The End Result

Here is our final data model, in BCNF. Also, I took the opportunity to rename the DM Campaign relvar to CampaignTarget, which better represents its meaning.

Where BCNF Breaks Down

Boyce-Codd Normal Form is great for most scenarios, but it does fall apart in one particular case: when there are cycles in functional dependencies.

Suppose we have a relvar with heading { S, J, T }. These attributes represent Students, Subjects, and Teachers, respectively.

Given a functional dependency T -> J, we can break this out into { S, T } and { T, J }

Where BCNF Breaks Down

Relvar with heading { S, J, T }

Functional dependency T -> J

Now if we include a functional dependency S, J -> T, it all falls apart.

{ S, T } + { T, J } doesn't work--T also depends on { S, J }!

But T is not a superkey of { S, J, T } so all keeping all three won't work either.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Fifth Normal Form (5NF)

Fifth Normal Form is the second most-important normal form. What BCNF is to functional dependencies, 5NF is to join dependencies

Relvar R is in 5NF if and only if every join dependency in R is implied in the keys of R.

Join Dependency: if X1, X2, ..., Xn are the subsets of heading H on relvar R, a join dependency holds in R if and only if R can be non-loss decomposed into its projections on X1, X2, ..., Xn.

A Quick Example

Suppose we have a relvar called Suppliers with FDs { SupplierID -> SupplierName, City } && { City -> Status }.

SupplierID SupplierName City Status
1 S1 Chicago 10
2 S2 Lafayette 20
3 S3 Chicago 10
4 S4 Lexington 10
5 S5 Lafyette 20

A Quick Example

A join dependency says we can break this relvar into at two components and reconstitute it without loss: S = { SupplierID, SupplierName, City }; C = { City, Status }

SupplierID SupplierName City
1 S1 Chicago
2 S2 Lafayette
3 S3 Chicago
4 S4 Lexington
5 S5 Lafyette
City Status
Chicago 10
Lafayette 20
Lexington 10

In Other Words...

Functional dependencies are join dependencies.

But not all join dependencies are functional dependencies!

BCNF and 5NF

A relvar R is in 5th Normal Form if it is in Boyce-Codd Normal Form and the relvar has no composite keys.

A relvar may be in BCNF but not 5NF if there is a composite key and there are non-functional join dependencies which are not key-based.

The Connection Trap

Suppose we have the following rules:

  • IF an Agent (A) sells a Product (P)...
  • AND (A) represents a Company (C)...
  • AND (C) offers (P)...
  • THEN (A) sells (P) for (C)

This is also known as a symmetric constraint. If this does not apply, then we fall into the connection trap.

No Connection Trap

In this setup, if all three attributes are necessary to determine which combinations are valid and which are not, then we are in 5NF.

Agent Company Product
Smith Ford Car
Smith GM Truck

No Connection Trap

Agent Company Product
Smith Ford Car
Smith Ford Truck
Smith GM Car
Jones Ford Car

Now, we see two agents (Smith and Jones) who sell various products for Ford and GM. Ford manufactures cars and trucks, whereas GM only manufactures cars.

If we have a rule in which company salesmen sell all company products, then we can decompose this entity further.
We then have a JD: *{ {A, C}, {C, P}, {A, P} }

No Connection Trap

Agent Company
Smith Ford
Smith GM
Jones Ford
Agent Product
Smith Car
Smith Truck
Jones Car
Company Product
Ford Car
Ford Truck
GM Car

As long as our symmetric constraint holds, everything is fine. But what happens if GM starts to sell trucks?

No Connection Trap

Agent Company
Smith Ford
Smith GM
Jones Ford
Agent Product
Smith Car
Smith Truck
Jones Car
Company Product
Ford Car
Ford Truck
GM Car
GM Truck

Now this means that Smith must begin to sell GM trucks.

The Connection Trap in Action

Now let's say that there is no such symmetric constraint. We start with similar data, though note that Jones sells GM trucks instead of cars.

Agent Company Product
Smith Ford Car
Smith Ford Truck
Smith GM Car
Jones GM Truck

The Connection Trap in Action

Agent Company
Smith Ford
Smith GM
Jones GM
Agent Product
Smith Car
Smith Truck
Jones Truck
Company Product
Ford Car
Ford Truck
GM Car
GM Truck

We've broken this out. But is it correct?

Nope! This says that Smith sells GM trucks, but that's not true!

Why Bother?

If 5NF seems like overkill, here are a few reasons why it can be important.

  • With no composite keys, BCNF is 5NF, so you're already good.
  • Relvars not in 5NF can suffer from deletion anomalies and insertion anomalies.
  • Symmetric constraints aren't extremely common, but there is a common variant which you'll want to think about.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Fourth Normal Form (4NF)

Fourth Normal Form is an anachronism--it's really just a subset of 5NF.

Relvar R is in 4NF if and only if every Multi-Valued Dependency of R is implied in the keys of R.

Multi-Valued Dependency: a join dependency with exactly two components.

Multi-Valued Dependencies

Suppose the following:

  • Course CNO can be taught by teacher TNO using textbook XNO
  • Each course c has a set T of teachers who can teach it and a set X of textbooks it uses
  • For each course c, there is a tuple for every combination of teacher t and textbook x

Multi-Valued Dependencies

CNO TNO XNO
C1 T1 X1
C1 T1 X2
C1 T2 X1
C1 T2 X2

*{ {CNO, TNO}, {CNO, XNO} }

CNO ->-> TNO && CNO ->-> XNO, or CNO ->-> TNO|XNO

Fagin's Theorem

Relvar R can be nonloss decomposed into its components XY and XZ if and only if X->->Y|Z holds in R.

In other words, if there are multiple, independent attributes, split them out into separate relvars.

Solving for 4NF

Suppose Mr. Smith has two skills: he can cook and he can type.

Furthermore, he is fluent in German, French, and Greek.

How do we represent this information?

Bad Solution #1: Disjoint Format

Employee Skill Language
Smith Cook
Smith Type
Smith French
Smith German
Smith Greek

Problems:

  • What do blanks actually mean?
  • What if Mr. Smith didn't know any languages?

Bad Solution #2: Minimal (with NULL)

Employee Skill Language
Smith Cook French
Smith Type German
Smith Greek

Problems:

  • Can Mr. Smith cook French cuisine and type in German?
  • What can Mr. Smith do in Greek?
  • What if Mr. Smith forgets German?

Bad Solution #3: Minimal (with repetition)

Employee Skill Language
Smith Cook French
Smith Type German
Smith Type Greek

Problems:

  • Can Mr. Smith cook French cuisine and type in German and Greek?
  • What if Mr. Smith forgets German?
  • What if Mr. Smith forgets how to type?
  • What if Mr. Smith forgets French?

Bad Solution #4: Unrestricted

Employee Skill Language
Smith Cook French
Smith Type
Smith German
Smith Type Greek

Problems:

  • How do we even understand this data?
  • We have NULL values, repetition, and confusing combinations all at the same time

Good Solution

Employee Skill
Smith Cook
Smith Type
Employee Language
Smith French
Smith German
Smith Greek
  • { E, S, L } was good enough for BCNF ({ E, S, L } was the candidate key) but not good enough for us.
  • The join dependency *{ {E, S}, {E, L} } needed to be broken out.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Sixth Normal Form

Relvar R is in Sixth Normal Form (6NF) if and only if all join dependencies which hold in R are trivial.

In other words, there may be one non-key attribute in addition to the key.

An Example

Suppose we have a relvar { ProductKey, ProductName, ProductColor, ProductWeight, StorageCity }. Furthermore, our natural key is ProductKey. This relvar is in 5NF (BCNF + non-composite key) and can be made 6NF in the following way:

{ ProductKey, ProductName }

{ ProductKey, ProductColor }

{ ProductKey, ProductWeight }

{ ProductKey, StorageCity }

An Example

Note that each of these new relvars has a simple predicate (natural-language depiction of a relvar's utility) with no "AND":

Relvar Predicate
{PK,ProductName} Product PK has name ProductName.
{PK,ProductColor} Product PK has color ProductColor.
{PK,ProductWeight} Product PK has weight ProductWeight.
{PK,StorageCity} Product PK is stored in city StorageCity.

Do We Need 6NF?

Maybe!

  • If an attribute always has a value, it's better to cluster these attributes together.
  • If an attribute may be missing, it might be better to break it out into 6NF (to avoid NULL).

Suppose we might not always have product weight. Then we'd have:

{ ProductKey, ProductName, ProductColor, StorageCity }

{ ProductKey, ProductWeight }

Do We Need 6NF?

6NF works best with intervals of data. Usually this is temporal data. Example:

SupplierKey City Interval
Supplier1 London D4:D8
Supplier2 Paris D3:D3
Supplier2 Athens D5:D10

Do We Need 6NF?

Suppose we have an SCD2 style table:

SupplierKey City Assistants Interval
Supplier1 London 4 D1:D8
Supplier1 London 5 D9:D14
Supplier1 Paris 5 D15:D17
Supplier2 Athens 3 D1:D17
Supplier2 Athens 4 D18:D18

Do We Need 6NF?

Break it into 2 non-redundant relvars:

SupplierKey City Interval
Supplier1 London D1:D14
Supplier1 Paris D15:D17
Supplier2 Athens D1:D18
SupplierKey Assistants Interval
Supplier1 4 D1:D8
Supplier1 5 D9:D17
Supplier2 3 D1:D17
Supplier2 4 D18:D18

To bring these back together, unpack on Interval, join together elements, and re-pack on Interval. This is a gaps & islands problem!

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Normalization and Data Warehousing

Relational theorists like Date tend not to like the Kimball model very much. As a general-purpose relational model, the star schema is terrible.

It is, however, a great model for data warehousing if you follow certain rules:

  • One and only one process to write data
  • The model is a proper model of business processes and understanding
  • Facts represent measurements of things
  • Dimensions represent further information about things
  • The appropriate slowly-changing dimension type is used for each dimension

Normalization and Overnormalization

Overnormalization isn't really a thing. What people tend to see as overnormalization tend to be one of the following:

  • Trying to remove legitimate business flexibility
  • Lossy decomposition and phantom rows

Normalization and Denormalization

Normalization has very specific rules: you can tell where you are and what you'd need to do to get to the next level. With denormalization, how can we answer the following questions?

  • Denormalize to what extent?
  • Which method(s) should I use to denormalize these relvars?
  • Is my data model sufficiently denormalized?
  • Has my data model been denormalized correctly?

Normalization and Denormalization

General philosophy:

  • Denormalizing to increase redundancy can be okay when needed. Ex: data warehousing.
  • Denormalization can add invalid design ideas back in--avoid those at all costs.

Normalization and Its Discontents

Normalization is not the end-all of relational design.

  • Join dependencies and functional dependencies are not the only kinds of constraints, and normalization will not help with these others
  • Multiple non-loss decompositions into 5NF may be possible, but there is no formal guidance on which to choose
  • Applying academic advice to practical circumstances is not always a direct mapping
  • Normalization is explicitly about the logical model, not the physical model

Wrapping Up

This has been a look at normalization from an academic perspective. This can be a complicated topic and I've tried to straddle the line between formal correctness and ease of explanation.

Wrapping Up

To learn more, go here:
https://CSmore.info/on/dbdesign


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact