Perfectly Normal

An Advanced Primer on Normalization

Kevin Feasel (@feaselkl)
http://CSmore.info/on/dbdesign

Who Am I? What Am I Doing Here?

Is This Talk Right for Me?

Have you ever believed any of the following?

  • "Normalization isn't relevant in the business world."
  • "Normalize until it hurts, denormalize until it works."
  • "The main purpose of normalization is to minimize disk utilization by removing duplicate data."
  • "Database design doesn't matter--add more hardware and that will fix the problem."
  • "Getting to Third Normal Form is all you need to know about normalization."
  • "Third Normal Form is important."

Motivation

My goals in this talk:

  • Explain the basics behind database normalization.
  • Dive into the two most important normal forms.
  • Briefly cover other interesting normal forms.
  • Cover critiques of normalization.

A Brief Warning

Normalization is a part of relational theory, meaning that there will be math and there will be discussions of formal logic.

I will do my best to minimize the learning curve, but we're ultimately solving math problems using logical analysis.

Also, some of these normal forms are not well-understood outside of academic circles and so your interpretation of the forms may not match what I'm going to describe.

The Source

The source of most of today's talk is C.J. Date's Database Design and Relational Theory, 2nd Edition (Apress, 2019).

The Scenario

Our scenario is a music studio, responsible for managing artists, cutting albums, marketing, and distribution.

This studio is working on a centralized database system but their application developers don't have much experience and would like our advice on how to do things the right way.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

What is Normalization?

Normalization is a set of logical rules and processes to follow.

Normalization works off of a logical data model. From there, we can (theoretically) adapt to a physical data model for implementation.

Quick Primer on Terminology

We're used to thinking in terms of tables, rows, and columns. Let's take a step back and talk about the data modeling terms.

Goals of Normalization

  • Remove redundant data
  • Fix incorrect designs
  • Achieve a design which is a good representation of the real world
  • Simplify the statement and enforcement of certain integrity constraints

The Normal Forms

Normal Forms Ranked by Importance

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

What 1NF Is

First normal form (1NF) is all about key rules and the shape of tuples.

Given relvar R with heading H containing attributes A1...An of types T1...Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

In Other Words

Artist (CHAR) Album (CHAR) ReleaseYear (INT)
Tsunami Bomb The Ultimate Escape 2002
Tsunami Bomb The Definitive Act 2004
Tsunami Bomb Trust No One 2016

Adding a new tuple t(Tsunami Bomb, The Spine that Binds, 2019) requires that the new tuple's heading, with attributes and types, matches the existing relvar's heading.

The Rules on 1NF

First normal form (1NF) has a few rules:

  • New tuples match the existing relvar's heading.
  • All tuples must have values for all attributes. This means NULL violates 1NF!
  • No duplicate tuples are allowed
  • Attribute order does not matter for tuples
  • Tuple order does not matter for relations
  • All attributes are regular: they have a name, a type, are not hidden, etc.

Quick Quiz -- Is This in 1NF?

Quick Quiz -- Is This in 1NF?

Answer: YES!

Repeated sets are NOT a violation of 1NF!

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

The Importance of BCNF

Boyce-Codd Normal Form is one of the two most important normal forms. It prevents a variety of data anomalies:

  • Update anomalies
  • Redundant data (some types)
  • Missing data

We Have Data

Date Artist Album Genre Studio Manager
2023-07-10 Kidneythieves A1 Industrial A Marie
2023-07-10 Metric A3 Indie A Marie
2023-07-10 Land of Talk A2 Indie B Jacques
2023-07-11 Kidneythieves A1 Industrial A Marie

Update Anomalies

Lindsay takes over Studio A. But we miss it in one entry.

Date Artist Album Genre Studio Manager
2023-07-10 Kidneythieves A1 Industrial A Lindsay
2023-07-10 Metric A3 Indie A Lindsay
2023-07-10 Land of Talk A2 Indie B Jacques
2023-07-11 Kidneythieves A1 Industrial A

Marie

Redundant Data

Kidneythieves had to cancel so we booked Metric for the entire day. Let's update the database.

Date Artist Album Genre Studio Manager
2023-07-10 Metric A3 Indie A Lindsay
2023-07-10 Metric A3 Indie A Lindsay
2023-07-10 Land of Talk A2 Indie B Jacques
2023-07-11 Kidneythieves A1 Industrial A Lindsay

Missing Data

Nobody's booked in Studio C. How do we indicate that it exists?

Date Artist Album Genre Studio Manager
2023-07-10 Metric A3 Indie A Lindsay
2023-07-10 Metric A3 Indie A Lindsay
2023-07-10 Land of Talk A2 Indie B Jacques
2023-07-11 Kidneythieves A1 Industrial A Lindsay
2023-07-10 NOBODY NONE N/A C Charles

Boyce-Codd Normal Form Defined

We can define BCNF as:

All functional dependencies have superkey determinants.

Superkey

Superkey: a unique combination of attributes, but not necessarily the most parsimonious unique combination of attributes.

If (RecordingDate, Artist) is a key, (RecordingDate) is a subkey and (RecordingDate, Artist, Studio) is a superkey.

Functional Dependencies

Suppose we have some attribute SoundEngineer.

The value of SoundEngineer depends on inputs Producer and Studio:

Studio Producer SoundEngineer
A Janice Sam
B Randi Edgar
A Janice Sam
C Tamika Sam

Functional Dependencies

When we see Studio=A and Producer=Janice, we know SoundEngineer=Sam.

In other words, Studio and Producer determine SoundEngineer, or (Studio, Producer) is the determinant for SoundEngineer.

In mathematical terms, we can write this as (Studio, Producer) -> SoundEngineer.

Functional Dependencies

Note that functional dependencies only go one way.

(Studio, Producer) -> SoundEngineer.

Studio=A && Producer=Janice -> SoundEngineer=Sam.

But Sam can be the sound engineer for other combinations of Studio + Producer. The only thing we guarantee is that (A, Janice) -> Sam (for all combinations of Studio*Producer and their relevant SoundEngineers).

Figuring out Functional Dependencies

NOTE: figuring out functional dependencies in practice may be difficult!

There are two primary techniques for figuring them out in practice.

  • Ask an expert -- discuss business rules and find out what they know.
  • Review existing data and see where we can find columns whose values always match up in this way. Functional dependencies must always hold; it can't just be a "sometimes" thing.

Boyce-Codd Normal Form Reviewed

All functional dependencies have superkey determinants.
  • Superkey: key and maybe more.
  • Functional dependency: strict relationship between attributes.
  • Determinant: "input(s)" for the functional dependency.

Relvars in BCNF are guaranteed to suffer from none of the anomalies we just described!

The Solution, Step 1

Start with a relvar whose heading has all attributes. Figure out the candidate key(s) and functional dependencies for this relvar.

CampaignID CName, COName, AgencyID, MaxBudget, MinBid
AgencyID AName, TakeRate
AudienceTargetID TAgeRng, TGenPref
CampaignID, AudienceTargetID AudienceBidModifier

The Solution, Step 2

Next, ask this question: are all determinants candidate keys? The answer is obviously no. Only the final determinant is.

Determinant Dependency
CampaignID CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid
AgencyID AgencyName, TakeRate
AudienceTargetID TargetAgeRange, TargetGenrePreference
CampaignID, AudienceTargetID AudienceBidModifier

Intermission: Heath's Theorem

Let's introduce Heath's theorem here:

Given a relvar R with components { X, Y, Z }, as well as functional dependency X --> Y, we can break R into two relvars, containing { X, Y } and { X, Z }, without loss.

The Solution, Step 3

If the answer to Step 2 was no, take a non-candidate key determinant and break it out into its own relvar, following Heath's theorem.

The Solution, Again

Repeat steps 2 and 3 as long as there are more functional dependencies which are not candidate keys. Let's take CampaignID next.

The Solution, Again Again

Our remaining functional dependencies are:

Determinant Dependency
AudienceTargetID TargetAgeRange, TargetGenrePreference
CampaignID, AudienceTargetID AudienceBidModifier

This is not good enough! AudienceTargetID is a subkey of the candidate key, not a superkey! Therefore, break it out.

The End Result

Here is our final data model, in BCNF. Also, I took the opportunity to rename the DM Campaign relvar to CampaignTarget, which better represents its meaning.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Fifth Normal Form (5NF)

Fifth Normal Form is the second most-important normal form. What BCNF is to functional dependencies, 5NF is to join dependencies

Relvar R is in 5NF if and only if every join dependency in R is implied in the keys of R.

Join Dependency: if X1, X2, ..., Xn are the subsets of heading H on relvar R, a join dependency holds in R if and only if R can be non-loss decomposed into its projections on X1, X2, ..., Xn.

A Quick Example

Suppose we have a relvar called Warehouse with FDs { WarehouseID -> GeneralManager, City } && { City -> PropertyTaxRate }.

WarehouseID GeneralManager City PropertyTaxRate
1 Steve Chicago 4.8%
2 Lewis Lafayette 2.1%
3 Suzanne Chicago 4.8%
4 Alexa Lexington 3.3%
5 Lee Lafyette 2.1%

A Quick Example

A join dependency says we can break this relvar into at two components and reconstitute it without loss: W = { WarehouseID, GeneralManager, City }; C = { City, PropertyTaxRate }

WarehouseID GeneralManager City
1 Steve Chicago
2 Lewis Lafayette
3 Suzanne Chicago
4 Alexa Lexington
5 Lee Lafyette
City Status
Chicago 4.8%
Lafayette 2.1%
Lexington 3.3%

Functional and Join Dependencies

Functional dependencies are join dependencies.

But not all join dependencies are functional dependencies!

BCNF and 5NF

A relvar R is in 5th Normal Form if it is in Boyce-Codd Normal Form and the relvar has no composite keys.

A relvar may be in BCNF but not 5NF if there is a composite key and there are non-functional join dependencies which are not key-based.

Another Example

Artist Album SalesCountry
Horrorpops Bring It On! United States
Horrorpops Bring It On! Denmark
Band 1 Unnamed Future Album Canada
Band 2 Unnamed Future Album Canada
Band 1 Unnamed Future Album United States
Band 2 Unnamed Future Album United States

Another Example

This relvar AlbumSales has a key of { Artist, Album, SalesCountry }. This means it is in BCNF, but if it feels weird to you, it should!

There are two join dependencies here! { Artist, Album } and { Album, SalesCountry } can be cleanly split out from the ternary relationship into two separate relvars.

Another Example

Artist Album
Horrorpops Bring It On!
Band 1 Unnamed Future Album
Band 2 Unnamed Future Album
Album SalesCountry
Bring It On! United States
Bring It On! Denmark
Unnamed Future Album Canada
Unnamed Future Album United States

The Connection Trap

Suppose we have the following rules:

  • IF a Store (S) stocks a Genre (G)...
  • AND (S) stocks the work of an Artist (A)...
  • AND (A) performs in (G)...
  • THEN (S) sells works in (G) from (A)

This is also known as a symmetric constraint. If this does not apply, then we fall into the connection trap.

No Connection Trap

In this setup, if all three attributes are necessary to determine which combinations are valid and which are not, then we are in 5NF.

Store Genre Artist
Music Everywhere Punk Rock Bad Religion
Music Everywhere Prog Rock Rush
The Music Factory Indie Metric

No Connection Trap

Store Genre Artist
M.E. Punk Bad Religion
M.E. Prog Rush
MFac Indie Metric

Now, we see two stores (Music Everywhere & The Music Factory) who sell albums by various artists in different genres. Bad Religion makes Punk Rock albums, Rush makes Prog Rock albums, and Metric makes Indie albums.

If we have a rule in which stores sell all artists in a genre, then we can decompose this entity further.
We then have a JD: *{ {S, G}, {G, A}, {S, A} }

No Connection Trap

Store Genre
M.E. Punk
M.E. Prog
MFac Indie
Genre Artist
Punk Bad Religion
Prog Rush
Indie Metric
Store Artist
M.E. Bad Religion
M.E. Rush
MFac Metric

As long as our symmetric constraint holds, everything is fine. But what happens if Bad Religion sells a Prog Rock album?

No Connection Trap

Store Genre
M.E. Punk
M.E. Prog
MFac Indie
Store Artist
M.E. Bad Religion
M.E. Rush
MFac Indie
Genre Artist
Indie Metric
Prog Rush
Punk Bad Religion
Prog Bad Religion

Now this means that Music Everywhere must sell Bad Religion's Prog Rock album.

The Connection Trap in Action

Now let's say that there is no such symmetric constraint. The only results we should see are the following:

Store Genre Artist
Music Everywhere Punk Rock Bad Religion
Music Everywhere Prog Rock Rush
The Music Factory Indie Metric

The Connection Trap in Action

Store Genre
M.E. Punk
M.E. Prog
MFac Indie
Store Artist
M.E. Bad Religion
M.E. Rush
MFac Indie
Genre Artist
Indie Metric
Prog Rush
Punk Bad Religion
Prog Bad Religion

We've broken this out. But is it correct?

Nope! This says that Music Everywhere will sell Bad Religion's Prog Rock album, but that's not true!

Why Bother?

If 5NF seems like overkill, here are a few reasons why it can be important.

  • With no composite keys, BCNF is 5NF, so you're already good.
  • Relvars not in 5NF can suffer from deletion anomalies and insertion anomalies.
  • Symmetric constraints aren't extremely common, but there is a common variant which you'll want to think about.

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Fourth Normal Form (4NF)

Fourth Normal Form is an anachronism--it's really just a subset of 5NF.

Relvar R is in 4NF if and only if every Multi-Valued Dependency of R is implied in the keys of R.

Multi-Valued Dependency: a join dependency with exactly two components.

Multi-Valued Dependencies

We've already seen an example of this: relvar AlbumSales has a key of { Artist, Album, SalesCountry }.

  • Artist R is on Album A, sold in SalesCountry S
  • Each album has a set of artists, and each album is sold in a set of countries
  • For each album, there is a tuple for every combination of artist sales country

Multi-Valued Dependencies

Artist Album SalesCountry
Horrorpops Bring It On! United States
Horrorpops Bring It On! Denmark
Band 1 Unnamed Future Album Canada
Band 2 Unnamed Future Album Canada
Band 1 Unnamed Future Album United States
Band 2 Unnamed Future Album United States

*{ {Album, Artist}, {Album, SalesCountry} }

Multi-Valued Dependencies

*{ {Album, Artist}, {Album, SalesCountry} }

The "4NF" diagram for this looks like:

Album ->-> Artist && Album ->-> SalesCountry, or Album ->-> Artist|SalesCountry

Fagin's Theorem

Relvar R can be nonloss decomposed into its components XY and XZ if and only if X->->Y|Z holds in R.

In other words, if there are multiple, independent attributes, split them out into separate relvars.

What this Prevents

  • Needing to have multiple "copies" of the same record--Unnamed Future Album need one row per artist * country, but what happens if we miss one of these?
  • Any update likely needs to update multiple rows
  • Any delete operation likely needs to delete multiple rows

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Sixth Normal Form

Relvar R is in Sixth Normal Form (6NF) if and only if all join dependencies which hold in R are trivial.

In other words, there may be one non-key attribute in addition to the key.

An Example

Suppose we have a relvar { AlbumID, AlbumName, Runtime, ReleaseYear, ArtistID, ArtistID2, GenreID, GenreID2, GenreID3 }. Furthermore, our natural key is AlbumID. This relvar is in 5NF (BCNF + non-composite key) and can be made 6NF in the following way:

An Example

{ AlbumID, AlbumName }

{ AlbumID, Runtime }

{ AlbumID, ReleaseYear }

...

{ AlbumID, GenreID2 }

{ AlbumID, GenreID3 }

Do We Need 6NF?

Maybe!

  • If an attribute always has a value, it's better to cluster these attributes together.
  • If an attribute may be missing, it might be better to break it out into 6NF (to avoid NULL).
  • If an attribute is part of a "nested set," we can unpivot these into one table.

Do We Need 6NF?

Suppose we might not always have runtime and we want to unpivot artists and genres (assuming there is no 'order' to artists or genres). Then we'd have:

{ AlbumID, AlbumName, ReleaseYear }

{ AlbumID, Runtime }

{ AlbumID, ArtistID }

{ AlbumID, GenreID }

6NF and Intervals

6NF also helps us work with intervals of data. Usually this is temporal data, like this example of a Tour relation:

HeadlinerArtist City Interval
Bad Religion London D4:D8
Horrorpops Paris D3:D3
Horrorpops Athens D5:D10

6NF and Intervals

Suppose we have an SCD2 style table:

HeadlinerArtist WarmupArtist City StartDate EndDate
Bad Religion Rad Beligion London D1 D8
Bad Religion The Deviates London D9 D14
Bad Religion The Deviates Paris D15 D17
Horrorpops The Chop Tops Athens D1 D17
Horrorpops Nekromantix Athens D18 D18

6NF and Intervals

Break it into 2 non-redundant relvars:

Headliner City Interval
BadReligion London D1:D14
BadReligion Paris D15:D17
Horrorpops Athens D1:D18
Headliner Warmup Interval
B.R. Rad Beligion D1:D8
B.R. The Deviates D9:D17
H. The Chop Tops D1:D17
H. Nekromantix D18:D18

To bring these back together, unpack on Interval, join together elements, and re-pack on Interval.

Visualizing the Interval

Visualizing the Interval

Unpacking and Re-Packing

Unpacking and Re-Packing

Unpacking and Re-Packing

This gets us back to our type-2 SCD, but without having to store denormalized data!

Agenda

  1. The Basics
  2. 1NF
  3. BCNF
  4. 5NF
  5. 4NF
  6. 6NF
  7. Normalizaton and...

Normalization and Data Warehousing

Relational theorists like Date tend not to like the Kimball model very much. As a general-purpose relational model, the star schema is terrible.

It is, however, a great model for data warehousing if you follow certain rules:

  • One and only one process to write data
  • The model is a proper model of business processes and understanding
  • Facts represent measurements of things
  • Dimensions represent further information about things
  • The appropriate slowly-changing dimension type is used for each dimension

Normalization and Overnormalization

Overnormalization isn't really a thing. What people tend to see as overnormalization tend to be one of the following:

  • Trying to remove legitimate business flexibility
  • Lossy decomposition and phantom rows

Normalization and Denormalization

Normalization has very specific rules: you can tell where you are and what you'd need to do to get to the next level. With denormalization, how can we answer the following questions?

  • Denormalize to what extent?
  • Which method(s) should I use to denormalize these relvars?
  • Is my data model sufficiently denormalized?
  • Has my data model been denormalized correctly?

Normalization and Denormalization

General philosophy:

  • Denormalizing to increase redundancy can be okay when needed. Ex: data warehousing.
  • Denormalization can add invalid design ideas back in--avoid those at all costs.

Normalization and Its Discontents

Normalization is not the end-all of relational design.

  • Join dependencies and functional dependencies are not the only kinds of constraints, and normalization will not help with these others
  • Multiple non-loss decompositions into 5NF may be possible, but there is no formal guidance on which to choose
  • Applying academic advice to practical circumstances is not always a direct mapping
  • Normalization is explicitly about the logical model, not the physical model

Wrapping Up

This has been a look at normalization from an academic perspective. This can be a complicated topic and I've tried to straddle the line between formal correctness and ease of explanation.

Wrapping Up

To learn more, go here:
https://CSmore.info/on/dbdesign


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact