Designing Normalized SQL Tables

SQL databases are reliable for complex queries, partial updates, transactions, and decoupling data modeling from application specific contexts. In this post we’ll cover how to normalize tables in 1NF, 2NF, and 3NF.

NFL_63650.jpg

We have a table.

Name Position Team Jersey No. Home Stadium Stadium City Salary & Contract
Aaron Rodgers QB Packers 12 Lambeau Field Green Bay $22M, 5 years
J.J. Watt DL Texans 99 NRG Stadium Houston $16.7M, 6 years.
Earl Thomas DB Seahawks 29 Century Link Field Seattle $10M, 4 years
Andrew Luck QB Colts 12 Lucas Oil Stadium Indianapolis $5.5M, 4 years

Some issues off the cuff, right? What if I wanted to add a new team but had no player? What if a player plays multiple positions? Or two teams share the same home stadium? All these deal with data redundancy. And what’s the primary key, anyways?


 1NF

1NF defines the shape of a table.

Concepts

  1. Unique rows - A required primary key makes each row unique
  2. Atomacity - In every row, each column has a single value. Data in each cell won’t be decomposed further[1]

Benefits

Atomacity provides general purpose queries where selects and updates occur to single values. It contrasts with queries that need to parse, substring, and concatenate values, leaving room for data integrity errors.

Let’s break Salary and Contract into their own columns.

Name Position Team Jersey No. Stadium Stadium City Salary Contract (yrs)
Aaron Rodgers QB Packers 12 Lambeau Field Green Bay $22M 5
J.J. Watt DL Texans 99 NRG Stadium Houston $16.7M 6
Earl Thomas DB Seahawks 29 Century Link Field Seattle $10M 4
Andrew Luck QB Colts 12 Lucas Oil Stadium Indianapolis $5.5M 4

In our table we could have many players with many same values– same name, same position, same team, or others. So the minimal fields to identify a unique row is:

{Team}, {Jersey_No}

We’re in 1NF.


 2NF

Both 2NF and 3NF deal with functional dependencies between key and non-key fields.

Concepts

A field Y is “functionally dependent” on field(s) X if X‘s value always produces the same Y value[2]. One example is how Stadium City functionally depends on Home_Stadium, also said Home_Stadium determines Stadium City. Lambeau Field is always in Green Bay, WI.

2NF is met when a table is in 1NF and all non-key fields depend on the entire key-ed fields. Tables with only one key field are automatically in 2NF[2].

Benefits

2NF minimizes data redundancy. Recall our primary key from 1NF,

{Team}, {Jersey_No}

Non-key fields Name,Position, Salary and Contract depend on the entire primary key, but Home_Stadium and Stadium_City only depend on Team! Every time we insert a new row where Team is the same, Home_Stadium and Stadium_City become re-inserted as duplicates. So say we insert Marshawn Lynch, a second Seahawk.

Name Position Team Jersey No. Home Stadium Stadium City Salary Contract (yrs)
Earl Thomas DB Seahawks 29 CenturyLink Field Seattle $10M 4
Marshawn Lynch RB Seahawks 24 CenturyLink Field Seattle $12M 2

Aha! CenturyLink Field and Seattle are duplicate data.

More abstractly, fields relying on a subset of the primary key produce redundancy in each row where the subset is the same and the non-subset is different.

To acheive 2NF we move these into tables where non-key fields depend on the entire key field.

Team

Team Home Stadium Stadium_City
Packers Lambeau Field Green Bay
Texans NRG Stadium Houston
Seahawks Century Link Field Seattle
Colts Lucas Oil Stadium Indianapolis

primary key: {Team}

Roster

Team Jersey_No Position Name Salary Contract
Packers 12 QB Aaron Rodgers $22M 5
Texans 29 DL J.J. Watt $16.7M 6
Seahawks 99 DB Earl Thomas $10M 4
Colts 12 QB Andrew Luck $5.5M 4
Seahawks 24 RB Marshawn Lynch $12M 24

composite primary key: {Team}, {Jersey_No}

Better. Level up.


 3NF

Concept

3NF furthers the principle that tables serve a single purpose. It prevents a non-key field C from being functionally dependent on another non-key field B, making C have transitive-dependence on primary key A.

  1. In our Roster table Name determines Salary and Contract making them transitive dependencies on primary keys Team and Position.
  2. Also, in the Teams table Home_Stadium determines Stadium_City making it transitively dependent on Team.

Benefits

Non-transitive dependencies minimize data redundancy. 3NF is achieved by moving all non-key to non-key functional dependencies to their own table[2].

Our four normalized tables.

Roster

Team Jersey_No Position Name
Packers 12 QB Aaron Rodgers
Texans 29 DL J.J. Watt
Seahawks 99 DB Earl Thomas
Colts 12 QB Andrew Luck
Seahawks 24 RB Marshawn Lynch

composite primary key: {Team} {Jersey_No}

Player

Name Salary Contract (yrs)
Aaron Rodgers $22M 5
J.J. Watt $16.7M 6
Earl Thomas $10M 4
Andrew Luck $5.5M 4

primary key: {Name}

Team

Team Stadium
Packers Lambeau Field
Texans NRG Stadium
Seahawks Century Link Field
Colts Lucas Oil Stadium

primary key: {Team}

Stadium

Stadium Stadium_City
Lambeau Field Green Bay
NRG Stadium Houston
Century Link Field Seattle
Lucas Oil Stadium Indianapolis

primary key: {Stadium}

We can now add more teams, players, positions and stadiums with much less redundancy.

 What about Serial PKs?

Serial primary keys prevent collisions and allow for more variable data as they remain unchanged. But they offer no value in the logical data model and organization, so they shouldn’t act as primary keys when normalizing tables. First create normalized tables, then add Serial PKs where necessary.



In this post we covered steps to create tables free of modification anomalies, minimizing the likelihood of redesign. It’s important to note these are guidelines and at times it’s worth the engineering trade off of less tables to improve performance at the expense of some data redundancy[4].

Normalized SQL tables have primary keys, atomic values (1NF), non-key fields that depend on the entire key fields (2NF), and with non-keys that only depend on key fields (3NF). Our NFL example used small tables but think of how much data sites like ESPN holds. If you have questions or comments, please reach out on
LinkedIn
.


 References

[1] Date, C. (2006). Date on database: Writings 2000-2006. Berkeley, CA: Apress, Chapter 8.

[2] William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb. 1983, 120-125.

[3] Date, C. J. (2000), An Introduction to Database Systems (8th ed.). Addison-Wesley Longman

[4] Obasanjo, D., When Not To Normalize Your SQL Database. http://www.25hoursaday.com

 
22
Kudos
 
22
Kudos

Now read this

Rediscovering MVC and How to Write without a Framework

If you’ve paid much attention to front-end development in the last few years you’ve heard about Angular, Backbone, Ember, and other JavaScript MV* frameworks. They offer structure, bundled APIs and streamlined approaches to complex UIs,... Continue →