A beast to be reckoned with, so this is how I slayed it.
Normalization, a very basic concept for databases, but something that is VITAL and very necessary in order to main a structured, organized, and meaningful database. Normalization is used as a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema. Database normalization can be remembered with this quote:
Within any given table, every column is dependant on the key, the whole key, and nothing but the key, so help me Codd.
Truthfully, I had wanted to do a write up on normalization. But as I was doing my research in this “stream” of consciousness, I realized that there was a beautiful reddit user that had answered it as best as I could. Normalization, by no means, is necessary or required to continue with database infrastructure and design, but by hell does it make it a lot more clean, organized and less prone to breaking errors. In fact, there are classes entirely based around “Database Design” on finding the most effective way of creating a schema, and its different necessary tables that not only prevent redundancies but allow each other to stay updated without manual entries.
I do believe there does have to have a basic introduction to the first point, on a key. We have a Primary Key, and kind of going off of its name, it is the most important column in a table because it is what makes each entry or row, unique. It’s what can be used to differentiate between (1, John, Smith) and (37, John, Smith). The names are the same, but they are differentiated by the primary key, which is typically a unique value, but most commonly an integer in data entry.
Cannot be null and has to be unique. The value cannot change. If you want to delete a table, you have to delete the table first with the foreign key referencing the primary key, otherwise it will give you an error about dependency in tables.
Normalization is a process of decomposing relations (tables). The decomposition operator is “projection”–creating two or more tables from the original. Going from a lower normal form to a higher one always involves projecting two or more tables from one. In order to do so, it’s useful to use primary keys that can help make each row of data unique, but another important key is the Foreign Key. Confusing at first, it’s important to have them link data and rows between tables. The foreign key of one table, can actually be used as the primary key of another table, however that can be frowned upon depending on the situation, but as long as the “now primary key” is still unique in its own table, it’s all gucci. Vice versa is alright as well, the primary key of a table can be used as a foreign key in another table, since it has no need to be unique in its own table.
Foreign key references primary key, must be the same value as primary key but can have another name, as long as relationship is established. Ensures corresponding rows in other tables, do not have to be unique, but can be null even though primary keys can not.
Sometimes there are tables that are broken down in multiple tables to prevent data redundancy and clutter, but for data relating to the same person, it’s important to have that “binding” value that can be used to update the information and what not. In the case of John Smith again, it can be used to differentiate between employee information (1, John, Smith, Employee ID, Address, City, State, Zip, Code) to another table regarding employee salaries such as, (36, John, Smith, SAME EMPLOYEE ID, Salary). Although the primary key is different in both cases, remember they don’t actually have to be an integer, just a uniquely identifying value, the employee ID is able to connect both tables and both John Smith’s together.
Another thing to consider is that the primary key does not actually have to be just one field, in many relational databases, you can link 2 or more columns together to be a primary key. In many of the smaller databases, often times, full name and address are typically more than enough to be the primary key. As a result, as databases get bigger and bigger, you’ll need to combine more fields together as a primary key, of course it depends on the usage. In most U.S. government related services or resources, we actually have our Social Security Number, which allows a single value primary key luckily despite the population size.
For a web service, you’ll likely use either usernames or email address. I believe Square identifies me by my credit card number- any time I swipe my card, it automatically knows which email address to send the receipt to.
To get to the meat of this post however, normalization.
As I mentioned normalization is used with a combination of tables and keys to ensure that each table is meaningful, unique, and does not contain redundant data. The idea of normalization was thought up in order to bring efficiency into the table without the need to parse through MILLIONS of rows of data, and instead maybe parse through 4-5 tables of just thousands of rows of data each. To get any table to 3NF, you start at step number 1, and go through each step until you finish step 4. For me, 2nd and 3rd Normal Forms are rather interelated because of their purpose. 2nd normal form will essentially reduce the same values in a table into another table, and relate them with a foreign key. But 3rd normal form is when you are able to get data from a second table, by using the foreign key between the first and second table.
Without normalization, it can lead to needless data duplication and then modification anomalies in slight changes, that would then require manual entries into each of the fields.
Insert Anomaly – Unable to insert without providing a primary key.
Update Anomaly – Same information in multiple rows, if changes occur, then multiple updates have to be made, and if not done, inconsistency can occur.
Deletion Anomaly – Deleting an entry can delete the rest of the information that was linked with him, but not necessarily linked for him.
1st Normal Form – Eliminates repeating groups, separate tables, and the use of a primary key. An example would be a company that has multiple vendors for the same item. The vendor code could be the primary key, instead of the inventory item because that could have multiple vendors and thus would not be a good primary key. The item could then be linked to each vendor code. Although there may be redundant information if a vendor code supplies many products to a company, that is what first normal form encapsulates. The columns are unique, but the rows may not be.
2nd Normal Form – Seeks to get rid of the redundancies created with multiple items from a single vendor. It is not important to split the products into their own table, and use the primary key of the first table (vendor code) as the foreign key (vendor code) in the second table. So the first table can contain the vendor code and maybe the name. While the second table can then contain each of the item products with the vendor code as its foreign key. Removes subsets of data that can apply to multiple rows, into their own separate table. However, still important to use a foreign key to link relationships.
3rd Normal Form – Typically done along with 2nd Normal Form, this is now done to clean up the tables and tries to avoid mentioning the foreign key repeatedly as well as the item products if they come from various vendors. Ensure that the tables have a primary key, and none of the columns nor rows depend on any other keys in that table. Primarily meant to remove any ambiguity in determining the primary and foreign keys. As a result, each non-key element in each table, provides information about the key itself. As a result, you can now add another unique primary key (double whammy) to distinguish each item product from the vendor now, instead of having the product have multiple vendor codes (foreign key).As a result, both tables will have a primary key, but the second table, the item products will have not only its own primary key, but also use the vendor code primary key as its foreign key.
All non-primary-key fields which are not completely dependent on the primary key, should be placed in a separate table with a dependent primary key.
Although there are more levels of normalization, typically 3rd normal form is enough to create unique enough tables. Another table could be added to specifically link the vendor and item product keys to determine specifically which item is provided by each vendor.
** Something to consider is denormalization and when it should be used. Only used in rare instances such as in data warehousing, data mining and in high traffic websites. The reason for that is fast transactions and processes. Data will repeat in more than one table in denormalized databases which allows querying faster. Aside from that, it allows data mining algorithms to run against the data very quickly as opposed to using JOINS as an additional layer. As a result, it can help speed up calculations and make them able to pull data from other tables meaningfully.