The Dimensional Data Model! Starting from the ground up!
So let’s take some time to analyze one of the most common methods in data warehousing, using the dimensional data model as its foundation!
In most relational databases, the primary concept is to relate a row of information and values to a column of metrics, a 1 to 1 relationship. However, in data warehousing, those kinds of tables can make querying very time and resource intensive, so data warehouses were formed to help with quick lookups as well as a more comprehensive search for the necessary and related information. With a dimensional database, it essentially acts as a repository of row values with its associated column identifier, so within the column dimension, there can be a variety of column attributes. So the column dimension could be “Months”, but each of the attributes could be the 12 months. After that, the column hierarchy could be the days within the months. So all in just one column, could be time/months/day, which doesn’t fit the 3rd Normal Form structure in relational databases where it’s more of a 1to1 relationship.
Each dimensional database also has a lookup table, which allows the user to query specific and detailed information about the attributes that are being used. Essentially used as a filter, dimensional databases use lookup/fact tables to help narrow down the information being searched, however they are “separate” tables in that they would be like a table just for foreign keys to help us identify which values would fall under which attribute. Now, let’s take a bit more of a look into the schema types used in dimensional databases, star and snowflake schemas.
Starting with the star schema, we can see that the fact table sits in the middle, and can thus provide a table of relationships on each of the different dimensions and its attributes. It is connected to the other surrounding objects like a star, where each “point” is represented as a single table. The primary key in each dimension, will be foreign keys in the fact table. Each value in the fact table is related to the dimensions that they are referencing. The image depicts a simple star schema, however if there are more than one fact tables, then it would be a “complex star schema” instead. Interestingly enough, there are no lines in between the dimensions meaning that they are not related to one another, not even through the use of a primary key/foreign key pair.
A foreign key has to be a primary key, but a primary key does not have to be a foreign key.
An extension of the star schema, each points now has their own related tables as well. Before in the star schema, each dimension is considered a single dimensional table, however, in the snowflake schema, the dimensional table becomes normalized into many lookup tables for that “branch”. One of the main improvements is query performance due to compressing more information together and joining smaller lookup tables. However, the problem is the increased maintenance due to more lookup tables in these “branches”.
How To Determine The Fact Table!
Now here is a little curious case, what has to go into designing the fact table itself? Obviously, we need to determine the kind of information that will be stored, so it’s important to consider the dimensions included, and how far into the hierarchy should the information be kept?
For the most part, the organization C-Men will usually dictate what dimensions will be included into the fact table. An example would be the financial department of a sports store wanting the products, costs, customer etc. Another would be a “rewards program” card where they will want the customer ID, records, personal information, and any other loyalty information, which can be used to help calculate discounts etc.
However, the hard part is determining the level/hierarchy of the information that should be stored. This is typically figured out by the user requirement and asking the users what information they would most likely need. In terms of the sports store example, how often do they need to pull the information? Weekly for a monthly report? Daily for a weekly report? Hourly? As a result, the lowest level of information needed, is considered the most “granular” in the time dimension. Trying to figure it out during implementation can make the process much more easier instead of having to redesign it after the organization and data warehouse has already been established and implemented.
3 Types of Facts:
- Additive – Facts that can be summed up through all of the dimensions in the fact table
- Semi-Additive – Can only be summed up in a few of the dimensions on the fact table, but not all of them
- Non-Additive – Facts that can not be summed up.
2 Types of Fact Tables
- Cumulative – Tables where the facts can be added up, typically additive facts.
- Snapshot – Provides values at a particular instance of time, usually semi-additive or non-additive facts.
How To Account For Changes In Dimension? What Data Model?
Say we have Joe Schmoe and their zip code is 60616 at the first entry, but the time to record the second entry, it is found that they are now residing in 91775. There are three possible ways to account for this discrepancy, simply replace the original record. Create a second record, meaning there are now two Joe Schmoe’s, or to simply modify the record to reflect the change. In practice though, it was found that only the first two options are really done and implemented. Regarding the replacement method, this is typically done when the data warehouse does not have to keep a track of historical changes. With the additional entry method, it is used when the data warehouse has to be keeping track of the historical changes, however it can greatly increase the size of the table, and then can complicate the ETL process. In the dimensional database, once these changes have been implemented, let’s take a step back and ask, how should we even model the data?
There are three different kinds of data modeling schemes, conceptual, logical, and physical. Here is an image that depicts the differences.
So based off of this table, it seems that there is a logical progression. With the conceptual model, it’s essentially a proof of concept and a quick sketch in determining the database tables and architecture. There are not specific entities and relationships, no attributes or even a primary key is picked. The conceptual model is typically the first model to be implemented or planned, as it is more aptly used. With the logical model, it now adds some entities and relationships as well as establishes a primary key. It allows the user or DBA to see what kinds of values and metrics are being measured. Start by specifying primary keys, find the relationships and attributes, and then normalize with a foreign key! With a logical model, there are relationships established that the conceptual model did not have. Lastly, the physical model takes it one step farther by changing the entity names and relationships and actually converting them into tables and columns, as well as foreign keys established!
Now that’s that! Enough of this already? Well me toooo! Good night! And check back in a couple of days for the next topic!