Part 2 – Architecture Design!
Not the building kind! The ones for a data warehouse!
Reading through my books, there is already a huge emphasis on design and on how each data warehouse should be configured and maintained for different situations, organizations, and purposes. In a sense, each design depends on the specific needs of a business, which is why I think this will be a relatively short post today. I really don't think there's going to be much to discuss; I could list out case studies on data warehouses, but even I don't have the patience for that. I'm assuming everyone reading either already knows this or can't be bothered with the specifics, so here is the super long tl;dr on architecture design.
As you can see in the image, the typical relational databases are still there, but their data goes through a technical process called ETL (Extraction, Transformation, Loading) that reshapes it into the specific form each column's data values need. The data is sent first to a staging area, then to the data warehouse, and finally to predetermined data marts, each of which serves a specific purpose. The only "annoying" thing about the process is that the data is copied over at each stage, which is good in some cases, but of course it adds overhead time.

As a quick recap from last time: data warehouses were created because organizational data was fragmented across several systems and platforms in different locations. To deal with this, data was copied and made redundant, but think about how annoying it would be to contact everywhere and everyone every time there was an update that required local changes. Data quality deteriorates, integrity falls off the map, and that leads to poor business intelligence results, accompanied by poor strategic and tactical decisions. The normal databases we know are "relational," but the databases created for warehousing are called "dimensional" because they can quickly consolidate, cleanse, and integrate data from multiple, disparate databases across several systems in different locations. By placing an emphasis on specific architecture design, you can create a database schema tailored to different purposes and reporting methods.
ETL – Extract/Transform/Load is implemented in databases to allow for:
Extracting data from the source databases
Transforming/filtering/cleansing bad and inconsistent data
Loading the data onto the target databases
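To make those three steps concrete, here's a minimal Python sketch of an ETL pass. The record fields, cleansing rules, and in-memory "warehouse" are all made up for illustration, not taken from any real tool:

```python
# Minimal ETL sketch using plain Python structures.
# All field names and cleansing rules are illustrative assumptions.

def extract(rows):
    """Pull raw records from a source (a list stands in for a source DB)."""
    return list(rows)

def transform(rows):
    """Filter out bad records and normalize inconsistent values."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:  # drop incomplete records
            continue
        cleaned.append({
            "customer": row["customer"].strip().title(),  # normalize names
            "amount": round(float(row["amount"]), 2),     # normalize amounts
        })
    return cleaned

def load(rows, target):
    """Append the cleansed rows to the target table (a list here)."""
    target.extend(rows)
    return target

warehouse = []
raw = [
    {"customer": "  alice smith ", "amount": "19.991"},
    {"customer": "bob jones", "amount": None},  # bad record, gets filtered out
]
load(transform(extract(raw)), warehouse)
print(warehouse)  # only the cleaned first record survives
```

The same three functions run in the same order whether the target is a warehouse or a downstream mart; only the source and target change.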
Breaking Down Architecture Design
Now that ETL has been defined, it is important to explain the need for a staging area. Because data is pulled from various disparate databases, there has to be a place for the data to reside before it can be transformed into values for the data warehouse. Although it would be possible to create a "stream" of data transformation, the processing overhead of that is much more taxing than waiting for all of the necessary data to be collected and then integrating it into the data warehouse. A staging area also exists because it isn't feasible to extract all of the data at the same time. For example, sales data might be extracted every hour, while overall financial data wouldn't need hourly extraction; maybe a weekly pull is enough to compile monthly reports. Timing matters too, since extraction windows can differ depending on where the data is being pulled from, across locations and time zones. That said, depending on the organization, a staging area is not always necessary.
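As a sketch of that "different cadences" idea, here's a tiny Python helper that decides which sources are due for extraction into staging. The source names and intervals are assumptions for illustration only:

```python
# Sketch: per-source extraction schedules feeding one staging area.
# Source names and intervals are illustrative assumptions.
from datetime import datetime, timedelta

SCHEDULES = {
    "sales": timedelta(hours=1),    # sales data pulled hourly
    "finance": timedelta(weeks=1),  # finance pulled weekly for monthly reports
}

def due_sources(last_run, now):
    """Return the sources whose extraction interval has elapsed."""
    return [src for src, interval in SCHEDULES.items()
            if now - last_run[src] >= interval]

last_run = {"sales": datetime(2024, 1, 7, 22), "finance": datetime(2024, 1, 2)}
now = datetime(2024, 1, 8)
print(due_sources(last_run, now))  # sales is due (2h elapsed); finance isn't (6 days)
```

A real scheduler would also account for source-local time zones, which is exactly the timing wrinkle mentioned above.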
After the data has been collected, it is integrated: consolidated and placed into one general table or a few tables once the disparate internal/external databases have sent their requested information. Despite the name, the data warehouse isn't just storage but a holding place. It contains the data needed to provide key business metrics, which feed business intelligence to drive strategic and tactical decisions. Also, users and decision-makers don't just "access" the data warehouse; they interact with the information through data marts and various front-end tools that generate more readable reports for the end user. Oftentimes, this is where the architecture can end, making it an enterprise data warehouse. Integrated with organization-wide information, it can cover several subjects, containing everything from detailed data and summarized information to very broad, basic information. These can be very expensive and time-consuming to build. ETL is used in both kinds of architecture design, with or without data marts, since information has to be extracted either from the source databases into the data warehouse, or from the data warehouse into the data marts.
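To illustrate the "mart as a subset of the warehouse" idea, here's a hypothetical Python sketch that projects a sales mart out of a warehouse table, leaving the vendor and HR columns behind. All column names are invented for the example:

```python
# Sketch: a department data mart as a column subset of the warehouse.
# Table and column names are made up for illustration.

warehouse = [
    {"order_id": 1, "region": "east", "sales": 120.0, "vendor": "Acme",   "hr_code": "X9"},
    {"order_id": 2, "region": "west", "sales": 75.5,  "vendor": "Globex", "hr_code": "Y2"},
]

def build_mart(rows, columns):
    """Project only the columns this department's mart needs."""
    return [{col: row[col] for col in columns} for row in rows]

sales_mart = build_mart(warehouse, ["order_id", "region", "sales"])
print(sales_mart)  # vendor and HR columns are left out of the sales mart
```

Each department would call the same projection with its own column list, which is why no two marts need to look alike.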
In the end, ETL extracts the data from the data warehouse into one or more data marts to let the organization view different reports and data analyses. Depending on how the information is used, each data mart can be dimensional or relational, with various front-end tools to present the information. Extraction is not the only step, though: many times the data pulled isn't ready for analysis, so it is "transformed" into a different format or organized in a way that the data marts can use, depending on different criteria. ETL essentially cleans the data into a standard format. As I mentioned, each data mart will contain a different combination of tables, columns, and rows; the sales department might only need sales data and not specific vendor information or HR-related tables, among a variety of other reasons. These data marts, which are subsets of the organization-wide database (the enterprise data warehouse), are built from the bottom up for specific decision outcomes (the Kimball method: data mart first, built around it). As opposed to the broad data in the data warehouse, data marts contain summarized, and often detailed, data about their subject area. Once transformed, the data just gets loaded into the data marts. Yup, that's it for "Load". More recently, ELT has become a more and more viable option (used in late binding): the information is still extracted, but instead of being transformed up front, it is loaded into the data mart and then transformed just in time for the analysis to be run. Loading first gives earlier access to the raw data, rather than spending the time transforming it before it ever reaches the data warehouse/marts.
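To make the ETL vs. ELT ordering difference concrete, here's a toy Python sketch. The `transform()` here is just a stand-in for real cleansing, not any actual tool's API:

```python
# Toy sketch contrasting ETL vs. ELT step ordering.
# transform() and the data are placeholders, not a real pipeline.

def transform(rows):
    return [r.upper() for r in rows]  # stand-in for a cleansing step

def etl(source):
    staged = list(source)        # Extract to staging
    cleaned = transform(staged)  # Transform before loading
    return cleaned               # Load the already-transformed data

def elt(source):
    loaded = list(source)               # Extract and Load the raw data
    return lambda: transform(loaded)    # Transform deferred to query time

raw = ["widget", "gadget"]
print(etl(raw))          # transformed up front
deferred = elt(raw)
print(deferred())        # same result, computed "just in time"
```

The end result is identical; the difference is when you pay the transformation cost and how early the raw data is available downstream.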
Now, let’s take a look at another architecture model for data warehouses.
The data source layer represents all of the sources that feed data into the data warehouse. In healthcare, these can be EHR systems, financial departments, administration, etc. More generally, they can range from operational systems and web servers to third-party data.
The data extraction layer is where data is extracted on its way to the data warehouse, typically with some cleansing but no real transformation yet.
Staging area: where the data sits before being cleaned and transformed (depending on the modeling style).
ETL layer: you better not need me to explain it again (hehe).
Data storage layer: the data lands in one of a few possible places, such as the data warehouse, any additional data marts, or an operational data store.
Data logic layer: where the business rules are stored for the analysts to run their programs and software and generate business intelligence and reports.
Data presentation layer: kinda sounds like the 7-layer OSI model… hmmm. But this represents the "prettified" information given to the end user, usually via OLAP and/or a reporting tool.
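As a toy example of what that presentation layer does, here's a hypothetical Python sketch that aggregates mart rows into a small readable report. The field names are invented for illustration:

```python
# Sketch: a tiny "presentation layer" step that turns mart rows
# into a readable summary report. Field names are illustrative.

mart = [
    {"region": "east", "sales": 120.0},
    {"region": "east", "sales": 30.0},
    {"region": "west", "sales": 75.5},
]

def report(rows):
    """Aggregate sales by region and format a plain-text report."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["sales"]
    return "\n".join(f"{region}: {total:.2f}"
                     for region, total in sorted(totals.items()))

print(report(mart))  # one summarized line per region
```

A real OLAP or reporting tool does far more (drill-down, pivoting, visualization), but the essence is the same: summarize and pretty-print what the mart holds.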
Tired yet? Think you could make it a little longer? Architecture Design has been kind of smashed into your head and I already know that it’s a pretty dry topic.