Data Warehouse Down, Data Lake Up Next!
As soon as I thought we were done, I heard of data lakes. Oh woe is data…
I thought everything was good. I had wrapped up my series on data warehouses and was slowly working on topics more foundational and basic to my knowledge so far. I had started the Back2Basics series as a way to solidify not only my knowledge of beginner concepts but also to teach others in a less formal, yet still informative, manner… until that day. Like most people, I have a bajillion tabs open in Chrome, so as I'm ctrl+tabbing through all of them, I accidentally end up on my Gmail… and I see an email. An email about how the data warehousing professor at my school was able to get Bill Inmon, one of the founding fathers of the data warehouse, to present a webinar to our department on Data Lakes and Integration and Using Unstructured Data with the Modern Enterprise Data Warehouse. Fascinated by the concept of a "Data Lake", I soon spent hours googling, reading related links, and clicking through several pages of Google and Bing search results. Ooooff, that's a lot.
Well, here is a compilation of what I've found out, as well as a little discussion of what I learned when I searched up "Data Warehouses vs Data Lakes". Anyways, I've talked about enterprise data warehouses and how, with the big data hype that happened a while back, there has been a huge technological shift in methodologies, all of which have been foundational in business intelligence. Although Health Catalyst has the "Late Binding" model in data warehouses, because that is more of a proprietary methodology, I'm going to refer to data warehouses as the concept of binding data to specific static structures and categories that define the kind of data analysis at an early point of "entry". What this means is that as data enters the data warehouse, it's already being sorted into buckets that will run the data through several tools and algorithms. Each of these buckets will, of course, be vastly different, and thus the data ends up in a stable but not very flexible position. At the start of big data and business intelligence this was a terrific method, but as decisions became more complex and the needs of organizations became more convoluted, there needed to be flexibility in what a data warehouse could be used for and how data could be analyzed. It's no longer about using data to support decisions and insights; it's about learning how to use and analyze data to make business decisions and discover new insights. It's only once data has begun being analyzed that new insights start to develop and new questions start to form, and that is why the "late binding" data warehouse model was so great: it allowed the data to be categorized later on in the process. With all the technology we have today, the rate of data is overwhelming, and the conventional approaches to data storage and management are becoming obsolete, or just not enough.
Unless organizations continually try to adapt, who knows how much critical knowledge that could help drive business decisions or business insight is being lost?
The ability to navigate from a starting question or data point in different directions, slicing and dicing the data in any ad-hoc way the train of thought of the analysis demands, is essential for real data discovery.
This is especially important in healthcare because of the constantly changing environment for the data. The data can be used for a variety of purposes and in a variety of situations, and thus requires the flexibility of the "late binding" data warehouse, hence an "ad-hoc" basis. Data lakes are similar in that they are a hub/repository for data that the organization has access to; they're also similar in that the stored data is in its most raw form, without any normalization or restrictive schema. This provides an unlimited window of view into the data for anyone to run ad-hoc queries and perform cross-source navigation and analysis on the fly. Successful data lake implementations respond to queries in real time and give users an easy, uniform access interface to disparate sources of data. However, at this point, I still hadn't really been able to see the distinct differences between a data lake and a data warehouse.
However, I am going to do some redefinition of the data warehouse and the data lake. Before, I had defined a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data to support business decisions and generate insight. As a result, a data warehouse would be used to present a single integrated version of the truth. A data warehouse should be:
- a central repository of data from various sources
- a store of current and historical data used to create dashboards, reports, and other visuals
- necessary to aid in business decisions and gathering insight
- an abstract picture of the business organized by "subject areas"
- Data headed for the warehouse goes through the ETL process (Extract, Transform, Load) before being loaded into the data warehouse for its defined purpose. This works as "schema-on-write" with the ETL methodology: it's necessary to design the data model and analytics framework up front, which also means we need to know how the data will be used in the future.
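To make the "schema-on-write" idea concrete, here's a minimal sketch in Python using an in-memory SQLite table. The table name, record shapes, and transform rules are all invented for illustration; the point is only that the Transform step happens *before* the Load, against a schema fixed up front:

```python
import sqlite3

# Schema-on-write: the warehouse table structure is decided before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (   -- hypothetical fact table
        sale_date TEXT NOT NULL,
        region    TEXT NOT NULL,
        amount    REAL NOT NULL
    )
""")

# Raw source records arrive in inconsistent shapes (made-up examples).
raw_records = [
    {"date": "2017-03-01", "region": "west", "amt": "125.50"},
    {"date": "03/02/2017", "region": "East", "amt": "89.99"},
]

def transform(rec):
    """The 'T' in ETL: normalize every record to the fixed schema
    before it is ever written to the warehouse."""
    date = rec["date"]
    if "/" in date:  # normalize MM/DD/YYYY to ISO format
        m, d, y = date.split("/")
        date = f"{y}-{m}-{d}"
    return (date, rec["region"].lower(), float(rec["amt"]))

conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [transform(r) for r in raw_records])

# Every later query leans on decisions made at write time.
total = conn.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(round(total, 2))
```

Notice the trade-off the post describes: the queries are simple and fast because the sorting into "buckets" already happened, but any question the schema didn't anticipate requires redesigning the model.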
For a data lake, though, Blue Granite provides a great analogy for how it works. In data warehouses, data marts (the subsets of more specialized data for specific analyses) are seen as more of an optional asset, but in data lakes they appear as a "bottle of water"… "cleansed, packaged, and structured for easy consumption" by whoever needs it. The data sources, meanwhile, are considered the "streams" feeding the lake. Users then have access to the lake to examine it, take samples, or dive in. As with the bottle of water, the raw data is extracted from the stream and loaded into the lake, but until it actually needs to be used, it won't be bottled (transformed). However, there are some more specifics that I believe are crucial:
- All data is loaded from source systems; nothing is turned away. In a sense, instead of ETL it is ELT: data is not transformed until it is necessary. Data is kept as raw as possible and is only transformed when it's actually needed. This "schema-on-read" approach allows for as-needed analysis, with frameworks created ad hoc in an iterative methodology. Great Resource to Read
- Not necessarily unstructured, just not transformed. A data lake is composed of several directories of data, and because data can arrive in any format, it will be a mix of structured and unstructured data. There's just no need for a formal structure and schema up front, otherwise known as a BDUF (Big Design Up Front).
- Because of the different data types, each needs to be treated differently, but in the end they need to be integrated into some form of a cohesive lake. Using Hadoop, some data can go through data reduction to slim down the volume and make only the relevant and useful data easy to find. Other data can be treated with ETL/ELT and, once integrated, analyzed as necessary when drinking the water. Still other data should be disambiguated and kept free of structure to help with analysis and future management.
- Immediate access to data. Users can shape the data as needed to meet their requirements. This speeds up delivery and offers flexibility, since there is no governing group that has to perform ETL first.
- Like any lake, there is a life cycle: as data is used, it ends up in an archival data pool, where it eventually gets recycled out as detritus on the bottom of the lake. The processing comes to the data; there is no need to move the data to separate processing areas. The only thing that leaves is the derived insights (the bottled water).
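The ELT/"schema-on-read" side of the list above can be sketched the same way. This is a toy illustration, not any particular lake product's API: the file layout, event shapes, and the `avg_visit_ms` "bottle of water" are all invented. Raw records are landed untouched (Extract + Load), and structure is imposed only at query time:

```python
import json
import pathlib
import tempfile

# The "lake": raw events are landed exactly as they arrive, with no upfront
# schema -- any shape of record is accepted (nothing is turned away).
lake = pathlib.Path(tempfile.mkdtemp()) / "events.jsonl"
raw_events = [
    {"type": "visit", "page": "/home", "ms": 320},
    {"type": "purchase", "sku": "A-12", "price": 19.99},  # different shape, still loaded
    {"type": "visit", "page": "/docs", "ms": 95},
]
lake.write_text("\n".join(json.dumps(e) for e in raw_events))

# Schema-on-read: the Transform happens only when a question is asked.
# This particular "bottle of water" only cares about visit durations;
# a different analysis tomorrow could read the same raw file differently.
def avg_visit_ms(path):
    events = [json.loads(line) for line in path.read_text().splitlines()]
    durations = [e["ms"] for e in events if e["type"] == "visit"]
    return sum(durations) / len(durations)

print(avg_visit_ms(lake))  # (320 + 95) / 2 = 207.5
```

Contrast this with the warehouse: nothing about the purchase event had to be modeled before loading, and a new question just means writing a new read-time transform instead of changing a schema.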
Andrew C. Oliver writes on his blog four simple steps for creating a data lake:
- Identify a few use cases
- Build the lake
- Get data from lots of different sources into the lake
- Provide a variety of fishing poles and see who lands the biggest and best trout (or generates the most interesting data-backed factoid)