Data Lake – Differences from Data Warehouses

Comparative Differences!

As I mentioned before, data lakes take in all data in its raw form. This can make it more time- and resource-intensive to perform analytics on the data in the lake, but it also increases versatility and flexibility. Because a data warehouse essentially profiles business processes and data first, it ends up with structured data that is suitable for reporting. A data lake skips this step and retains all data, stored for as long as the organization is running, which creates a huge reservoir that can be referenced at any point in time for analysis.

In addition, assuming we're using the traditional data warehouse model, data lakes can adapt to change much more easily. A common complaint is the time it takes data warehouses to change, especially because of how much development time is spent getting the structure and architecture right. Although the design itself can change relatively easily, the warehouse's complex data loading process (ETL) can eat up far more developer resources and time. With raw data and quick changes, it's much easier to try a new method of analysis; if it results in something useful, a more formal schema can be applied and automation can be developed for reusability. If not, just carry on, because nothing was finalized.

These differences add up to faster insights. As I mentioned earlier, data should now be used to develop new insights that help us ask new questions and guide new business decisions. Compared to a traditional data warehouse, a data lake can help get those results faster. That way you don't end up pulling crap out of your butts like Dilbert and Co.
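To make that "analyze raw now, formalize later" idea concrete, here's a minimal Python sketch (the file layout and field names are my own invention, not any standard): the lake stores raw JSON events untouched, and structure is only imposed at read time (schema-on-read) instead of during an upfront ETL load.

```python
import json
from pathlib import Path

import pandas as pd

# Ingestion into the lake: no upfront ETL, the raw event is stored as-is.
def land_raw_event(lake_dir: Path, event: dict) -> None:
    lake_dir.mkdir(parents=True, exist_ok=True)
    path = lake_dir / f"event_{len(list(lake_dir.iterdir()))}.json"
    path.write_text(json.dumps(event))

# Schema-on-read: structure is imposed only when an analysis needs it.
def read_with_schema(lake_dir: Path, columns: list) -> pd.DataFrame:
    records = [json.loads(p.read_text()) for p in lake_dir.glob("*.json")]
    # Missing fields simply come back as NaN instead of breaking a load job.
    return pd.DataFrame.from_records(records)[columns]

lake = Path("lake/web_events")
land_raw_event(lake, {"user": "a", "action": "click", "ts": "2016-01-01"})
land_raw_event(lake, {"user": "b", "action": "buy"})  # a new, partial shape is fine
df = read_with_schema(lake, ["user", "action"])
```

If the second, oddly shaped event turns out to be useful, that's the point where a formal schema and automated loading get built; if not, nothing was invested in it.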


Now, this doesn't mean that a normally functioning data warehouse should be scrapped. But if an organization is running into the problems mentioned above, especially in volatile industries that often require new ways of thinking and new methodologies, a data lake can be implemented alongside the data warehouse. The warehouse should remain the clean, vetted data store that users at various levels rely on to make decisions. Without it, decision makers are trying to make perfect, critical decisions with imperfect information, and that can be one of the scariest positions to be in. If anything, data lakes can act as an archive repository for the warehouse, or as extra storage for replicated data that isn't deemed "fit" to be stored or processed in a traditional data warehouse, giving users access to more data, both transformed and raw. In a gaming sense, a data lake is a "Free For All" round where data dukes it out and vies for the top spot to be manipulated by the masters (the users). In the end, I see data lakes as "big data" storage: their methodologies and tools let people fish out the useful bits they need, which can then go into a structured data warehouse. SYNERGY. FUUUSSIION. HAH.
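Continuing the toy event example from above, here's a hedged sketch of that lake-to-warehouse handoff (the table name, fields, and the quality gate are all invented for illustration): once an ad-hoc analysis on raw lake data proves useful, the vetted slice gets a formal schema and is promoted into the warehouse.

```python
import json
import sqlite3
from pathlib import Path

# Promote clean, complete records from the raw lake into a structured
# warehouse table; everything else stays behind in the lake.
def promote_to_warehouse(lake_dir: Path, db_path: str) -> int:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user TEXT NOT NULL, ts TEXT NOT NULL)"
    )
    loaded = 0
    for path in lake_dir.glob("*.json"):
        record = json.loads(path.read_text())
        # Only records that pass the gate are considered warehouse-"fit".
        if record.get("action") == "buy" and record.get("user") and record.get("ts"):
            con.execute("INSERT INTO purchases VALUES (?, ?)",
                        (record["user"], record["ts"]))
            loaded += 1
    con.commit()
    con.close()
    return loaded
```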


“In broad terms, data lakes are marketed as enterprisewide data management platforms for analyzing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”


Quote Source

The difficulty is that data lakes rely on many users being highly skilled at data manipulation and analysis; otherwise, a data lake implementation becomes time- and resource-intensive only to be used by a few. Getting value out of a large project is the responsibility of the organization and the end user, right? As a result, this technology has to be used and applied enough times to deliver that value in business decisions and insight, or it will just look like a lake of disconnected data pools and poorly managed "data marts". If the data lake is not implemented well, data will simply be dumped into it, and it can turn into a "data swamp". It doesn't matter how vast or deep the data lake is if the organization hasn't figured out how to use it to drive real improvements with analytics. In healthcare especially, there is no point if physicians and other health care providers are left trying to figure out how to drink the lake's water because they don't know how to access it. Depending on the industry and organization, the data can become complex while the processing stays primitive. As I mentioned, if it is not well maintained, it can end up being more of a staging area than a data lake, requiring a variety of tools before it's usable by anyone other than the data analysts. It's still a relatively new "technology", so there hasn't been enough variety and sample size for best-practice approaches to be developed.

[Image: data lake comparison – Image Source]

But like most new things, it must overcome resistance, at the risk of losing time, credibility, data, sanity, or all of the above.

Staging Area Data Lake Source

Governance and Maintenance

Thus, another important factor is being able to control and determine the data quality of incoming source systems, along with the ability to automatically add descriptive metadata and to script a mechanism for maintaining it. Data governance is important here to control "who may" access the data and "what they can do" with it. Because of the way data is stored in these situations, this can prove difficult given the different formats, especially when they have "over time deviated from normal form" (OTDNF). With the data available, informaticians, analysts, and data scientists can create analyses and solutions that are "point focused" and specific to business needs; however, this recreates the business silos that data warehouses produce, but now in an unstructured "free-for-all" environment.
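As a rough illustration of what "automatically add descriptive metadata" could mean in practice (the function and fields are hypothetical, not a standard), ingestion might stamp each file with a small sidecar metadata record as it lands in the lake:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical ingestion hook: every file landing in the lake gets a
# sidecar metadata record so it can be found and trusted later.
def stamp_metadata(data_file: Path, source_system: str, owner: str) -> Path:
    raw = data_file.read_bytes()
    meta = {
        "file": data_file.name,
        "source_system": source_system,  # where the data came from
        "owner": owner,                  # who governs access to it
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),  # detect silent changes
    }
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```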

Metadata

To maintain the data lake and prevent it from becoming a data swamp, it's important to collect, update, and manage a lot of metadata through a metadata life cycle management process. Metadata itself is a can of worms, but to simplify, it can be divided into descriptive, structural, and administrative metadata. In a nutshell, managing it must ensure:

  1. Availability – Stored where it can be accessed and indexed for easy retrieval.
  2. Quality – Consistent quality for integrity; a single source of truth.
  3. Persistence – Kept over time for analysis and other purposes.
  4. Open License – Allows reuse.

The lifecycle is also more comprehensive than the data's own, because metadata exists before the data is created, and metadata describing the data remains even after it is removed. It covers creating, maintaining, updating, storing, publishing, and handling deletion.
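Putting the three categories and those lifecycle stages together, a catalog entry might look something like this minimal Python sketch (the class and its fields are illustrative assumptions, not any metadata standard):

```python
from dataclasses import dataclass, field
from enum import Enum

# Lifecycle stages from the paragraph above: metadata is created before the
# data, and a deletion record can outlive the data itself.
class LifecycleStage(Enum):
    CREATED = "created"
    MAINTAINED = "maintained"
    UPDATED = "updated"
    STORED = "stored"
    PUBLISHED = "published"
    DELETED = "deleted"

@dataclass
class CatalogEntry:
    # Descriptive: what the data is about, for search and retrieval.
    title: str
    description: str
    keywords: list = field(default_factory=list)
    # Structural: how the data is organized.
    file_format: str = "json"
    schema_fields: list = field(default_factory=list)
    # Administrative: ownership, rights, and lifecycle state.
    owner: str = ""
    license: str = "internal-use"  # an open license would allow reuse
    stage: LifecycleStage = LifecycleStage.CREATED
```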

[Image: metadata lifecycle – Image Source]

Aside from metadata, other solutions have been suggested to help prevent the lake from becoming a swamp, such as integration mapping, context, and metaprocess. All of these are meant to ensure accuracy, availability, completeness, conformance, consistency, credibility, relevance, and timeliness. Applications should also be developed for each individual organization to increase efficiency and effectiveness, allowing any kind of user to access the data and use it to its potential.
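To show what checking even a couple of those dimensions might look like in a script (the required fields and the 30-day threshold are invented for illustration), a simple completeness-and-timeliness check could be:

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ["patient_id", "event", "ts"]  # hypothetical record schema

def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    present = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return present / len(REQUIRED_FIELDS)

def is_timely(record: dict, max_age_days: int = 30) -> bool:
    """True if the record's timestamp is recent enough to still be relevant."""
    if not record.get("ts"):
        return False
    age = datetime.now(timezone.utc) - datetime.fromisoformat(record["ts"])
    return age.days <= max_age_days

record = {"patient_id": "p1", "event": "admit", "ts": "2016-05-01T00:00:00+00:00"}
print(completeness(record), is_timely(record))
```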

Security and Privacy Concerns

Lastly, there are security and privacy concerns. Unfortunately, many data lakes are being used for data whose privacy and regulatory requirements represent real threat exposure, increasing risk or increasing the impact of a vulnerability being exploited. In healthcare especially, where 99% of all incoming information will be considered protected health information, how will organizations add oversight to a system that is meant to run rampant? The data lake was created for those who need immediate access to build data-driven solutions, create analyses, and formulate the questions meant to drive business decisions and develop business insight. The data warehouse lacked that flexibility, and so, in "revolt", data lakes were developed to address real needs in a convenient way, without having to wait for corporate IT and its repressive bureaucracy. Though I understand the desire to catalog and secure data, a slow and cumbersome process can inhibit insight-driven innovation; agile development is a testament to that. Data lakes are meant to be chaos-filled and rampant with disorganization, but that's a "necessary evil" to stimulate and make innovation possible.
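As a toy sketch of what minimal oversight could look like without killing that immediacy (the roles, tags, and function are all hypothetical; real PHI controls under HIPAA are far more involved), access to PHI-tagged datasets in the lake could be gated by sensitivity:

```python
# Hypothetical dataset sensitivity tags and user role clearances.
DATASET_TAGS = {
    "web_clicks": set(),
    "admissions": {"phi"},
}

ROLE_CLEARANCES = {
    "analyst": set(),             # raw-lake access, but no PHI
    "clinical_analyst": {"phi"},  # vetted for protected health information
}

def can_access(role: str, dataset: str) -> bool:
    # Unknown datasets are assumed sensitive until someone tags them.
    required = DATASET_TAGS.get(dataset, {"phi"})
    return required <= ROLE_CLEARANCES.get(role, set())

print(can_access("analyst", "admissions"))           # False
print(can_access("clinical_analyst", "admissions"))  # True
```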

Article by Sir. Lappleton III

I'm a happy-go-lucky college student who started a blog as a way to not only document my education and experiences, but also to share them with whoever stumbles upon my site! Hopefully I can keep you entertained while you learn a few things about IT, and about my time and experiences as I plunge deeper and deeper into healthcare! A couple of my areas of focus are data management, system security (cyber security), and information technology policy.