Data Lake vs Data Warehouse
In my recent conversations with different teams, I found there are multiple understandings of the term Data Lake. It seems to be used interchangeably with the term Data Warehouse.
The term Data Warehouse (DWH) existed even before I started my career, and it will remain an important component of the BI framework for the foreseeable future. There are many definitions and details around this term, and compiling them would need some research. In the next few paragraphs I will try to put down my thoughts on the differences between these two terms, Data Warehouse and Data Lake.
Data Lake is a relatively new term, coined by James Dixon in October 2010. He defined it as follows: "If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples."
I love this simple definition.
In my opinion, most clients with new-age BI requirements are better aligned with Data Lakes than with a traditional DWH.
Data Warehouses are designed for known KPIs and dashboards. They are very good at responding to known questions. But when business teams have many ad hoc questions and are looking for dynamic data science analysis, there is a constantly evolving requirement for more data items to be available. A Data Lake is built to answer these new-age BI requirements.
The following is a summary of some of the technical differences between Data Warehouses and Data Lakes:
It’s interesting to see Data Science and BI teams working with a Data Lake come up with lots of interesting KPIs and analyses which, on the other hand, are not possible with a traditional Data Warehouse design. A Data Lake is very productive for these types of use cases:
- Making data available to the business for quick analysis
- Enhancing existing KPIs
- Identifying new KPIs
It’s important to mention that this additional flexibility comes at an additional cost. Some data processing will always be required before arriving at any KPI from a Data Lake. No such processing is required with the pre-processed data in a DWH.
In my experience, a Data Lake works well for organizations that do not have frozen requirements at the start of the BI cycle. Specific KPIs can be designed and implemented at later stages. To conclude, Data Warehouses are designed for known KPIs, whereas Data Lakes are the answer to unknown future queries.
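To make this cost concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the order events, the field names, the two KPIs and the pre-computed warehouse table are all hypothetical, and a real lake or DWH would sit behind a storage or query engine rather than in-memory strings and dictionaries. It only tries to show where the processing happens in each case.

```python
import json
from collections import defaultdict

# Hypothetical raw events as they might land in a Data Lake:
# one JSON document per line, with uneven casing and missing values.
RAW_LAKE_EVENTS = """\
{"order_id": "A1", "region": "EU", "amount": "120.50", "status": "paid"}
{"order_id": "A2", "region": "eu", "amount": "80", "status": "paid"}
{"order_id": "A3", "region": "US", "amount": null, "status": "cancelled"}
{"order_id": "A4", "region": "US", "amount": "200.00", "status": "paid"}
"""

# Hypothetical DWH table: the known KPI was computed when the data was loaded.
WAREHOUSE_REVENUE_BY_REGION = {"EU": 200.5, "US": 200.0}


def parse_events(raw_text):
    """Parse raw lines and normalise the region code (cleaning at read time)."""
    events = []
    for line in raw_text.splitlines():
        if line.strip():
            event = json.loads(line)
            event["region"] = event["region"].upper()
            events.append(event)
    return events


def revenue_by_region_from_lake(raw_text):
    """Known KPI from the lake: must filter, cast and aggregate on every query."""
    totals = defaultdict(float)
    for event in parse_events(raw_text):
        if event["status"] == "paid" and event["amount"] is not None:
            totals[event["region"]] += float(event["amount"])
    return dict(totals)


def cancellations_by_region_from_lake(raw_text):
    """New KPI nobody planned for: still answerable, because raw events were kept."""
    counts = defaultdict(int)
    for event in parse_events(raw_text):
        if event["status"] == "cancelled":
            counts[event["region"]] += 1
    return dict(counts)


def revenue_by_region_from_warehouse():
    """Known KPI from the DWH: a plain lookup, no processing at query time."""
    return dict(WAREHOUSE_REVENUE_BY_REGION)
```

In this sketch the revenue KPI comes out the same from both sides, but the lake version pays for parsing and cleaning on every query, while the warehouse version is a lookup. The cancellation count, on the other hand, can only be answered from the lake, because the pre-aggregated warehouse table in this example never kept that detail.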
– Written By
Big Data Architect