Data Lakes vs. Data Warehouses: Choosing the Right Storage

Enabling data scientists and analysts to access quantitative and qualitative data while considering factors like cost of storage & computation, scalability, availability, and security defines mechanism to have data in data lakes or data warehouse.

Data is like a ball, and each department inside an organization plays the ball according to its own set of rules and understanding of concepts.

For example, A Data Scientist needs to study raw data to understand the pattern formed across that data whereas a Data Analyst looks for more structured data to study and seek out analytics with a result-orientated business intelligence in mind. A Data governance team is more concerned about the budget and to lend ownership of the accessed data to the right person.

Let’s first understand what they are before we jump on the bandwagon to identify what to use.

Data Lakes

A data lake is a system consist of raw, unstructured data in its raw format. The purpose of data is not well defined, and it remains unstructured, decentralized, in short not very well defined. This data is not a good source for analytics purpose but is a powerful tool to define patterns and predictions.

A data lake is a replica of original data from other sources that can be collected on one-time basis or on regular basis at a specific time interval. As no processing is done previously, the storage cost of acquisition and the storage both costs relatively less.

Azure Data Lake Storage, AWS Lake Formation are utilized by ETL developers to setup up Data Lakes in their pipelines.

Amazon Athena is a utility provided by AWS to analyze unstructured or semi-structured data in a data lake to work further.

Characteristics And Concepts Associated With Data Lakes

Non integrated: As data is being collected from multiple sources in its original format, data may or may not be consistent.

Low Cost: The concept of data lakes relies on low cost of computation and its storage.

Volatile: As data being volatile and is being updated in batches in real-time, a spike and temporary changes in data might be noticed.

Data Ocean: All data lakes related to a specific business that contribute to its complete scope result in forming a Data Ocean.

Data reservoir: Part of data lake that has been partly transformed, modified or processed and is secured and ready for analysis is termed as Data reservoir.

Data swamp: Data Swamp is a graveyard of collected unprocessed large amount of data that is never being used.

An enterprise data lakes’ architecture compromises of following components:

Why Data Lakes?

Date lakes have certain advantages when in use like they have boundless scalability, speed-up operations (as data is stored in raw format and is not transformed), provides flexibility in managing and storing data. With Data lakes and the help of new tools support, all complex and advanced algorithms provide a base for quantitative data used for ML solutions.

Why Not Data Lakes?

Data lakes have certain limitations as well. As it doesn’t have a very well-defined schema and structure of data, they don’t provide analytical insights from previous findings. Hence, integrity & consistency of data is not maintained and the computation cost of data is very high.

Data Warehouse

Data warehouses consist of well defined, structural, and processed data. The purpose of the data is well defined and data marts can be generated on top of it for further classification and processed data storage. This type of data is a good candidate for analytics and usually have low latency in display information required.

A data warehouse is a storage of centralized, processed data and may or may not bear any traits of original data. The cost of processing that data is high, and it also takes time to store data as processing happens before storage.

ETL developers uses Snowflake. Google Big Query, Amazon Redshift etc. for analytics and business intelligence operations. As a top up, to execute queries on warehoused data, tools like Power BI, Tableau are utilized to display results and dashboards.

Characteristics And Concepts Of Data Warehouse

Integrated: Data is extracted differently from different sources but transformed in a same way, resulting in an integrated data.

Non-Volatile: Real time analytics are usually not required, and data is always processed in batches and there is no influence of short-lived changes.

Scalable: Data warehouse is easily scalable both horizontally and vertically.

Operational Data Storage: An interim storage for performing small, simple operations for real-time analysis.

Data marts: A specific part of Datawarehouse being allocated to a specific set of users or a group for performing independent operations without interfering with data available for other users or groups.

An enterprise data warehouse may have following components:

Why Data Warehouse?

Datawarehouse are certainly advantageous when it comes to provide accelerated business intelligence and analytical solutions. Data here is more of qualitative in nature and holds consistency and integrity. Data warehouse provide solutions based on past data results in historical analytical insights and intelligence. To integrate with other sources of data is easy and decision making for predictable solutions is quick, reliable and efficient.

Why Not Data Warehouse?

Datawarehouse often losses a lot of data during transformation, resulting in distortions. The cost of transformation is very high and is a time-consuming process.

Data Lakes VS Data Warehouse

	Data Lakes	Data Warehouse
Operation	Loading of Data is done before Transformation. Process is defined as ELT.	Loading of Data is done after Transformation. Process is defined as ETL.
Cost	Cost of Storage is low and gradually increase with increase in Data. Computation cost while storing is low but high while retrieving.	Cost of storage is high and only transformed data is stored. Computation cost is also high while storing and low while retrieving.
Processing	Processing of data happens while read operations are performed.	Processing of data happens while write operations are performed.
User Type	Data Scientist	Data Analyst
Application	Deep Learning, Real Time Analysis, Neural Network	Reports, Dashboards with abstract data
Data Type	Unstructured, raw	Structured, formatted

Generic Industry-wise Use Cases:

Banking & Finance: Majority of the data available is structured in this industry and a lot of analytical and experimental predication would be required. But as that data is structured a data warehouse would be an ideal fit.

Healthcare: It has lot of unstructured and structured data available in terms of physician notes, reports, test results, insurance data, medical devices data. And this industry requires both analytical and experimental data to work on. An ideal scenario in healthcare industry will be to use a combination of data lakes and data warehouse.

Education: Lot of unstructured data is available in terms of results, grades, attendance, activities, photos, videos, and there is no need for day-to-day analytics to be done. A data lake or ocean would be an ideal solution.

Marketing & Sales: Lot of patterns and predications has to be done to run campaigns and promos on daily basis hence Datawarehouse would be an ideal choice.

The above use cases are generic, and it all depends on requirements and data available to define what fits in best.

Which One Is Right For You?

Current infrastructure of applications is utilizing structured systems & storing data in RDBMS, CRM, ERP, etc.?

If so, Data warehouse is the right choice else choose Data Lake.

How diverse are your data sources?

If you have unified systems with structural data Datawarehouse is what you should opt for. But if the data comes from various sources like emails, logs, images, files, etc. choose data lakes. Datawarehouse may result in significant data loss while transforming hence having a Data Lake would be an ideal choice.

How foreseeable are your needs in terms of results for business?

If the needs are predictable and not experimental a Datawarehouse provides right solution and if the need is experimental a Data Lake would be ideal. In case your business needs are a combination of both then before transformation, choose Data lakes and after transformation choose Data warehouse.

Is costing an issue in preserving data in conventional databases?

If this is the case, a data lake is the ideal platform for storing historical data at a cheap cost without the need to delete data to keep costs in check.

Are You Looking for Expertise ETL/ELT Solutions With Data at Its Core?

At Growexx, we not only work as a technical partner for execution but we also work for you as a data driven consultancy by analysing the available data and knowing about your needs. So, on the basis of your requirements, we give you a roadmap on how we can deliver flexible, scalable, secure pipelines with affordable solutions.

To know more about Discovery workshop, contact our data engineering team.

Vikas Agarwal

Vikas Agarwal is the Founder of GrowExx, a Digital Product Development Company specializing in Product Engineering, Data Engineering, Business Intelligence, Web and Mobile Applications. His expertise lies in Technology Innovation, Product Management, Building & nurturing strong and self-managed high-performing Agile teams.