Enabling data scientists and analysts to access quantitative and qualitative data while considering factors like cost of storage & computation, scalability, availability, and security defines mechanism to have data in data lakes or data warehouse.
Data is like a ball, and each department inside an organization plays the ball according to its own set of rules and understanding of concepts.
For example, A Data Scientist needs to study raw data to understand the pattern formed across that data whereas a Data Analyst looks for more structured data to study and seek out analytics with a result-orientated business intelligence in mind. A Data governance team is more concerned about the budget and to lend ownership of the accessed data to the right person.
Let’s first understand what they are before we jump on the bandwagon to identify what to use.
A data lake is a system consist of raw, unstructured data in its raw format. The purpose of data is not well defined, and it remains unstructured, decentralized, in short not very well defined. This data is not a good source for analytics purpose but is a powerful tool to define patterns and predictions.
A data lake is a replica of original data from other sources that can be collected on one-time basis or on regular basis at a specific time interval. As no processing is done previously, the storage cost of acquisition and the storage both costs relatively less.
Azure Data Lake Storage, AWS Lake Formation are utilized by ELT developers to setup up Data Lakes in their pipelines.
Amazon Athena is a utility provided by AWS to analyze unstructured or semi-structured data in a data lake to work further.
Non integrated: As data is being collected from multiple sources in its original format, data may or may not be consistent.
Low Cost: The concept of data lakes relies on low cost of computation and its storage.
Volatile: As data being volatile and is being updated in batches in real-time, a spike and temporary changes in data might be noticed.
Data Ocean: All data lakes related to a specific business that contribute to its complete scope result in forming a Data Ocean.
Data reservoir: Part of data lake that has been partly transformed, modified or processed and is secured and ready for analysis is termed as Data reservoir.
Data swamp: Data Swamp is a graveyard of collected unprocessed large amount of data that is never being used.
An enterprise data lakes’ architecture compromises of following components:
Date lakes have certain advantages when in use like they have boundless scalability, speed-up operations (as data is stored in raw format and is not transformed), provides flexibility in managing and storing data. With Data lakes and the help of new tools support, all complex and advanced algorithms provide a base for quantitative data used for ML solutions.
Data lakes have certain limitations as well. As it doesn’t have a very well-defined schema and structure of data, they don’t provide analytical insights from previous findings. Hence, integrity & consistency of data is not maintained and the computation cost of data is very high.
A data warehouse is a storage of centralized, processed data and may or may not bear any traits of original data. The cost of processing that data is high, and it also takes time to store data as processing happens before storage.
Integrated: Data is extracted differently from different sources but transformed in a same way, resulting in an integrated data.
Non-Volatile: Real time analytics are usually not required, and data is always processed in batches and there is no influence of short-lived changes.
Scalable: Data warehouse is easily scalable both horizontally and vertically.
Operational Data Storage: An interim storage for performing small, simple operations for real-time analysis.
Data marts: A specific part of Datawarehouse being allocated to a specific set of users or a group for performing independent operations without interfering with data available for other users or groups.
An enterprise data warehouse may have following components:
Datawarehouse are certainly advantageous when it comes to provide accelerated business intelligence and analytical solutions. Data here is more of qualitative in nature and holds consistency and integrity. Data warehouse provide solutions based on past data results in historical analytical insights and intelligence. To integrate with other sources of data is easy and decision making for predictable solutions is quick, reliable and efficient.
Datawarehouse often losses a lot of data during transformation, resulting in distortions. The cost of transformation is very high and is a time-consuming process.
|Data Lakes||Data Warehouse|
|Operation||Loading of Data is done before Transformation. Process is defined as ELT.||Loading of Data is done after Transformation. Process is defined as ETL.|
|Cost||Cost of Storage is low and gradually increase with increase in Data. Computation cost while storing is low but high while retrieving.||Cost of storage is high and only transformed data is stored. Computation cost is also high while storing and low while retrieving.|
|Processing||Processing of data happens while read operations are performed.||Processing of data happens while write operations are performed.|
|User Type||Data Scientist||Data Analyst|
|Application||Deep Learning, Real Time Analysis, Neural Network||Reports, Dashboards with abstract data|
|Data Type||Unstructured, raw||Structured, formatted|
Banking & Finance: Majority of the data available is structured in this industry and a lot of analytical and experimental predication would be required. But as that data is structured a data warehouse would be an ideal fit.
Healthcare: It has lot of unstructured and structured data available in terms of physician notes, reports, test results, insurance data, medical devices data. And this industry requires both analytical and experimental data to work on. An ideal scenario in healthcare industry will be to use a combination of data lakes and data warehouse.
Education: Lot of unstructured data is available in terms of results, grades, attendance, activities, photos, videos, and there is no need for day-to-day analytics to be done. A data lake or ocean would be an ideal solution.
Marketing & Sales: Lot of patterns and predications has to be done to run campaigns and promos on daily basis hence Datawarehouse would be an ideal choice.
The above use cases are generic, and it all depends on requirements and data available to define what fits in best.
If so, Data warehouse is the right choice else choose Data Lake.
If you have unified systems with structural data Datawarehouse is what you should opt for. But if the data comes from various sources like emails, logs, images, files, etc. choose data lakes. Datawarehouse may result in significant data loss while transforming hence having a Data Lake would be an ideal choice.
If the needs are predictable and not experimental a Datawarehouse provides right solution and if the need is experimental a Data Lake would be ideal. In case your business needs are a combination of both then before transformation, choose Data lakes and after transformation choose Data warehouse.
If this is the case, a data lake is the ideal platform for storing historical data at a cheap cost without the need to delete data to keep costs in check.
At Growexx, we not only work as a technical partner for execution but we also work for you as a data driven consultancy by analysing the available data and knowing about your needs. So, on the basis of your requirements, we give you a roadmap on how we can deliver flexible, scalable, secure pipelines with affordable solutions.