The phrase ‘Data Engineering’ has been trending over the last couple of years. At the risk of oversimplifying, it refers to the underlying engineering needed for big data and analytics processes. It has also become a critical part of AI/ML applications since these rely heavily on large datasets as input.
Overall, data engineering can be split into various technology layers:
- Data storage – well, to store data!
- Data pipelines – to move data from one place to another
- Data transformation – to extract, load and transform data so it’s usable for analytics or AI
- Data analysis – to process data and draw conclusions from it
- Data presentation – to present it in the form of dashboards/charts
In this blog, we focus on the first layer: data storage. Depending on the purpose of the data and how it will be accessed in the future, engineers choose between a database, a data warehouse, or a data lake.
Fundamentals of a Database
A database is an organised collection of data that supports Online Transaction Processing (OLTP). The data is managed by a Database Management System (DBMS), which mediates interactions between the database, its users, and applications.
Database Features: Many varieties of databases are used to store structured and/or semi-structured data, each having distinct characteristics:
- Relational databases store data in a tabular format with fixed rows and columns
- Object-oriented databases represent information in the form of objects, as used in object-oriented programming
- Distributed databases store data across multiple locations or systems
A database enables the following capabilities:
- Ensures access only to authorised users, thereby providing data security
- Ensures data integrity through ACID (Atomicity, Consistency, Isolation, Durability) transactions
- Supports query languages and APIs for interacting easily with the data in the database
- Optimises query performance through indexing
- Facilitates full-text search of the database
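Two of these capabilities – ACID transactions and indexing – can be sketched with Python's built-in sqlite3 module. The `accounts` table and the transfer scenario below are hypothetical, chosen purely to illustrate atomicity: either both updates are applied, or neither is.

```python
import sqlite3

# In-memory database with a hypothetical "accounts" table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

# Atomicity: the context manager commits if both updates succeed,
# and rolls back so that neither is applied if either fails
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE owner = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE owner = 'bob'")

# Indexing: speeds up lookups on a frequently queried column
conn.execute("CREATE INDEX idx_accounts_owner ON accounts (owner)")

balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)  # {'alice': 70.0, 'bob': 80.0}
```

A production system would use a server-based DBMS rather than SQLite, but the transaction and indexing semantics are the same in spirit.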
Typically, a database is used to store transactional records, statistics, data from IoT devices, user information from mobile applications, and so on.
While databases support OLTP, data lakes and data warehouses support Online Analytical Processing (OLAP), which collects data from multiple sources to enable reporting, analytics, and business intelligence (BI).
Understanding Data Warehouses
The data warehouse is used to store highly structured data – both current and historical – from multiple sources. Think of it as a giant database that has been optimised for analytics.
The data stored in the data warehouse can be either raw ingested data or highly curated, cleansed, filtered, and aggregated data.
Extract, transform, and load (ETL) processes are used to move data from the source to the data warehouse. This movement can be scheduled at a specific periodicity and therefore does not always provide access to data in real-time. One needs to use streaming data tools like Striim to feed real-time data into business intelligence workloads.
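A minimal sketch of an ETL step, using only the Python standard library: the CSV "source" and the `orders` table are hypothetical, and sqlite3 stands in for a real warehouse engine.

```python
import csv
import io
import sqlite3

# Hypothetical source system: order records exported as CSV text
source_csv = io.StringIO(
    "order_id,amount,currency\n"
    "1001,25.50,gbp\n"
    "1002,99.99,usd\n"
)

# Extract: read the raw rows from the source
rows = list(csv.DictReader(source_csv))

# Transform: cast amounts to numbers and normalise currency codes
cleaned = [(int(r["order_id"]), float(r["amount"]), r["currency"].upper()) for r in rows]

# Load: write the cleaned rows into a warehouse-style table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

In practice this step would be orchestrated on a schedule (which is exactly why warehouse data is not always real-time), with a streaming tool filling the gap where fresher data is needed.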
A data warehouse works best with structured data because it relies on a pre-defined, fixed relational schema. In some cases, semi-structured data may also be supported. Once the data has been moved into the warehouse, BI tools are used to run analytics, draw insights, and generate reports.
In today’s world, companies are moving to cloud-based warehouses like Amazon Redshift, Azure SQL Data Warehouse, etc. to run modern BI applications. However, that may not be the best solution for every business.
The primary function of a data warehouse is to enable in-depth analysis. It can be used to store very large amounts of historical data but is not meant for the transaction and concurrency needs of an application. For that, one or more databases will be needed.
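The kind of in-depth analysis a warehouse enables is typically aggregation across large amounts of historical data. A hedged sketch, again with sqlite3 standing in for a warehouse engine and a hypothetical `sales` fact table:

```python
import sqlite3

# A hypothetical fact table of historical sales, for illustration only
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales (year INTEGER, region TEXT, revenue REAL)")
wh.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2022, "EMEA", 120.0), (2022, "APAC", 80.0),
     (2023, "EMEA", 150.0), (2023, "APAC", 95.0)],
)

# An OLAP-style query: aggregate historical data by a dimension
report = wh.execute(
    "SELECT year, SUM(revenue) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(report)  # [(2022, 200.0), (2023, 245.0)]
```

Note what this is good at – scanning and summarising years of data – and what it is not: the rapid, concurrent, single-row reads and writes an application needs, which remain the job of an OLTP database.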
Deconstructing Data Lakes
A data lake is similar to a data warehouse in being a repository of large volumes of current and historical data drawn from multiple sources. Data can be stored in a variety of ways in the data lake, including as a database.
The data can be structured or semi-structured, stored in different formats such as JSON, BSON, CSV, TSV, Avro, ORC, and Parquet. A data lake can present a database-like view of the data it stores while also handling analytical workloads, much like a data warehouse.
In addition to structured and semi-structured data, data lakes can also be used to store unstructured data. Even JSON documents, PDFs and audio files can be stored without having to be transformed.
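The "store first, interpret later" idea – often called schema-on-read – can be sketched with the standard library alone. Here a plain directory plays the role of the lake, and the JSON and CSV files are hypothetical examples of raw, untransformed landing data:

```python
import csv
import json
import pathlib
import tempfile

# A hypothetical data lake: a directory holding files in their raw formats
lake = pathlib.Path(tempfile.mkdtemp())

# Semi-structured JSON lands as-is, with no upfront schema required
(lake / "events.json").write_text(json.dumps([{"user": "alice", "action": "login"}]))

# Structured CSV lands alongside it, also untransformed
(lake / "orders.csv").write_text("order_id,amount\n1001,25.50\n")

# Schema-on-read: each consumer parses the raw files only when it needs them
events = json.loads((lake / "events.json").read_text())
with (lake / "orders.csv").open() as f:
    orders = list(csv.DictReader(f))

print(events[0]["action"], orders[0]["amount"])  # login 25.50
```

Real lakes sit on object storage such as Amazon S3 and favour columnar formats like Parquet, but the principle is the same: the schema is applied at read time, not at write time.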
This blend of structured and unstructured data expands the scope of the insights data lakes can provide and helps uncover solutions to unexpected problems.
Data engineering teams are opting for data lakes to build machine learning and predictive analytics models. Much like data warehouses, however, data lakes are not suited to fulfilling the transaction and concurrency needs of an application.
According to a data engineering expert at Merit, “The role of data lakes is expected to become more and more prominent over the years. We expect to see a lot of innovation in this space, especially because of an explosion of unstructured data from photos, videos, pdf files, chat messages, e-commerce platform comments, and so on.”
To reiterate, each of the following data storage approaches has a specific purpose:
- Databases: Store the current data needed to run an application and are ideal for transactions
- Data Warehouses: Store current and historical data from multiple sources using a predefined and fixed schema and can be used for analytics
- Data Lakes: Store current and historical data from a variety of systems in raw form and can be used for predictive analytics, machine learning, and complex BI requirements
Which one is best for your organisation will depend on your business needs; sometimes, it can be a combination of all three.
Merit Group’s expertise in Data Storage
Merit Group partners with some of the world’s leading B2B intelligence companies within the publishing, automotive, healthcare, and retail industries. Our data and engineering teams work closely with our clients to build data products and business intelligence tools that optimise business for growth.
The first step to getting BI right is to make sure your data warehousing and storage strategy are optimal – both from a technology and cost perspective. Merit’s team of data engineers will work closely with your CIO and other key decision makers to lay the foundation for a robust data intelligence and BI roadmap that is future-proof.
If you’d like to learn more about our service offerings or speak to a data science expert, please contact us here: https://www.meritdata-tech.com/contact-us
Related Case Studies
A Digital Engineering Solution for High Volume Automotive Data Extraction
Automotive products required help to track millions of price points and specification details for a large range of vehicles.
Bespoke Data Engineering Solution for High Volume Salesforce Data Migration
A global market leader in credit risk and ratings needed a data engineering solution for Salesforce data migration.