Many of us have made changes to data based on requirements that later evolved. By the time we notice, rolling back is often no longer possible. Change isn't just a constant; it's an essential part of managing data, and it demands a sophisticated approach. Enter the Git-like functionality of Nessie, Dremio's data catalog.
Just as Git has become fundamental to software development, data engineers need similar tools to work in parallel, compare data versions, promote changes to production and roll back data when needed. Nessie provides data engineers with a Git-like version control system for managing data versions, branches, merges, and commits. This can be very helpful when multiple data engineers are working with and transforming data at the same time. Nessie allows each engineer to work in separate branches while maintaining a single source of truth in the form of the main branch. This functionality empowers data engineering teams to maintain data quality collaboratively in the face of unrelenting change.
This article provides a step-by-step guide demonstrating how Nessie, Dremio and MinIO work together to enhance data quality and collaboration in your data engineering workflows. Whether you're a data engineer, ML engineer, or just a modern data lake enthusiast, this blog equips you with the knowledge and tools needed to effectively enhance your data versioning practices.
Understanding the Basics: Nessie Workloads
Nessie allows for Git-like workloads that let you test, develop and push to production. Let's break down some key concepts:
- Branches: Just like in Git, branches in Nessie allow you to work on different data engineering tasks concurrently. For example, you might have separate branches for feature development, data cleaning and data transformation. Each branch can have its own set of data changes.
- Commits: In Nessie, a commit represents a snapshot of the data at a specific point in time. When you make changes to your data, you create a new commit, which records those changes. Commits are linked to a specific branch. When you need to revert your data to a stable or known-good state, Nessie makes it easy to select a specific commit and roll back to that data version. This ensures data quality and consistency.
- Merges: Nessie allows you to merge the changes from one branch into another. This is similar to merging code changes in Git. When you merge a branch, the data changes made in that branch become part of the target branch.
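The three concepts above can be sketched with a toy in-memory model. This is only a mental model, not Nessie's API: the `ToyCatalog` and `Commit` names, the row shape, and the fast-forward-only merge are all assumptions made for illustration. A real catalog tracks table metadata rather than rows, and its merge handles more cases than this sketch does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (id, product_name)


@dataclass
class Commit:
    """A snapshot of every table at one point in time, linked to its parent."""
    tables: Dict[str, Tuple[Row, ...]]
    parent: Optional["Commit"]
    message: str


class ToyCatalog:
    """Toy model only: a branch is a name that points at a commit."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit({}, None, "root")}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Creating a branch copies no data; it adds a second name for a commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, table: str, rows: List[Row], message: str) -> None:
        # A commit records the new table state on top of the branch's current head.
        head = self.branches[branch]
        tables = dict(head.tables)
        tables[table] = tuple(rows)
        self.branches[branch] = Commit(tables, head, message)

    def read(self, branch: str, table: str) -> List[Row]:
        return list(self.branches[branch].tables.get(table, ()))

    def merge(self, source: str, target: str) -> None:
        # Simplification: only fast-forward merges are supported, so the
        # target head must be an ancestor of the source head.
        src, tgt = self.branches[source], self.branches[target]
        node: Optional[Commit] = src
        while node is not None and node is not tgt:
            node = node.parent
        if node is None:
            raise ValueError("target has diverged; this toy only fast-forwards")
        self.branches[target] = src

    def rollback(self, branch: str, steps: int = 1) -> None:
        # Rolling back moves the branch name to an earlier commit.
        head = self.branches[branch]
        for _ in range(steps):
            if head.parent is None:
                raise ValueError("cannot roll back past the root commit")
            head = head.parent
        self.branches[branch] = head


catalog = ToyCatalog()
catalog.commit("main", "SalesData", [(1, "ProductA")], "initial load")
catalog.create_branch("etl")
catalog.commit("etl", "SalesData", [(1, "ProductA"), (6, "ProductC")], "add row 6")
main_before = catalog.read("main", "SalesData")       # main is unaffected by etl
catalog.merge("etl", "main")
main_after = catalog.read("main", "SalesData")        # etl's commit is now on main
catalog.rollback("main")
main_rolled_back = catalog.read("main", "SalesData")  # back to the known-good state
```

Because every operation only moves a branch name between commits, isolation, merging and rollback all stay cheap in this model no matter how much data the tables hold.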
Dremio is a distributed analytics engine that operates as an open-source platform, offering an intuitive self-service interface for data exploration, transformation and collaborative work. Its design is grounded in Apache Arrow, a high-speed columnar memory format.
MinIO is high-performance object storage. Renowned for its exceptional speed and scalability, MinIO serves as a pivotal component in building and maintaining modern data lake infrastructures. MinIO empowers architects to efficiently manage and store massive volumes of data on-prem, on bare metal, on the edge, or on any of the public clouds.
Apache Iceberg is an open table format suited for managing large volumes of data in data lakes. Unique features like time travel, dynamic schema evolution, and partition evolution make it a game-changer, allowing query engines to work concurrently on the same data safely and efficiently. See The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO for more information about Iceberg features.
To begin, open a terminal and navigate to the folder where you cloned/downloaded the repo, and run the command below to start up Dremio.
docker-compose up dremio
Wait a few moments and then navigate to http://localhost:9047 to access Dremio. Update the requested fields and then click Next.
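If you would rather not guess how long "a few moments" is, a small helper can poll the port until it accepts connections. This is a minimal sketch using only the Python standard library; the function name `wait_for_port` and its defaults are assumptions for illustration. An open port only means the container is listening, so the web UI may still need a moment longer.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9047)` returns once Dremio's port is open. The same call works for any other service port you start later in this guide.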
Next, run the following command to start MinIO.
docker-compose up minio
The docker-compose up command below will start up Nessie.
docker-compose up nessie
Navigate to http://localhost:9001 to log in to MinIO with the username and password of minioadmin:minioadmin. You’ll be prompted to create a bucket.
Make one named
Then head back over to Dremio at http://localhost:9047, click Add Source and select Nessie.
- Set the Name to
- Set the endpoint URL to
- Set the authentication to
Don’t click Save yet. Instead, in the navigation panel on the left, click Storage. MinIO is S3-API compatible object storage and can use the same connection pathways as AWS S3.
- For your access key, set
- For your secret key, set
- Set root path to
Scroll down for the next set of instructions.
- Click on the Add Property button under Connection Properties to create and configure the following properties.
Then click Save. You should now see Nessie Catalogs in your data sources.
Create a Data Source
In Dremio, navigate to SQL Runner on the left-hand side. Make sure that Context in the upper right area of the text editor is set to our Nessie source. Otherwise, you will have to reference the context like nessie.SalesData instead of just SalesData to run this query. Copy and paste the SQL below and run it.
CREATE TABLE SalesData (
    id INT,
    product_name VARCHAR,
    sales_amount DECIMAL,
    transaction_date DATE
) PARTITION BY (transaction_date);
Run the query below to insert data into the table you just created.
INSERT INTO SalesData (id, product_name, sales_amount, transaction_date)
VALUES
    (1, 'ProductA', 1500.00, '2023-10-15'),
    (2, 'ProductB', 2000.00, '2023-10-15'),
    (3, 'ProductA', 1200.00, '2023-10-16'),
    (4, 'ProductC', 1800.00, '2023-10-16'),
    (5, 'ProductB', 2200.00, '2023-10-17');
Navigate back to MinIO to see that your data lake has been populated with the Iceberg tables.
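The SalesData table is partitioned by transaction_date. As a rough mental model of what that means for the five rows inserted above, the sketch below groups them by date in plain Python. It illustrates only the logical grouping; the function name `group_by_partition` is an assumption for illustration, and the actual file layout you see in MinIO is decided by Iceberg and Dremio.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# The same five rows inserted into SalesData above.
rows = [
    (1, "ProductA", Decimal("1500.00"), date(2023, 10, 15)),
    (2, "ProductB", Decimal("2000.00"), date(2023, 10, 15)),
    (3, "ProductA", Decimal("1200.00"), date(2023, 10, 16)),
    (4, "ProductC", Decimal("1800.00"), date(2023, 10, 16)),
    (5, "ProductB", Decimal("2200.00"), date(2023, 10, 17)),
]


def group_by_partition(rows):
    """Group rows by transaction_date, the column named in PARTITION BY."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[3]].append(row)
    return dict(partitions)


partitions = group_by_partition(rows)
for day, members in sorted(partitions.items()):
    print(day.isoformat(), [r[0] for r in members])
# 2023-10-15 [1, 2]
# 2023-10-16 [3, 4]
# 2023-10-17 [5]
```

A query that filters on a single transaction_date only needs to look at one of these groups, which is the main reason to partition on a column that queries commonly filter by.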
Branching and Merging with Nessie
Return to Dremio at http://localhost:9047. Begin by querying a table on the main branch using the AT BRANCH syntax:
SELECT * FROM nessie.SalesData AT BRANCH main;
Create an ETL (extract, transform and load) branch so that you can experiment with and transform data without impacting production.
CREATE BRANCH etl_06092023 IN nessie;
Within the ETL branch, insert new data into the table:
USE BRANCH etl_06092023 IN nessie;
INSERT INTO nessie.SalesData (id, product_name, sales_amount, transaction_date)
VALUES (6, 'ProductC', 1400.00, '2023-10-18');
Confirm the immediate availability of the new data within the ETL branch:
SELECT * FROM nessie.SalesData AT BRANCH etl_06092023;
Confirm that the change is isolated from users on the main branch, where the new row does not appear:
SELECT * FROM nessie.SalesData AT BRANCH main;
Merge the changes from the ETL branch back into the main branch:
MERGE BRANCH etl_06092023 INTO main IN nessie;
Query the main branch again to see that the changes have indeed been merged.
SELECT * FROM nessie.SalesData AT BRANCH main;
This branching strategy enables data engineers to independently handle numerous transactions across multiple tables. When they are ready, data engineers can merge these transactions into a single, comprehensive multi-table transaction within the main branch.
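To make the multi-table point concrete, here is a minimal sketch that models each branch as a name pointing at an immutable snapshot of all tables. Everything in it is an assumption made for illustration: the Inventory table is invented, the row ids are placeholders, and the merge is reduced to a single reference update. A real merge also has to check for conflicting changes.

```python
from types import MappingProxyType

# Hypothetical snapshots: each maps a table name to a tuple of row ids.
# "Inventory" is an invented second table, used only for this illustration.
main_snapshot = MappingProxyType({
    "SalesData": (1, 2, 3, 4, 5),
    "Inventory": (100, 101),
})

# Work on the ETL branch produces a new snapshot that touches BOTH tables.
etl_snapshot = MappingProxyType({
    "SalesData": main_snapshot["SalesData"] + (6,),
    "Inventory": main_snapshot["Inventory"] + (102,),
})

branches = {"main": main_snapshot, "etl": etl_snapshot}


def read(branch, table):
    """Return the rows of a table as seen from the given branch."""
    return branches[branch][table]


# Before the merge, readers of main see neither change.
before = (read("main", "SalesData"), read("main", "Inventory"))

# The merge is modeled as one reference update, so both table changes
# become visible on main together, never one without the other.
branches["main"] = branches["etl"]

after = (read("main", "SalesData"), read("main", "Inventory"))
```

The design point is that readers of main never observe a half-applied state: they see either the old snapshot of every table or the new one.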
This blog post explored the power of Git-like version control in data engineering, emphasizing how Nessie manages data versions, branches and merges. The step-by-step guide demonstrated how Nessie, working with Dremio and with MinIO as the object storage foundation, enhances data quality and collaboration in data engineering workflows.