Query Iceberg Tables on MinIO with Dremio
MinIO is ideal for storing the unstructured, semi-structured and structured data that feed enterprise data lakes. Data lakes form the backbone of advanced analytics, AI/ML and the data-driven business overall. The architecture that has emerged as a best practice is to combine cloud-native data management, data exploration and analytics tools with data saved to MinIO in Apache Iceberg, an open table format. The result is a data lake that is scalable, performant, efficient and seamlessly integrated with other cloud-native tools.
In the previous blog post, Dremio and MinIO on Kubernetes for Fast Scalable Analytics, we discussed how to set up Dremio on Kubernetes and query data residing in MinIO. MinIO is S3-compatible, Kubernetes-native and the fastest object storage on the planet. Packed with features and capabilities to help you make the most of your data lake, MinIO guarantees durability and immutability. Functionality is complemented by security: MinIO encrypts data in transit and on drives, and regulates access to data using IAM and policy based access controls (PBAC).
In this post, we will set up Dremio to read files like CSV from Minio, and we will also access the Apache Iceberg table that was created using Spark as shown here. If you haven't set up Dremio and MinIO yet, you’ll need to follow the walkthrough in the previous post. To learn more about building data lakes with Iceberg and MinIO, please see The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.
Add MinIO as Datasource
Once Dremio is up and running, login and click on Add Source
at the bottom left
Then select Amazon S3
under Object Storage
Fill in the details like the Name
of the connector, AWS Access Key
, AWS Access Secret
and under Buckets
, please add openlake
as shown below
Next, choose the Advanced Options
on the left side of the menu, click to Enable compatibility mode
, and add 2 new Connection Properties
- fs.s3a.endpoint - play.min.io
- fs.s3a.path.style.access - true
Add openlake
to the Allowlisted buckets
and hit save
as shown in the image below
Accessing CSV File
Let's use the taxi-data.csv
that we used earlier in the Spark-Iceberg blog post. If you haven’t already completed that tutorial, then please follow these instructions to get the data into MinIO. Click on the openlake
datasource that we just setup
Navigate to openlake/spark/sample-data
and you should see the taxi-data.csv
file. Click on Format File
as shown below
Dremio should be able to infer the schema of the CSV file, but we need to tweak some things as shown below
Click on Save
and you should be navigated to the SQL Editor. Let’s run a simple query to see the data
SELECT count(*) FROM openlake.openlake.spark."sample-data"."taxi-data.csv"; |
It will take some time to load the data and compute the count. Once done you should see the result as shown below
It should take a little over 2 mins to complete the above query, depending on the size of the data and the compute resources available to Dremio.
We can perform other query operations, but we will not be able to use Dremio to alter the column names or time travel to previous versions of the data. In order to do that we will use Apache Iceberg.
Accessing Iceberg Table
We will continue to use the nyc.taxis_large
Iceberg table that we created in this post to access the data using Dremio. Click on the openlake
datasource that we just set up
Navigate to openlake/warehouse/nyc
and you should see taxis_large
. Click on the Format File
and Dremio should be able to infer the schema of the Iceberg table as shown below
Click on Save
and you will be taken to the SQL Editor. Let’s run a simple query to see the data
SELECT count(*) FROM openlake.openlake.warehouse.nyc.taxis_large; |
Now we’ve seen the first benefit that Iceberg brings to our data lake. The query execution time is much faster than it was for the CSV file, <1s
, as shown below
Note: we are seeing all the records in the table since we switched to the earliest snapshot of the table via Iceberg’s time travel when following the instructions in Manage Iceberg Tables with Spark.
At the time of this writing, Dremio does not support time travel via snapshots or provide access to any metadata tables, so we will not be able to fully leverage Iceberg’s features and access the data in the previous snapshots. We can use processing engines like Spark to access the data in previous snapshots.
Update Table Operations
Let's populate the fare_per_distance
column by dividing the fare
by trip
and then query the data. You can execute the following queries in the SQL Editor all at once
SELECT VendorID |
In the above query, we are fetching the first 15 records from the Iceberg table. There we will notice fare_per_distance
is null
, so then we execute the UPDATE
query to populate the fare_per_distance
column. Finally we fetch the first 15 records again and we now see that the fare_per_distance
column is populated.
Delete Table Operations
Let's try to delete some records from the table. You can execute the following queries in the SQL Editor all at once
DELETE FROM openlake.openlake.warehouse.nyc.taxis_large WHERE fare_per_distance > 4.0 OR distance > 2.0; |
Above we are deleting records where fare_per_distance
is greater than 4.0 or distance is greater than 2.0 and then we are deleting records where fare_per_distance
is null, and finally we are counting the number of records in the table.
Monitor Jobs
We can monitor the jobs that are running or already completed in Dremio by clicking on the Jobs
tab on the left side of the menu as shown below
Jobs can be filtered based on various parameters depending on the user's needs.
We can also see the details of the job like the memory usage breakdown of the total execution time etc., by clicking on the job name as shown below.
We have now set up Dremio to read data from MinIO and access the Iceberg table created using Spark. The process we have demonstrated can easily be applied to analytics on your data lake.
Stay Cool with Iceberg, Dremio and MinIO
This blog post walked you through working with an Iceberg table saved to MinIO using Dremio. The open source community around Apache Iceberg is active, vibrant and robust. New functionality and integrations are being added constantly. Iceberg’s rapid adoption means that there are plenty of compatible application frameworks and learning resources available.
MinIO is built to power data lakes and the analytics and AI that runs on top of them. MinIO includes a number of optimizations for working with large datasets consisting of many small files, a common occurrence with any of today’s open table formats. The combination of scalability and high-performance puts every workload, no matter how demanding, within reach. See our latest benchmark for more details.
Try Dremio on MinIO today. If you have any questions or want to share tips, please reach out through our Slack channel or drop us a note on hello@min.io.