Query Iceberg Tables on MinIO with Dremio
MinIO is ideal for storing the unstructured, semi-structured and structured data that feed enterprise data lakes. Data lakes form the backbone of advanced analytics, AI/ML and the data-driven business overall. The architecture that has emerged as a best practice is to combine cloud-native data management, data exploration and analytics tools with data saved to MinIO in Apache Iceberg, an open table format. The result is a data lake that is scalable, performant, efficient and seamlessly integrated with other cloud-native tools.
In the previous blog post, Dremio and MinIO on Kubernetes for Fast Scalable Analytics, we discussed how to set up Dremio on Kubernetes and query data residing in MinIO. MinIO is S3-compatible, Kubernetes-native and the fastest object storage on the planet. Packed with features and capabilities to help you make the most of your data lake, MinIO guarantees durability and immutability. Functionality is complemented by security: MinIO encrypts data in transit and on drives, and regulates access to data using IAM and policy-based access control (PBAC).
In this post, we will set up Dremio to read files such as CSVs from MinIO, and we will also access the Apache Iceberg table that was created using Spark as shown here. If you haven't set up Dremio and MinIO yet, you'll need to follow the walkthrough in the previous post. To learn more about building data lakes with Iceberg and MinIO, please see The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.
Add MinIO as Datasource
Once Dremio is up and running, log in and click on Add Source at the bottom left. Then select Amazon S3 under Object Storage.
Fill in the details: the Name of the connector, the AWS Access Key and the AWS Access Secret. Under Buckets, add openlake as shown below.
Next, choose the Advanced Options on the left side of the menu, click Enable compatibility mode, and add two new Connection Properties:
- fs.s3a.endpoint: play.min.io
- fs.s3a.path.style.access: true
Add openlake to the Allowlisted buckets and hit Save, as shown in the image below.
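Once the source is saved, you can quickly verify that Dremio can see it from the SQL Editor. Below is a minimal sanity check, assuming the source was named openlake as in the steps above; it uses Dremio's built-in INFORMATION_SCHEMA metadata catalog:

```sql
-- List the schemas Dremio has discovered under the new MinIO source.
-- Assumes the source was named "openlake" in the steps above.
SELECT SCHEMA_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME LIKE 'openlake%';
```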
Accessing a CSV File
Let's use the taxi-data.csv file that we used earlier in the Spark-Iceberg blog post. If you haven't already completed that tutorial, please follow these instructions to get the data into MinIO. Click on the openlake datasource that we just set up, navigate to openlake/spark/sample-data, and you should see the taxi-data.csv file. Click on Format File as shown below.
Dremio should be able to infer the schema of the CSV file, but we need to tweak a few of the format settings (for example, extracting the field names from the header row) as shown below.
Click on Save and you will be taken to the SQL Editor. Let's run a simple query to see the data.
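For example, a simple row count over the newly promoted dataset. This is a sketch that assumes the source name and folder layout used above; path segments containing hyphens or dots must be double-quoted in Dremio SQL:

```sql
-- Count the rows in the CSV dataset we just formatted.
-- The path assumes the openlake source and folders shown above.
SELECT COUNT(*) AS total_trips
FROM openlake.spark."sample-data"."taxi-data.csv";
```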
It will take some time to load the data and compute the count. Once done, you should see the result shown below. The query should take a little over 2 minutes to complete, depending on the size of the data and the compute resources available to Dremio.
We can perform other query operations on the CSV, but we cannot use Dremio to rename columns or time travel to previous versions of the data. To do that, we will use Apache Iceberg, as shown in the next section.
Accessing the Iceberg Table
We will continue with the nyc.taxis_large Iceberg table that we created in this post and access the data using Dremio. Click on the openlake datasource that we just set up, navigate to openlake/warehouse/nyc, and you should see taxis_large. Click on Format File, and Dremio should be able to infer the schema of the Iceberg table as shown below.
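From here the table can be queried directly in the SQL Editor, including Iceberg-specific operations like snapshot inspection and time travel. The sketch below assumes the paths used above; the snapshot ID in the last query is a placeholder you would copy from the snapshot listing:

```sql
-- Query the Iceberg table through the MinIO source.
SELECT COUNT(*) AS total_trips
FROM openlake.warehouse.nyc.taxis_large;

-- List the table's snapshots (supported in recent Dremio releases).
SELECT * FROM TABLE(table_snapshot('openlake.warehouse.nyc.taxis_large'));

-- Time travel to an earlier snapshot. The ID below is a placeholder;
-- use a snapshot_id returned by the listing above.
SELECT COUNT(*) AS total_trips
FROM openlake.warehouse.nyc.taxis_large AT SNAPSHOT '5735629579409574973';
```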