Query Iceberg Tables on MinIO with Dremio

MinIO is ideal for storing the unstructured, semi-structured and structured data that feed enterprise data lakes. Data lakes form the backbone of advanced analytics, AI/ML and the data-driven business overall. The architecture that has emerged as a best practice is to combine cloud-native data management, data exploration and analytics tools with data saved to MinIO in Apache Iceberg, an open table format. The result is a data lake that is scalable, performant, efficient and seamlessly integrated with other cloud-native tools.

In the previous blog post, Dremio and MinIO on Kubernetes for Fast Scalable Analytics, we discussed how to set up Dremio on Kubernetes and query data residing in MinIO. MinIO is S3-compatible, Kubernetes-native and the fastest object storage on the planet. Packed with features and capabilities to help you make the most of your data lake, MinIO guarantees durability and immutability. Functionality is complemented by security: MinIO encrypts data in transit and on drives, and regulates access to data using IAM and policy-based access control (PBAC).

In this post, we will set up Dremio to read files such as CSV from MinIO, and we will also access the Apache Iceberg table that was created using Spark as shown here. If you haven't set up Dremio and MinIO yet, you'll need to follow the walkthrough in the previous post. To learn more about building data lakes with Iceberg and MinIO, please see The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.

Add MinIO as a Data Source

Once Dremio is up and running, log in and click Add Source at the bottom left

add_source

Then select Amazon S3 under Object Storage

select_s3

Fill in the details: the name of the connector, the AWS Access Key and the AWS Access Secret. Under Buckets, add openlake as shown below

s3-details

Next, choose Advanced Options on the left side of the menu, check Enable compatibility mode, and add two new Connection Properties

  • fs.s3a.endpoint - play.min.io
  • fs.s3a.path.style.access - true

Add openlake to the Allowlisted buckets and hit Save as shown in the image below

s3-details2

Accessing a CSV File

Let's use the taxi-data.csv file that we used earlier in the Spark-Iceberg blog post. If you haven't already completed that tutorial, please follow these instructions to get the data into MinIO. Click on the openlake data source that we just set up

source

Navigate to openlake/spark/sample-data and you should see the taxi-data.csv file. Click on Format File as shown below

format-file

Dremio should be able to infer the schema of the CSV file, but we need to tweak some things as shown below

schema

Click Save and you will be taken to the SQL Editor. Let's run a simple query to see the data

SELECT count(*) FROM openlake.openlake.spark."sample-data"."taxi-data.csv";

It will take some time to load the data and compute the count. Once done, you should see the result as shown below

count

The above query should take a little over two minutes to complete, depending on the size of the data and the compute resources available to Dremio.
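
From here you can explore the dataset with ordinary SQL as well. As a rough sketch, a grouped aggregation might look like the following; passenger_count and total_amount are illustrative NYC-taxi column names, and the aggregation assumes those columns were given numeric types when you formatted the file:

-- Trips and average total fare per passenger count (illustrative columns)
SELECT passenger_count,
       count(*) AS trips,
       avg(total_amount) AS avg_total
FROM openlake.openlake.spark."sample-data"."taxi-data.csv"
GROUP BY passenger_count
ORDER BY trips DESC;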

We can perform other query operations, but we will not be able to use Dremio to alter the column names or time travel to previous versions of the data. To do that, we will use Apache Iceberg, as sketched below.
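
As a preview of what Iceberg unlocks, Iceberg's Spark DDL can rename a column in place without rewriting data files, something we could not do against the raw CSV in Dremio. A minimal, illustrative sketch run from the Spark session used in the earlier post (fare_amount is an assumed column name; skip running it if you want to keep the original schema for the steps below):

-- Illustrative Iceberg DDL in Spark SQL: rename a column without rewriting data
ALTER TABLE nyc.taxis_large RENAME COLUMN fare_amount TO fare;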

Accessing the Iceberg Table

We will continue to use the nyc.taxis_large Iceberg table that we created in this post, and access the data using Dremio. Click on the openlake data source that we just set up

source

Navigate to openlake/warehouse/nyc and you should see taxis_large. Click on Format File and Dremio should be able to infer the schema of the Iceberg table as shown below
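
Once the table is formatted, it can be queried like any other dataset, and because it is a native Iceberg table, recent Dremio versions can also time travel across its snapshots. A hedged sketch follows: the dataset path assumes the same source.bucket pattern used for the CSV query above, and the snapshot ID is a placeholder you would replace with a real one from the table's history:

-- Query the Iceberg table; the path follows the source.bucket pattern used above
SELECT count(*) FROM openlake.openlake.warehouse.nyc.taxis_large;

-- Time travel to a previous snapshot (replace the placeholder ID with one from the table's history)
SELECT count(*) FROM openlake.openlake.warehouse.nyc.taxis_large AT SNAPSHOT '1234567890123456789';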