Query Iceberg Tables on MinIO with Dremio

MinIO is ideal for storing the unstructured, semi-structured and structured data that feed enterprise data lakes. Data lakes form the backbone of advanced analytics, AI/ML and the data-driven business overall. The architecture that has emerged as a best practice is to combine cloud-native data management, data exploration and analytics tools with data saved to MinIO in Apache Iceberg, an open table format. The result is a data lake that is scalable, performant, efficient and seamlessly integrated with other cloud-native tools.

In the previous blog post, Dremio and MinIO on Kubernetes for Fast Scalable Analytics, we discussed how to set up Dremio on Kubernetes and query data residing in MinIO. MinIO is S3-compatible, Kubernetes-native and the fastest object storage on the planet. Packed with features and capabilities to help you make the most of your data lake, MinIO guarantees durability and immutability. Functionality is complemented by security: MinIO encrypts data in transit and on drives, and regulates access to data using IAM and policy-based access control (PBAC).

In this post, we will set up Dremio to read files such as CSVs from MinIO, and we will also access the Apache Iceberg table that was created using Spark as shown here. If you haven't set up Dremio and MinIO yet, you’ll need to follow the walkthrough in the previous post. To learn more about building data lakes with Iceberg and MinIO, please see The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.

Add MinIO as Datasource

Once Dremio is up and running, log in and click Add Source at the bottom left

add_source

Then select Amazon S3 under Object Storage

select_s3

Fill in the details: the Name of the connector, the AWS Access Key and AWS Access Secret, and under Buckets add openlake as shown below

s3-details

Next, choose Advanced Options on the left side of the menu, click to Enable compatibility mode, and add two new Connection Properties

  • fs.s3a.endpoint - play.min.io
  • fs.s3a.path.style.access - true

Add openlake to the Allowlisted buckets and hit Save as shown in the image below

s3-details2

Accessing CSV File

Let's use the taxi-data.csv file that we used earlier in the Spark-Iceberg blog post. If you haven’t already completed that tutorial, please follow these instructions to get the data into MinIO. Click on the openlake datasource that we just set up

source

Navigate to openlake/spark/sample-data and you should see the taxi-data.csv file. Click on Format File as shown below

format-file

Dremio should be able to infer the schema of the CSV file, but we need to tweak a few of the format settings as shown below

schema

Click on Save and you will be taken to the SQL Editor. Let’s run a simple query to see the data

SELECT count(*) FROM openlake.openlake.spark."sample-data"."taxi-data.csv";

It will take some time to load the data and compute the count. Once done, you should see the result as shown below

count

The above query should take a little over two minutes to complete, depending on the size of the data and the compute resources available to Dremio.
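
We can also run other ad-hoc queries directly against the CSV. Below is a sketch of a simple aggregation; the column names VendorID and passenger_count are assumptions based on the standard NYC yellow taxi schema and may differ from the headers in your file

-- Average passenger count per vendor, computed straight from the CSV
-- (column names are assumed; Dremio reads CSV columns as text, hence the CAST)
SELECT VendorID, AVG(CAST(passenger_count AS DOUBLE)) AS avg_passengers
FROM openlake.openlake.spark."sample-data"."taxi-data.csv"
GROUP BY VendorID;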

While we can run query operations like these against the raw CSV, we will not be able to use Dremio to alter the column names or time travel to previous versions of the data. In order to do that we will use Apache Iceberg.

Accessing Iceberg Table

We will continue to use the nyc.taxis_large Iceberg table that we created in this post to access the data using Dremio. Click on the openlake datasource that we just set up

source

Navigate to openlake/warehouse/nyc and you should see taxis_large. Click on Format File and Dremio should be able to infer the schema of the Iceberg table as shown below

iceberg_schema

Click on Save and you will be taken to the SQL Editor. Let’s run a simple query to see the data

SELECT count(*) FROM openlake.openlake.warehouse.nyc.taxis_large;

Now we’ve seen the first benefit that Iceberg brings to our data lake: the query completes in under a second, far faster than the same count over the CSV file, as shown below

count_iceberg

Note: we are seeing all the records in the table since we switched to the earliest snapshot of the table via Iceberg’s time travel when following the instructions in Manage Iceberg Tables with Spark.

At the time of this writing, Dremio does not support time travel via snapshots or provide access to any metadata tables, so we will not be able to fully leverage these Iceberg features from Dremio. We can instead use a processing engine like Spark to access the data in previous snapshots, as sketched below.
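
Here is a minimal sketch of what that looks like in Spark SQL, assuming the Spark session and Iceberg catalog configured in the earlier post (the snapshot ID below is a placeholder, your table may need a catalog prefix, and on older Spark/Iceberg versions the FOR SYSTEM_VERSION AS OF form may be required instead)

-- Run from Spark, not Dremio: list the snapshots Iceberg has recorded for the table
SELECT committed_at, snapshot_id, operation FROM nyc.taxis_large.snapshots;

-- Query the table as it existed at a specific snapshot (placeholder snapshot ID)
SELECT count(*) FROM nyc.taxis_large VERSION AS OF 1234567890123456789;

-- Or travel by timestamp
SELECT count(*) FROM nyc.taxis_large TIMESTAMP AS OF '2023-01-01 00:00:00';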

Update Table Operations

Let's populate the fare_per_distance column by dividing the fare by the distance and then query the data. You can execute the following queries in the SQL Editor all at once

SELECT VendorID
        ,tpep_pickup_datetime
        ,tpep_dropoff_datetime
        ,fare
        ,distance
        ,fare_per_distance
        FROM openlake.openlake.warehouse.nyc.taxis_large LIMIT 15;

UPDATE openlake.openlake.warehouse.nyc.taxis_large SET fare_per_distance = fare/distance WHERE distance > 0;

SELECT VendorID
        ,tpep_pickup_datetime
        ,tpep_dropoff_datetime
        ,fare
        ,distance
        ,fare_per_distance
        FROM openlake.openlake.warehouse.nyc.taxis_large LIMIT 15;

In the queries above, we first fetch 15 records from the Iceberg table and notice that fare_per_distance is null. We then execute the UPDATE query to populate the fare_per_distance column. Finally, we fetch 15 records again and see that fare_per_distance is now populated.

query-1
query-2
query-3

Delete Table Operations

Let's try to delete some records from the table. You can execute the following queries in the SQL Editor all at once

DELETE FROM openlake.openlake.warehouse.nyc.taxis_large WHERE fare_per_distance > 4.0 OR distance > 2.0;
DELETE FROM openlake.openlake.warehouse.nyc.taxis_large WHERE fare_per_distance IS NULL;

SELECT count(*) FROM openlake.openlake.warehouse.nyc.taxis_large;

query-4
query-5
query-6

Above, we delete records where fare_per_distance is greater than 4.0 or distance is greater than 2.0, then delete records where fare_per_distance is null, and finally count the number of records remaining in the table.
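
As a quick sanity check, here is a sketch of a query to confirm that no remaining rows match the delete predicates

-- Should return 0 if both DELETE statements applied cleanly
SELECT count(*)
FROM openlake.openlake.warehouse.nyc.taxis_large
WHERE fare_per_distance > 4.0 OR distance > 2.0 OR fare_per_distance IS NULL;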

Monitor Jobs

We can monitor the jobs that are running or already completed in Dremio by clicking on the Jobs tab on the left side of the menu as shown below

jobs

Jobs can be filtered based on various parameters depending on the user's needs.

We can also see the details of a job, such as memory usage and a breakdown of the total execution time, by clicking on the job name as shown below.

job-details

We have now set up Dremio to read data from MinIO and access the Iceberg table created using Spark. The process we have demonstrated can easily be applied to analytics on your data lake.

Stay Cool with Iceberg, Dremio and MinIO

This blog post walked you through working with an Iceberg table saved to MinIO using Dremio. The open source community around Apache Iceberg is active, vibrant and robust. New functionality and integrations are being added constantly. Iceberg’s rapid adoption means that there are plenty of compatible application frameworks and learning resources available.

MinIO is built to power data lakes and the analytics and AI that run on top of them. MinIO includes a number of optimizations for working with large datasets consisting of many small files, a common occurrence with any of today’s open table formats. The combination of scalability and high performance puts every workload, no matter how demanding, within reach. See our latest benchmark for more details.

Try Dremio on MinIO today. If you have any questions or want to share tips, please reach out through our Slack channel or drop us a note on hello@min.io.


