MinIO is ideal for storing the unstructured, semi-structured and structured data that feed enterprise data lakes. Data lakes form the backbone of advanced analytics, AI/ML and the data-driven business overall. The architecture that has emerged as a best practice is to combine cloud-native data management, data exploration and analytics tools with data saved to MinIO in Apache Iceberg, an open table format. The result is a data lake that is scalable, performant, efficient and seamlessly integrated with other cloud-native tools.
In the previous blog post, Dremio and MinIO on Kubernetes for Fast Scalable Analytics, we discussed how to set up Dremio on Kubernetes and query data residing in MinIO. MinIO is S3-compatible, Kubernetes-native and the fastest object storage on the planet. Packed with features and capabilities to help you make the most of your data lake, MinIO guarantees durability and immutability. Functionality is complemented by security: MinIO encrypts data in transit and on drives, and regulates access to data using IAM and policy-based access control (PBAC).
In this post, we will set up Dremio to read files such as CSV from MinIO, and we will also access the Apache Iceberg table that was created using Spark as shown here. If you haven't set up Dremio and MinIO yet, you'll need to follow the walkthrough in the previous post. To learn more about building data lakes with Iceberg and MinIO, please see The Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.
Add MinIO as Datasource
Once Dremio is up and running, log in, click Add Source at the bottom left and select Amazon S3. Fill in the details: the Name of the connector, the AWS Access Key, the AWS Access Secret, and under Buckets add openlake as shown below.

Next, open Advanced Options on the left side of the menu, click Enable compatibility mode, and add 2 new connection properties:

- fs.s3a.endpoint - play.min.io
- fs.s3a.path.style.access - true

Then add openlake to the Allowlisted buckets and hit Save as shown in the image below.
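It can help to keep these connection properties in one place and sanity-check them before pasting them into the Dremio UI. A minimal sketch in Python — the endpoint `play.min.io` comes from the walkthrough above (a private deployment would use its own endpoint), and the `validate` helper is purely illustrative:

```python
# Sketch: the advanced connection properties from the walkthrough,
# expressed as a plain dict so they can be reviewed or templated.
S3A_PROPERTIES = {
    "fs.s3a.endpoint": "play.min.io",    # MinIO server endpoint
    "fs.s3a.path.style.access": "true",  # MinIO needs path-style URLs
}

def validate(props: dict) -> None:
    """Fail fast if a required S3A property is missing or empty."""
    for key in ("fs.s3a.endpoint", "fs.s3a.path.style.access"):
        if not props.get(key):
            raise ValueError(f"missing required property: {key}")

validate(S3A_PROPERTIES)
print(S3A_PROPERTIES["fs.s3a.endpoint"])  # → play.min.io
```

Path-style access matters here because MinIO addresses buckets in the URL path rather than as DNS subdomains, which is what Dremio's compatibility mode expects.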
Accessing CSV File
Let's use the taxi-data.csv file that we used earlier in the Spark-Iceberg blog post. If you haven't already completed that tutorial, then please follow these instructions to get the data into MinIO. Click on the openlake datasource that we just set up, navigate to openlake/spark/sample-data, and you should see the taxi-data.csv file. Click on Format File as shown below.
Dremio should be able to infer the schema of the CSV file, but we need to tweak a few settings as shown below.
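Under the hood, Dremio's inference does roughly what any CSV sniffer does: detect the delimiter and quoting, and decide whether the first row is a header. A rough local approximation using Python's standard csv module — the sample rows below are made up and only mimic the shape of taxi data, not the real taxi-data.csv:

```python
import csv
import io

# Illustrative sample in the shape of a taxi-trip CSV;
# the real taxi-data.csv columns may differ.
sample = """VendorID,tpep_pickup_datetime,passenger_count,total_amount
1,2023-01-01 00:15:00,2,14.30
2,2023-01-01 00:20:00,1,9.80
"""

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # detect delimiter and quoting
has_header = sniffer.has_header(sample)  # first row looks like a header?

reader = csv.reader(io.StringIO(sample), dialect)
columns = next(reader)
print(has_header, columns)
# → True ['VendorID', 'tpep_pickup_datetime', 'passenger_count', 'total_amount']
```

If the sniffer (or Dremio) gets one of these settings wrong — a header row read as data, or the wrong delimiter — you end up with the kind of schema mismatch the Format File dialog lets you correct by hand.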
Hit Save and you will be taken to the SQL Editor. Let's run a simple query to see the data.
It will take some time to load the data and compute the count; once done, you should see the result as shown below. The query should take a little over two minutes to complete, depending on the size of the data and the compute resources available to Dremio.
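For reference, the count query has the same shape as a count in any ANSI SQL dialect. Here is a local illustration using Python's built-in sqlite3 as a stand-in for the Dremio SQL editor — the table name and rows are made up for the example:

```python
import sqlite3

# sqlite3 stands in for Dremio here; in Dremio you would run the
# same COUNT(*) query against the formatted taxi-data.csv dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_data (VendorID INTEGER, total_amount REAL)")
conn.executemany(
    "INSERT INTO taxi_data VALUES (?, ?)",
    [(1, 14.30), (2, 9.80), (1, 21.50)],
)

(count,) = conn.execute("SELECT COUNT(*) FROM taxi_data").fetchone()
print(count)  # → 3
```

In Dremio the same query fans out across executors and scans the object store, which is why it takes minutes rather than milliseconds on a large file.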
We can perform other query operations, but Dremio alone will not let us alter the column names or time travel to previous versions of the data. In order to do that, we will use Apache Iceberg.
Accessing Iceberg Table
We will continue to use the nyc.taxis_large Iceberg table that we created in this post to access the data using Dremio. Click on the openlake datasource that we just set up, navigate to openlake/warehouse/nyc, and you should see taxis_large. Click on Format File, and Dremio should be able to infer the schema of the Iceberg table as shown below.
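Dremio can recognize taxis_large as an Iceberg table because an Iceberg table on object storage is just a directory of data files plus a metadata/ directory of snapshot and manifest files. A minimal sketch of that layout on the local filesystem — the file names are illustrative, not the exact names Spark generated:

```python
import pathlib
import tempfile

# Sketch of the on-disk layout Dremio discovers for an Iceberg table:
# a data/ directory of Parquet files and a metadata/ directory of
# JSON snapshots (plus Avro manifests in a real table).
root = pathlib.Path(tempfile.mkdtemp()) / "warehouse" / "nyc" / "taxis_large"
(root / "data").mkdir(parents=True)
(root / "metadata").mkdir()
(root / "data" / "00000-0-data.parquet").touch()
(root / "metadata" / "v1.metadata.json").touch()

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
print(layout)
# → ['data', 'data/00000-0-data.parquet', 'metadata', 'metadata/v1.metadata.json']
```

It is the metadata directory — not the data files — that gives Iceberg its schema evolution and time travel: each snapshot records the schema and file list as of a point in time, which is exactly what Dremio reads when it infers the table's schema.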