Using MinIO as external tables to extend Snowflake

Snowflake is now the de facto standard for cloud data warehouse platforms. It is architected to support a variety of data tasks, from pipelines and ETL to analytics and governance. Traditionally, all data had to be moved into Snowflake before an enterprise could take advantage of its capabilities.

Snowflake, however, understands that enterprises want to integrate their data wherever it resides, without having to move it. With the introduction of external table support, enterprises can now do exactly that.

This has real impact, both on the economics of a Snowflake deployment and on the amount of data available to the Snowflake platform for analytics and data science.

The availability of external tables will not change the locations where Snowflake runs - it will still run exclusively in the three major public clouds (AWS, GCP, Azure). It will, however, remove the requirement that all the data be stored in Snowflake for Snowflake to operate on it.

MinIO is perhaps the biggest beneficiary of this change. MinIO is a high-performance, cloud-native object store. It runs on all flavors of Kubernetes (upstream, EKS, AKS, GKE, etc.), on virtual machines (public cloud VMs, VMware, etc.) and on bare metal hardware. Because of this, MinIO can become the global datastore for Snowflake customers - wherever their data sits.

To guarantee that the right data is available to the right user, it is imperative that there be fine-grained access control on these multi-cloud data lakes. MinIO’s ability to integrate with third-party IDPs and its sophisticated Policy Based Access Control (PBAC) capabilities ensure that this is not an afterthought.

While Snowflake will support S3 endpoints (which naturally includes other object stores), those object stores are not capable of running in all of the places that an enterprise keeps its data. For example, appliances don’t run in the public cloud or on OpenShift or Tanzu. To achieve a consistent, data-anywhere strategy, enterprises will need to adopt MinIO.

Furthermore, Snowflake has become mission critical in the enterprise - so much so that resiliency is now being architected into these systems. That resiliency doesn’t just account for region failure in the cloud; it accounts for entire cloud failure.

Single-cloud reliance has proven to be poor architecture, as I have written about previously. MinIO, with its ability to be available everywhere and replicate between private, public and edge clouds, provides the only true solution for multi-cloud data availability.

Simple to Use

So, how simple is it for the end user? Imagine you have a bucket in MinIO called ‘Bucket1’. You can set it up so that you can run any SnowSQL command on it as if it were just another table.

For example: SELECT * FROM Bucket1;


Like most things in Snowflake, connecting to your data stored on MinIO is also relatively simple.

Here is what you need to get started:

Setting up Snowflake

External table support needs to be turned on for your Snowflake environment. If you get errors like "invalid URL prefix found in: 's3compat://snowflake/'", it probably means that it is not enabled. Please contact your Snowflake representative to have it enabled.

Setting up MinIO

There are only a couple of requirements on the MinIO setup to get it to work with Snowflake.

  1. MinIO has to be set up to allow DNS-style (virtual-hosted-style) access, and the bucket has to be publicly accessible.
  2. Currently, the region has to be set to NULL or to the same region as the Snowflake instance (e.g. ‘us-west-2’). If the regions do not match, the Snowflake query returns an error that tells you which region it expects.
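As a rough sketch, the MinIO side could be configured with the mc client like this (the alias name ‘myminio’ is hypothetical; older mc releases use ‘mc policy set download’ instead of ‘mc anonymous set download’, and DNS-style access is enabled by setting MINIO_DOMAIN on the server before it starts):

```shell
# DNS-style (virtual-hosted) access: set on the MinIO server before starting it.
export MINIO_DOMAIN=play.min.io

# Register the deployment with the mc client (hypothetical alias name).
mc alias set myminio https://play.min.io <access-key> <secret-key>

# Allow anonymous (public) read access to the bucket.
mc anonymous set download myminio/snowflake

# Set the server region to match the Snowflake deployment, then restart.
mc admin config set myminio/ region name="us-west-2"
mc admin service restart myminio/
```

These commands require admin credentials on the MinIO deployment; adjust the alias, bucket and region for your environment.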

In the example below, the bucket “snowflake” has four objects in it and one more object under the folder/sub-prefix “sn1”. Snowflake will fetch all of these objects.

To integrate this with Snowflake, if the MinIO server is at https://play.min.io, the bucket should be accessible at https://snowflake.play.min.io.

The sample files are available at https://docs.snowflake.com/en/user-guide/getting-started-tutorial-prerequisites.html#sample-data-files-for-loading



There are two ways that MinIO buckets can be integrated with Snowflake. The first is as a “Staging Area” and the second is as an external table. Let’s look at both.

Staging Area Access

As part of an ETL process, data is normally selected and prepared from a data source or a data lake, moved into a staging area and then loaded into a Warehouse so that queries can be run against it.

Traditionally, these staging areas for Snowflake were in Snowflake itself, so this was the flow:

Data lake (on-prem, public cloud, etc.)  -> Snowflake staging -> Snowflake internal table

Enterprises can now set up MinIO as a staging area and move the data directly into an internal Snowflake table to run queries against it.

This does the following:

  • Simplifies the flow.
  • Makes data available for analysis faster, meeting business needs sooner.
  • Avoids unnecessary copies of the data, which helps with both governance and cost.

The new flow would be: Data in MinIO (on-prem, public cloud, etc.) -> Snowflake internal table

The following examples are shown in the Snowflake console, but the same can be run through the Snowflake CLI.

For play.min.io, use
AWS_KEY_ID='Q3AM3UQ867SPQQA43P2F'
AWS_SECRET_KEY='zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG'
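A stage over the MinIO bucket can be created and loaded roughly as follows (the stage name ‘minio_stage’ and target table ‘contacts’ are hypothetical, and the target table must already exist with columns matching the CSV layout; the s3compat:// URL form requires the S3-compatible/external table feature to be enabled, as noted above):

```sql
-- Hypothetical stage name; credentials are the public play.min.io keys above.
CREATE OR REPLACE STAGE minio_stage
  URL = 's3compat://snowflake/'
  ENDPOINT = 'play.min.io'
  CREDENTIALS = (AWS_KEY_ID = 'Q3AM3UQ867SPQQA43P2F'
                 AWS_SECRET_KEY = 'zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG');

-- Load the staged CSV files into an existing internal table (hypothetical name).
COPY INTO contacts
  FROM @minio_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

Once loaded, the data in ‘contacts’ behaves like any other internal Snowflake table.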


The reference for the Snowflake CREATE STAGE command is available here: https://docs.snowflake.com/en/sql-reference/sql/create-stage.html

External Table Access

Queries in Snowflake used to be performed only on internal tables. This meant that all data - even for ad hoc queries - had to be moved into Snowflake first, which added cost and made it impossible to query in a timely manner.

Enterprises can now directly access data in MinIO buckets using the new external table access capabilities introduced by Snowflake.

The time taken by the initial query will depend on the amount of data being transferred, but reads are cached, so subsequent queries may be faster until the next refresh.

Using the external table approach, the data does not need to be copied; the bucket itself can be used as an external table for queries, joins and more.

In exchange for that initial transfer cost, the benefits are huge:

  • It extends the capabilities of the warehouse without incurring the cost of a move.
  • Analysis can be run on real-time data.
  • Moving data just to run an ad hoc query can be avoided entirely.
  • Analysis is possible even when the data cannot be moved for compliance or other business reasons.
  • You still get all the advantages of Snowflake, with the same staff who are already familiar with the platform.

An example is shown below:
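As a sketch, an external table over the bucket might look like this (the table name ‘contacts_ext’ is hypothetical, and it assumes a stage such as the ‘minio_stage’ described in the staging section has already been created over the bucket):

```sql
-- Hypothetical external table over a previously created stage.
CREATE OR REPLACE EXTERNAL TABLE contacts_ext
  WITH LOCATION = @minio_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  AUTO_REFRESH = FALSE;

-- Each row of an external table is exposed as a VARIANT column named VALUE;
-- for CSV files the fields appear as c1, c2, ... and can be cast as needed.
SELECT value:c1::STRING AS col1, value:c2::STRING AS col2
FROM contacts_ext;
```

No data is copied into Snowflake here; the objects stay in the MinIO bucket and are read at query time.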


The reference for the Snowflake CREATE EXTERNAL TABLE command is available here: https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html

Using Snowflake CLI (SnowSQL)

The same commands can be performed using the Snowflake CLI (SnowSQL).

You can find the instructions on how to install SnowSQL on your platform here: https://docs.snowflake.com/en/user-guide/snowsql-install-config.html

The only difference is that you have to start with the command

$ snowsql -a <account_identifier> -u <user_name>

The account identifier is of the format <organization-name>-<account-name>. You can find it on the Admin page in the Snowflake console.
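For example, a query can be run directly from the shell (account identifier, user and table name below are hypothetical; snowsql prompts for the password):

```shell
# -a: account identifier, -u: user, -q: query to run non-interactively.
snowsql -a myorg-myaccount -u analyst \
        -q "SELECT COUNT(*) FROM contacts_ext;"
```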

Summary

And that is it.

If you are a MinIO customer looking to add Snowflake capabilities, or a Snowflake customer looking to extend its capabilities to data stored outside Snowflake, give it a try. Either way, you get to extract more out of your current investment.
