Using External Tables to Store and Query Data on MinIO with SQL Server 2022

Moiz Kohari Moiz Kohari on SQL |
Using External Tables to Store and Query Data on MinIO with SQL Server 2022

Data is one thing, insight is another. There is a range of ways to extract insight from information - and that range comes with tradeoffs. If you think about this as a pyramid, you have AI at the top, ML just below it, advanced analytics below that and SQL as the foundation. This is a simplified, but representative picture of analytics in the enterprise.

The challenge is that AI/ML talent is really hard to come by. This isn’t necessarily news, but it was crystalized in conversations with colleagues who are responsible for running analytics for some of the largest Fortune 500 companies. While AI and ML promises to provide some of the deepest insights into data, setting up and leveraging these ML pipelines requires highly sophisticated data science capabilities. In many cases these are long ranging projects that take months if not years to complete, and may require retraining existing staff to learn these new data science paradigms.

On the other hand, SQL skills are everywhere in the organization and are the lingua franca of data. It is how the world speaks to databases. It is why 95% of the Fortune 500 runs Microsoft’s SQL Server. There may not be a more ubiquitous data technology in the enterprise.

With Microsoft SQL Server 2022, object storage becomes a first class citizen through a technique called external tables. There are two parts. First, an S3 connector. Second, enhancements to the Polybase data virtualization framework.

With this addition, enterprises can now query massive object stores with SQL Server 2022 - wherever those object stores exist. The new feature addresses two key customer challenges - ways to access the data where it is and not having to learn new languages.

With MinIO data can be anywhere - public cloud infrastructure, private clouds, Kubernetes distributions, edge.

Let’s dig a little deeper.

Access Data Anywhere with Minio and Microsoft SQL Server 2022 combination

Microsoft SQL Server 2022 has two key features that users of MinIO can take advantage of.

First, it enables database administrators to backup and restore data to any cloud environment using the S3 API. Second, SQL Server 2022 expands the external tables function that was initially introduced in SQL Server 2016.

External tables allow the user to run queries against data that is not stored directly in SQL Server. That frees up the user to have the architectural flexibility to run SQL Server where they want to and store their data where and how they want to, while retaining the ability to query that data without moving it. The new external tables functionality now allows it to reach all the typically supported file formats plus Parquet and S3-object oriented storage.

This will expand, dramatically, the amount of data that can be queried by SQL server.

For example, EnterpriseCo could be running SQL Server in Azure. Previously, they moved data into Azure/SQL server to do their queries. Now, EnterpriseCo can access that data onprem, without having to move the data anywhere. This means that those queries can run against PB sized datasets.

The architectural choices become almost infinite. Run SQL Server on prem and access data in the cloud. Run SQL Server on prem and access data in the private cloud. Run SQL Server in the cloud and access data at the edge. Build a fault-tolerant analytics system with active-active, multi-cloud replication using MinIO and SQL Server. The primary considerations are performance and those are determined by the relative locations of the SQL Server and the data.

The trick is to optimize the deployments. All object stores will benefit from the external table feature, but MinIO benefits disproportionately. The reason is simple, in a world where enterprise data is distributed across multiple clouds, consistency matters. MinIO can serve as the datastore on the public cloud, private cloud and edge. As the consistent data layer, security, access management, resilience and lifecycle management are also consistent. Performance can be consistent or can be tiered to the needs of the business - while everything else remains the same.

The mechanism to do this in SQL Server 2022 is through a data virtualization feature known as Polybase.

Polybase allows a user to query data using Transact-SQL directly from not only SQL Server but most other database installations (like Oracle, Teradata, MongoDB, etc.) and now S3 APIs. MinIO provides the unique capability to access all hyperscaler cloud environments. The combination of the two (SQL Server 2022 and MinIO) enables an enterprise to access data and drive insights from data silos that, up until now, were segmented and difficult to combine.

A key feature of PolyBase is the ability to allow data to stay in its original location and format. External data can be queried in place like any other table in SQL Server. This scenario minimizes the need for data movement and data replication. One of the biggest problems that data replication presents is reconciliation, enterprises waste thousands of man hours and compute power reconciling data on a nightly basis.

SQL Server 2022 allows for continuous replication of data to and from the cloud enabling disaster recovery capabilities. The combination with MinIO allows for fast data to reside on NVMe based fast data engines and tier it to any number of slower tiers. These components are running side by side to enable you to read, write, and process big data using Transact-SQL or Spark libraries, allowing you to easily combine and analyze your high-value relational data with non-relational high-volume big data.

Additionally, the SQL Server 2022 integrates with Active Directory and includes role-based access control to satisfy security and compliance needs of an enterprise.

Why This Matters

The significance of the addition of external tables cannot be overstated.

First, it is further evidence that object storage has become primary storage when it comes to data infrastructure. The SAN/NAS crowd doesn’t like this message but with every passing week another major database or data component turns on object storage in a first class citizen kind of way. There are not many holdouts and those that are holding out are looking pretty niche-y these days.

Second, this enables SQL queries to be executed against far larger datasets than ever before possible. In turn that generates the possibility of generating more value from existing data and new value from long-tail or “dark” data. SAN/NAS repositories generally hit a wall at a few PBs. Imagine running a query against 10s or 100s of PBs? That’s on the table at this point.

Third, this further cements the second generation disaggregation story we have been seeing for a couple of years now. The first generation being the separation of storage and compute. The second, parallel generation, being the analytics application layer (SQL or Splunk or Vertica) disaggregating the query function from the storage function. This is frankly everywhere and is in the best interest of the query players, the storage players and the customers. It means that customers will build best-of-breed, bespoke stacks for their use-case specific needs. It will emphasize performance, resilience, security, cloud-native-ness, scalability on the storage side - which will in turn demand the solution be software-defined.

Fourth, this meets customers where they are in terms of the multi-cloud. Customers don’t want to run SQL Server 2022 in one place - they want to run it wherever their data resides. The addition of external tables (and MinIO as a partner) delivers on that need. On-prem. Check. Azure. Check. Every other public cloud and Kubernetes distribution. Check.

This is a watershed moment for SQL Server and for MinIO. We are going to be the go-to object store for all the reasons outlined above and that will expand our already deep penetration at F500 accounts.

Customer Use Case Examples

We can’t get into too much detail at this stage - but we have several customers that have been eagerly awaiting the Early Access Window to start building toward the November GA date.

For example, one of the Fortune100 banks we work with stores large amounts of client behavioral data on S3 storage clusters. In order to run simple SQL queries to drive basic intelligence from this data they have used techniques in the past to ingest parts of this data into certain databases and then run queries against it. This client is signing up for the Microsoft early access program such that they can run queries against their data without the need to replicate it into new database instances. This will save them significant amounts of time and the data reconciliation headache that most banks need to deal with. Further, if this bank decides to employ machine learning techniques to drive more sophisticated insights from this data, they will have the ability to leverage MLOps platforms that can easily be scaled on Kubernetes clusters alongside their original implementation.

In another example, we have a large retail customer who needs to bring information from distributed point-of-sales systems into a centralized datastore that uses MinIO as the object storage platform for all nightly activity. The files may be in parquet or csv format and are transferred into buckets. In the past, the business analytics teams needed to import this data into their SQL Server environments to run batch analytics on this data. However, with the introduction of SQL Server 2022, clients like this one will have the ability to directly run analytics against the data sitting in the S3 buckets on MinIO. This will not only save them time but will also save them the expense associated with data replication and reconciliation. Just as importantly, the retailer (any any enterprise for that matter) will be able to leverage AI/ML pipelines to draw deeper insights directly from this data. In the case of Microsoft, the retailer has provided integration with Azure Synapse to drive deep ML pipelines. Other methods for driving this level of insights may leverage TensorFlow, Kubeflow and Pytorch all which have first class integration with MinIO.



What to expect

With today’s announcement, the Early Access window is open. There are a few things that are not present in this release (disaster recovery for example) but those will be ready shortly.

The expectation is for general release in November but we are encouraging all of our enterprise customers to start now. It will help with sizing to meet the requirements of the specific workload. Some customers will go big, others will keep it in the hundreds of TBs. The key is that properly configured, both will go fast.

What’s Next

We are going to do a webinar with Hugo Queiroz on DevOps.com on July 12. This will provide an overview of the joint value proposition and an opportunity to go hands-on and see exactly how to configure SQL Server 2022 to scale with external tables. Registration will be open shortly.

In the interim, please don’t hesitate to reach out on hello@min.io or join us on our Slack channel for any questions that you might have.