Data Science and AI with a SQL Server 2022 Data Lakehouse

Matt Sarrel Matt Sarrel Matt Sarrel @msarrel on SQL |
Data Science and AI with a SQL Server 2022 Data Lakehouse

Microsoft SQL Server 2022 is one of the most commonly implemented enterprise relational databases. Many of the world's most successful companies, regardless of vertical, have significant SQL Server deployments. Thousands of companies have relied on SQL Server for decades.

Microsoft has made great strides over the past decade in embracing open-source and standards-compliant technologies. The result is that instead of spending developer cycles building features that lock in users, the company devotes serious resources to developing new cloud-native software and enhancing existing software with cloud-native features.

One such feature is the ability in SQL Server 2022 to query data that resides on object storage. This is accomplished via an S3 Connector and Polybase data virtualization. This is a tremendous enabler of AI. Enterprises are now free to model the vast amounts of data they have stored in object storage with SQL Server 2022. Data can live anywhere MinIO runs and be available for AI – public cloud, private cloud, on-premise, at the edge.

AI requires massive amounts of data to train. Large Language Models (LLMs) require the equivalent of an entire library to train and this is pulled from various sources, such as internal corporate databases, public databases, books, online articles, email, and more, and all this data resides in MinIO. The more data you feed a model, the more valuable it is. AI requires underlying software that is scalable, available, secure, performant, resilient and simple. 

MinIO is the best home for the data lakehouse to fuel your data science and AI workloads, because it runs everywhere, and that's a great thing when enterprises have data that lives everywhere. Data saved to MinIO is durable and erasure-coded, while leading S3 API compatibility means that every application runs as it was intended to. As the fastest object storage on the planet, there is no better way to build an AI data lakehouse than with MinIO. 

This blog post will show you how to store data in MinIO and query it using SQL Server 2022 for use in AI programs. It is based on Microsoft's SQL Server 2022 Workshop

Create Self-Signed Certificate 

You will need a certificate (private/public keys) so the OS running SQL Server 2022 and MinIO Server can trust each other. Let's generate it now. 

Use openssl for Linux or Windows to create keys. 

These instructions are for Linux. Please see the documentation for Windows.  Use the following to generate an openssl.cnf file that will be used when generating the certificate. 

cat <<EOF | cat > openssl.cnf
[req]
distinguished_name = s3.example.com
x509_extensions = v3_req
prompt = no

[s3.example.com]
C = US
ST = CA
L = Monterey
O = IT
OU = DBATeam
CN = s3.example.com

[v3_req]
subjectAltName = @alt_names

[alt_names]
IP.1 = 127.0.0.1
DNS.1 = localhost
IP.2 = your-minio-ip-address
DNS.2 = s3.example.com
EOF

Use openssl to generate the certificate using the configuration we created above.

openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout private.key -out public.crt -config openssl.cnf

Configure MinIO to Use Certificate

We need to configure MinIO to use the key and certificate that we just created. Copy the key and the certificate to add trust for these certificates to MinIO Server by placing these certificates under one of the following MinIO configuration paths:

  • Linux: ~/.minio/certs/CAs/

Make sure to restart the MinIO server after copying the certificates. 

Configure Self-Signed Certificate

We need to tell the OS that SQL Server 2022 is running on to trust the certificate that we generated for MinIO. If you're on Linux, copy the certificate into the known CA directory, set permissions on the files and then restart.

mkdir /usr/local/share/ca-certificates/mssql-ca-certificates/
cp public.crt /usr/local/share/ca-certificates/mssql-ca-certificates/
chown 10001:10001 -R /usr/local/share/ca-certificates/mssql-ca-certificates/

In Windows, double-click the public.crt file and select Install Certificate. Then choose Local Machine, Place all certificates in the following store, Browse and select Trusted Root Certificate Authorities.

Deploy SQL Server 2022

If you have SQL Server 2022 deployed, then feel free to use that. 

If not, please install one of the following, making sure to install Polybase as well:

You will also need a way to interact with SQL Server:

Make sure that SQL Server is running and connect to it using either of the tools mentioned above.

To make sure everything is installed properly, run the following query to obtain the SQL Server edition, version and product level, as shown below. 

SELECT  SERVERPROPERTY('Edition') AS SQLEdition,
        SERVERPROPERTY('ProductVersion') AS ProductVersion,
        SERVERPROPERTY('ProductLevel') AS ProductLevel;

Install and Configure MinIO Server

If you’re not already running it, please install MinIO on bare metal or Kubernetes.

MinIO Client (mc) is required to access the MinIO Server. Here’s how to install mc locally.

Make note of the credentials and S3 endpoint displayed the first time you run MinIO Server because you will need them to configure SQL Server for the external data source.

Log into MinIO using a web browser with the administrator credentials, IP address and port provided in the previous step. From the left menu, click Access Keys and create a new key. It is a best practice to provide access to services using Access Keys that are restricted to the bare minimum privileges that the service requires. In this case, Polybase requires ListBucket, ReadOnly and WriteOnly.

Make a note of the Access Key and Secret Key because you will need them to configure the data source in SQL Server.

From the left menu, select Buckets and then Create Bucket. Name the bucket externaltables and leave the default configurations and click Create.

Import Data

Return to SQL Server. We're going to use the WideWorldImporters (WWI) sample database from Microsoft. WWI is a hypothetical wholesale novelty goods importer and distributor headquartered in the San Francisco Bay Area. WWI sells to retail customers and wholesalers. While all of WWI's customers are currently based in the United States, the company is looking to expand internationally. The company stocks novelties in its warehouse as well as large volumes of packing materials. Recently, WWI started selling chocolates that must be kept at a safe temperature, so they monitor the temperature in the chiller room and on refrigerated trucks using streaming data from IOT devices. For more information, please see Wide World Importers sample databases for Microsoft SQL

Download the sample database to a local directory. The script below uses c:\sql_sample_databases.

USE master;
GO
DROP DATABASE IF EXISTS WideWorldImporters;
GO
RESTORE DATABASE WideWorldImporters FROM DISK = 'c:\sql_sample_databases\WideWorldImporters-Full.bak' with
MOVE 'WWI_Primary' TO 'c:\sql_sample_databases\WideWorldImporters.mdf',
MOVE 'WWI_UserData' TO 'c:\sql_sample_databases\WideWorldImporters_UserData.ndf',
MOVE 'WWI_Log' TO 'c:\sql_sample_databases\WideWorldImporters.ldf',
MOVE 'WWI_InMemory_Data_1' TO 'c:\sql_sample_databases\WideWorldImporters_InMemory_Data_1',
stats=5;
GO
ALTER DATABASE WideWorldImporters SET QUERY_STORE CLEAR ALL;
GO

Run the above script on SQL Server using MSSMS or sqlcmd to load the demo data.

Install and Enable PolyBase

PolyBase enables you to run queries in SQL Server against external data sources. Polybase is typically installed with SQL Server, but if it was not then you must install it. 

Confirm that Polybase is installed, 1 = installed

SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

Next, enable Polybase in your configuration. Even if you had Polybase installed, you will need to enable export to external tables. 

EXEC sp_configure 'polybase enabled', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure 'allow polybase export', 1;
GO
RECONFIGURE;
GO

Configure Access to External Data using Polybase and MinIO

Switch into the database context for the WideWorldImporters database

USE WideWorldImporters;

Create a database master key, this is used to protect the credentials you’re about to create

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0methingS@Str0ng!';

Create a database-scoped credential using the Access Key created earlier. Note that the secret contains both an access key and a secret key.

CREATE DATABASE SCOPED CREDENTIAL s3_wwi_cred
WITH IDENTITY = 'S3 Access Key',
SECRET = '<user>:<password>';
GO

Create an external data source on your s3 compatible object storage, referencing where it is on the network with LOCATION, and the credential you just defined

CREATE EXTERNAL DATA SOURCE s3_wwi
WITH
(
 LOCATION = 's3://<your local IP>:9000'
,CREDENTIAL = s3_wwi_cred
);
GO

If you see the following error, then restart SQL Server and create the external data source again.

Msg 46530, Level 16, State 11, Line 1 External data sources are not supported with type GENERIC.

Create and Query Parquet File

In this step, we will export a query on WWI data from SQL Server and save it as Parquet on MinIO. Parquet, along with ORC and AVRO, is a commonly used format for AI. Parquet is a great file format for moving data around because it is self-describing and columnar.

Create a file format to use for Parquet

USE WideWorldImporters;
GO
IF EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'ParquetFileFormat')
	DROP EXTERNAL FILE FORMAT ParquetFileFormat;
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH (FORMAT_TYPE = PARQUET);
GO

Create a Parquet file into the MinIO storage bucket you created earlier

USE WideWorldImporters;
GO
IF OBJECT_ID('wwi_customer_transactions', 'U') IS NOT NULL
	DROP EXTERNAL TABLE wwi_customer_transactions;
GO
CREATE EXTERNAL TABLE wwi_customer_transactions
WITH (
    LOCATION = '/externaltables/',
    DATA_SOURCE = s3_wwi,  
    FILE_FORMAT = ParquetFileFormat
) 
AS
SELECT * FROM Sales.CustomerTransactions;
GO

Return to the MinIO Console and verify that the Parquet file was created.

You can also create ad-hoc queries for the Parquet file, or any Parquet file.

USE [WideWorldImporters];
GO
SELECT *
FROM OPENROWSET
	(BULK '/wwi/'
	, FORMAT = 'PARQUET'
	, DATA_SOURCE = 's3_wwi')
as [wwi_customer_transactions_file];
GO

What if we wanted to start with the Parquet file and build an external table to work with the data in SQL Server? We could even select the columns we want to include in the new external table.

USE [WideWorldImporters];
GO
IF OBJECT_ID('wwi_customer_transactions_base', 'U') IS NOT NULL
	DROP EXTERNAL TABLE wwi_customer_transactions_base;
GO
CREATE EXTERNAL TABLE wwi_customer_transactions_base 
( 
	CustomerTransactionID int, 
	CustomerID int,
	TransactionTypeID int,
	TransactionDate date,
	TransactionAmount decimal(18,2)
)
WITH 
(
	LOCATION = '/wwi/'
    , FILE_FORMAT = ParquetFileFormat
    , DATA_SOURCE = s3_wwi
);
GO
SELECT * FROM wwi_customer_transactions_base;
GO

Statistics and Metadata

The SQL below will create statistics for a column in an external table. Query optimization statistics for external tables are more limited than for internal tables and indexed views. The Query Optimizer stores statistics for a table or indexed view in a statistics object that includes a header with metadata, a histogram with the distribution of values in the first key column of the statistics table and a density vector to measure cross-column correlation. 

USE [WideWorldImporters];
GO
CREATE STATISTICS wwi_ctb_stats2 ON dbo.wwi_customer_transactions_base (CustomerID) WITH FULLSCAN;
GO

This information is very helpful when trying to manually optimize queries. Use the following SQL to view statistics

USE WideWorldImporters;
DBCC SHOW_STATISTICS ("dbo.wwi_customer_transactions", wwi_ctb_stats2);

Explore metadata about all external tables and data sources with the following SQL

USE [WideWorldImporters];
GO
SELECT * FROM sys.external_data_sources;
GO
SELECT * FROM sys.external_file_formats;
GO
SELECT * FROM sys.external_tables;

Explore the metadata about files from external sources

USE [WideWorldImporters];
GO
SELECT TOP 1 wwi_customer_transactions_file.filepath(), 
wwi_customer_transactions_file.filename()
FROM OPENROWSET
	(BULK '/wwi/'
	, FORMAT = 'PARQUET'
	, DATA_SOURCE = 's3_wwi')
as [wwi_customer_transactions_file];
GO

Explore the Parquet file metadata

USE [WideWorldImporters];
GO
EXEC sp_describe_first_result_set N'
SELECT *
FROM OPENROWSET
	(BULK ''/wwi/''
	, FORMAT = ''PARQUET''
	, DATA_SOURCE = ''s3_wwi'')
as [wwi_customer_transactions_file];';
GO

 

Bringing It All Together

External tables have the most value when they are used in conjunction with internal operational data such as transactions, inventory and customers. We could, for example, compare revenue growth by city to population growth to focus resources or examine trends in refrigerator truck temperature to see if it affects sales of perishable chocolates. These types of analyses would require joining internal and external tables.

Query the external table of transactions, join with an internal table for customer names and sort by total_balance. We want to find out which of our customers has an outstanding balance and then sort by the amount outstanding so we can prioritize outreach. 

USE WideWorldImporters;
GO
SELECT c.CustomerName, SUM(wct.OutstandingBalance) as total_balance
FROM wwi_customer_transactions wct
JOIN Sales.Customers c
ON wct.CustomerID = c.CustomerID
GROUP BY c.CustomerName
ORDER BY total_balance DESC;

The above example is a quick taste of working with external tables in SQL Server 2022. We ran a simple query to demonstrate the value of external tables for analytics. Another common use of external tables is for AI.

No Limits on AI with SQL Server 2022 and MinIO

With the addition of full external table capabilities, SQL Server 2022 is a powerful analytics and machine-learning platform. Simply add the SQL Machine Learning service to your existing installation. You can write code in Python, R or Java to use this service. Or use your machine learning platform of choice to read in SQL Server external tables.

Huge volumes of enterprise data live in SQL Server, and even huger volumes live in MinIO. SQL Server gives you analytics and AI/ML all wrapped up in everyone's favorite SQL.

Turning raw data into insights allows business decision-makers to fully understand every aspect of the business and react to new situations quickly. Advanced analytics and AI/ML play an important role in modern enterprises. Data science distills data assets in nuggets of knowledge that inform and automate the business. Combining SQL Server's SQL Machine Learning with MinIO provides data scientists with the ability to identify credit risks, pinpoint supply chain risks, prevent customer churn, maximize customer lifetime value, reduce downtime for equipment – and so much more.  

Download MinIO today and get started building your AI infrastructure. As always, reach out to us on Slack if you have any questions.