Supercharge TileDB Engine with MinIO
MinIO makes a powerful primary backend for TileDB because both are built for performance and scale. MinIO is a single Go binary that can be launched in many different cloud and on-prem environments. It is lightweight yet feature-packed, with replication, encryption, and integrations with a wide range of applications. It is also fast: we’ve benchmarked it at 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs, and it is used to build data lakes and lakehouses for analytics and AI/ML workloads.
TileDB is used to store data in a variety of applications, such as Genomics, Geospatial, Biomedical Imaging, Finance, Machine Learning, and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract away the pains of data storage and management, while efficiently accessing the data with your favorite programming language or data science tool via its numerous APIs and integrations.
Set Up TileDB
Let’s dive in and create some test data using TileDB.
Install the TileDB pip module, which should also install the numpy dependency.
% pip3 install tiledb
Collecting tiledb
  Downloading tiledb-0.25.0-cp311-cp311-macosx_11_0_arm64.whl (10.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 2.7 MB/s eta 0:00:00
Collecting packaging
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.0/53.0 kB 643.1 kB/s eta 0:00:00
Collecting numpy>=1.23.2
  Downloading numpy-1.26.3-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 2.5 MB/s eta 0:00:00
Installing collected packages: packaging, numpy, tiledb
Successfully installed numpy-1.26.3 packaging-23.2 tiledb-0.25.0
Create a test array by running the Python script below; name it tiledb-demo.py.
import tiledb
import numpy as np
import os, shutil

# Local path
array_local = os.path.expanduser("./tiledb_demo")

# Create a simple 1D array
tiledb.from_numpy(array_local, np.array([1.0, 2.0, 3.0]))

# Read the array
with tiledb.open(array_local) as A:
    print(A[:])
Run the script
% python3 tiledb-demo.py
[1. 2. 3.]
This will create a directory called tiledb_demo to store the actual data.
% ls -l tiledb_demo/
total 0
drwxr-xr-x  3 aj  staff   96 Jan 31 05:27 __commits
drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __fragment_meta
drwxr-xr-x  3 aj  staff   96 Jan 31 05:27 __fragments
drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __labels
drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __meta
drwxr-xr-x  4 aj  staff  128 Jan 31 05:27 __schema
You can continue using it as is, but it's no bueno if everything is local: if the local disk or node fails, you lose all your data. Let's do something fun, like reading this same data from a MinIO bucket instead.
Migrating Data to MinIO Bucket
We’ll start by pulling the mc image into our Docker environment and then use play.min.io to create the bucket.
Pull mc docker image
% docker pull minio/mc
Test with MinIO Play by listing all the buckets
% docker run minio/mc ls play
[LONG TRUNCATED LIST OF BUCKETS]
Create a bucket to move our local TileDB data to; name it tiledb-demo.
% docker run minio/mc mb play/tiledb-demo
Copy the contents of the tiledb_demo data directory to the MinIO tiledb-demo bucket
% docker run -v $(pwd)/tiledb_demo:/tiledb_demo minio/mc cp --recursive /tiledb_demo play/tiledb-demo
`/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt` -> `play/tiledb-demo/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt`
`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb`
`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb`
`/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108` -> `play/tiledb-demo/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108`
Total: 3.24 KiB, Transferred: 3.24 KiB, Speed: 1.10 KiB/s
List the contents of tiledb-demo to make sure the data has been copied
% docker run minio/mc ls play/tiledb-demo/tiledb_demo
[2024-01-15 14:15:57 UTC]     0B __commits/
[2024-01-15 14:15:57 UTC]     0B __fragments/
[2024-01-15 14:15:57 UTC]     0B __schema/
Note: The MinIO Client (mc), or any S3-compatible client, only copies non-empty folders. In the object storage world, data is organized by bucket prefixes, so empty folders are not needed. That is why you see only these 3 folders and not the rest that we saw in the local folder. In a future blog we’ll dive deeper into how data is organized with prefixes and folders.
Now let’s try to read the same data directly from the MinIO bucket using the Python code below; name the file tiledb-minio-demo.py.
import tiledb
import numpy as np

# MinIO keys
minio_key = "minioadmin"
minio_secret = "minioadmin"

# The configuration object with MinIO keys
config = tiledb.Config()
config["vfs.s3.aws_access_key_id"] = minio_key
config["vfs.s3.aws_secret_access_key"] = minio_secret
config["vfs.s3.scheme"] = "https"
config["vfs.s3.region"] = ""
config["vfs.s3.endpoint_override"] = "play.min.io:9000"
config["vfs.s3.use_virtual_addressing"] = "false"

# Create TileDB config context
ctx = tiledb.Ctx(config)

# The MinIO bucket URI path of tiledb demo
array_minio = "s3://tiledb-demo/tiledb_demo/"

# Reuse the context created above instead of building a second one
with tiledb.open(array_minio, ctx=ctx) as A:
    print(A[:])
The output should look familiar
% python3 tiledb-minio-demo.py
[1. 2. 3.]
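The scheme and the endpoint are easy to get out of sync when they are set by hand. The sketch below derives both from a single endpoint URL and returns the same `vfs.s3.*` settings used in the script above as a plain dictionary. The helper name `minio_vfs_settings` is hypothetical, not part of TileDB or MinIO, and it uses only the standard library.

```python
from urllib.parse import urlsplit


def minio_vfs_settings(endpoint_url, access_key, secret_key):
    """Translate a MinIO endpoint URL into TileDB vfs.s3.* settings."""
    parts = urlsplit(endpoint_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(
            f"endpoint must look like http(s)://host:port, got {endpoint_url!r}"
        )
    return {
        "vfs.s3.aws_access_key_id": access_key,
        "vfs.s3.aws_secret_access_key": secret_key,
        "vfs.s3.scheme": parts.scheme,             # "http" or "https"
        "vfs.s3.region": "",
        "vfs.s3.endpoint_override": parts.netloc,  # host:port, without the scheme
        "vfs.s3.use_virtual_addressing": "false",
    }


settings = minio_vfs_settings("https://play.min.io:9000", "minioadmin", "minioadmin")
print(settings["vfs.s3.scheme"], settings["vfs.s3.endpoint_override"])
# https play.min.io:9000
```

The dictionary can then be passed to `tiledb.Config(settings)` in place of the line-by-line assignments.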
We've read from MinIO. Next, let's see how to write data directly to a MinIO bucket instead of copying it there from an existing source.
Writing Directly to the MinIO Bucket
So far we’ve shown you how to read data that already exists, either in local storage or an existing bucket. But if you wanted to start fresh by writing directly to MinIO from the get-go, how would that work? Let’s take a look.
The code to write data directly to the MinIO bucket is the same as above, with two changes.
The path to the MinIO bucket where TileDB data is stored must be updated to tiledb_minio_demo (instead of tiledb_demo).
We’ll use the tiledb.from_numpy function, as we did earlier with local storage, to create the array to store in the MinIO bucket.
[TRUNCATED]

# The MinIO bucket URI path of tiledb demo
array_minio = "s3://tiledb-demo/tiledb_minio_demo/"

tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))

[TRUNCATED]
After making these 2 changes, run the script and you should see the output below
% python3 tiledb-minio-demo.py
[1. 2. 3.]
If you run the script again it will fail with the below error because it will try to write again.
tiledb.cc.TileDBError: [TileDB::StorageManager] Error: Cannot create array; Array 's3://tiledb-demo/tiledb_minio_demo/' already exists
Just comment out the following line and you can re-run it multiple times.
# tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))
% python3 tiledb-minio-demo.py
[1. 2. 3.]
% python3 tiledb-minio-demo.py
[1. 2. 3.]
Check the MinIO Play bucket to make sure the data is in there as expected
% docker run minio/mc ls play/tiledb-demo/tiledb_minio_demo/
[2024-01-15 16:45:04 UTC]     0B __commits/
[2024-01-15 16:45:04 UTC]     0B __fragments/
[2024-01-15 16:45:04 UTC]     0B __schema/
There you go, getting data into MinIO is that simple. Did you get the same results as earlier? You should have, but if you didn't there are a few things you can check out.
Common Pitfalls
We’ll look at some common errors you might encounter while trying to read/write to MinIO.
If your access key and secret key are incorrect, you should expect to see an error message like below
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://tiledb-demo/tiledb_minio_demo/__schema/'... The request signature we calculated does not match the signature you provided. Check your key and signing method.
Next, you need to ensure the hostname and port are correct. Without a proper endpoint, these are the errors you would encounter
Incorrect Hostname:
tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't resolve host name
Incorrect Port:
tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't connect to server
Last but not least, one of the most cryptic errors I’ve seen is the following
tiledb.cc.TileDBError: [TileDB::S3] Error: … [HTTP Response Code: -1] [Remote IP: 98.44.32.5] : curlCode: 56, Failure when receiving data from the peer
After a ton of debugging, it turned out that this error appears when you connect using http but the MinIO server has TLS activated. Just be sure the connection scheme matches the server's configuration; in this case, config["vfs.s3.scheme"] = "https".
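The pitfalls above can be condensed into a small lookup that turns an error message into a hint. This is only a sketch: the substrings are copied from the messages shown in this post, and both `HINTS` and `diagnose` are names invented for the example, not part of TileDB.

```python
# Substrings taken from the error messages above, paired with the likely cause.
HINTS = [
    ("request signature we calculated does not match",
     "check the access key and secret key"),
    ("Couldn't resolve host name",
     "check the hostname in vfs.s3.endpoint_override"),
    ("Couldn't connect to server",
     "check the port in vfs.s3.endpoint_override"),
    ("curlCode: 56",
     "check that vfs.s3.scheme (http vs. https) matches the server"),
]


def diagnose(message):
    """Return a hint for a known error message, or a generic fallback."""
    for needle, hint in HINTS:
        if needle in message:
            return hint
    return "unrecognized error; read the full message"


print(diagnose("[TileDB::S3] Error: ... Couldn't resolve host name"))
# check the hostname in vfs.s3.endpoint_override
```

In a script, you could wrap the `tiledb.open` call in `try`/`except tiledb.TileDBError as e` and print `diagnose(str(e))` before re-raising.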
Racks on Racks on Racks
There is a rap song (you can search for it) where they rap about having stacks on stacks on stacks of *cough* cash. But there is another rap song where they claim they have so many stacks of cash that they can’t be called “stacks” anymore, they are now “racks”. Essentially when your stacks get so big and so high you need racks on racks on racks to store your stacks of cash.
This is an apt comparison because your stacks of data mean as much (or more) to you as the stacks of cash they're rapping about. If only there were something like MinIO to keep all your objects, physical or virtual, safe and readily accessible.
With MinIO in the mix, you can easily scale TileDB to multiple racks across multiple datacenters with relative ease. You also get all the features that make MinIO great like Security and Access Control, Tiering, Object Locking and Retention, Key Encryption Service (KES), among others right out of the box. By having all your data in MinIO, you decrease required storage complexity and therefore realize considerable savings on data storage costs, while at the same time running MinIO on commodity hardware provides the best possible performance-to-cost ratio. MinIO supercharges your TileDB engine with industry-leading performance that makes querying a joy.
We’ve added the code snippets used in this blog to a git repository. If you have any questions on how to connect MinIO to TileDB or migrate data into MinIO be sure to reach out to us on Slack!