LanceDB: Your Trusted Steed in the Joust Against Data Complexity

Built on Lance, an open-source columnar data format, LanceDB has several features that make it attractive for AI/ML. For example, LanceDB supports explicit and implicit vectorization and can handle various data types. LanceDB integrates with leading ML frameworks such as PyTorch and TensorFlow. Cooler still is LanceDB’s fast nearest-neighbor search, which retrieves similar vectors efficiently using approximate nearest neighbor algorithms. All of these combine to create a vector database that is fast, easy to use, and so lightweight it can be deployed anywhere.

LanceDB is capable of querying data in S3-compatible object storage. This combination is optimal for building high-performance, scalable, and cloud-native ML data storage and retrieval systems. MinIO brings performance and unparalleled flexibility across diverse hardware, locations, and cloud environments to the equation, making it the natural choice for such deployments.

Upon completion of this tutorial, you will be prepared to use LanceDB and MinIO to joust with any data challenge. 

What is Lance?

The Lance file format is a columnar data format optimized for ML workflows and datasets. It is designed to be easy and fast to version, query, and use for training, and is suitable for various data types, including images, videos, 3D point clouds, audio, and tabular data. It also supports high-performance random access: Lance's own benchmarks report random-access queries up to 100 times faster than Parquet. Lance’s speed comes in part from its Rust implementation and in part from its cloud-native design, which includes features like zero-copy versioning and optimized vector operations.

One of its key features is vector search, which lets users find nearest neighbors in under 1 millisecond and combine OLAP queries with vector search. Production applications of the Lance format include edge-deployed, low-latency vector databases for ML applications; large-scale storage, retrieval, and processing of multi-modal data at self-driving car companies; and personalized vector search at billion-plus scale at e-commerce companies. Part of the appeal of the Lance file format is its compatibility with popular tools and platforms, such as Pandas, DuckDB, Polars, and PyArrow. Even if you don’t use LanceDB, you can still leverage the Lance file format in your data stack.
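To make the idea of combining a filter with vector search concrete, here is a minimal brute-force sketch in plain Python. It is not how Lance implements search (Lance uses approximate nearest neighbor indexes); the function names, the sample rows, and the price threshold are all hypothetical and exist only for illustration.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_nearest(rows, query, k, max_price):
    """Keep rows that pass the price filter, then return the k closest to query."""
    candidates = (r for r in rows if r["price"] <= max_price)
    return heapq.nsmallest(k, candidates, key=lambda r: squared_l2(r["vector"], query))


# Hypothetical sample rows, shaped like the records used later in this tutorial
rows = [
    {"vector": [1.0, 1.0], "item": "a", "price": 10.0},
    {"vector": [5.0, 5.5], "item": "b", "price": 80.0},
    {"vector": [4.0, 5.0], "item": "c", "price": 20.0},
    {"vector": [9.0, 9.5], "item": "d", "price": 15.0},
]

top = filtered_nearest(rows, query=[5.0, 5.0], k=2, max_price=50.0)
print([r["item"] for r in top])  # item "b" is closest but is excluded by the filter
```

A brute-force scan like this touches every row, which is why a real vector database relies on an index once the data grows.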

Built for AI and Machine Learning

Vector databases like LanceDB offer distinct advantages for AI and machine learning applications: they decouple storage from compute, and they retrieve high-dimensional vector representations of data efficiently. Here are some key use cases:

Natural Language Processing (NLP):

Semantic Search: Find documents or passages similar to a query based on meaning, not just keywords. This powers chatbot responses, personalized content recommendations, and knowledge retrieval systems.

Question Answering: Understand and answer complex questions by finding relevant text passages based on semantic similarity.

Topic Modeling: Discover latent topics in large text collections, useful for document clustering and trend analysis.

Computer Vision:

Image and Video Retrieval: Search for similar images or videos based on visual content, crucial for content-based image retrieval, product search, and video analysis.

Object Detection and Classification: Improve the accuracy of object detection and classification models by efficiently retrieving similar training data.

Video Recommendation: Recommend similar videos based on the visual content of previously watched videos.

Among the plethora of vector databases on the market, LanceDB is particularly well suited for AI and machine learning because it supports querying data on S3-compatible storage. Your data is everywhere; your database should be everywhere too.

Architecting for Success

Using MinIO with LanceDB offers several benefits, including:

  • Scalability and Performance: MinIO’s cloud-native design is built for scale and high-performance storage and retrieval. By leveraging MinIO's scalability and performance, LanceDB can efficiently handle large amounts of data, making it well-suited for modern ML workloads.
  • High Availability and Fault Tolerance: MinIO is highly available, immutable, and highly durable. This ensures that data stored in MinIO is protected against hardware failures and provides high availability and fault tolerance, which are crucial for data-intensive applications like LanceDB.
  • Active-active replication: Multi-site, active-active replication enables near-synchronous replication of data between multiple MinIO deployments. This robust process ensures high durability and redundancy, making it ideal for shielding data in mission-critical production environments.

The combination of MinIO and LanceDB provides a high-performance, scalable, cloud-native solution for managing and analyzing large-scale ML datasets.

Requirements

To follow along with this tutorial, you will need to use Docker Compose. You can install the Docker Engine and Docker Compose binaries separately or together using Docker Desktop. The simplest option is to install Docker Desktop.

Ensure that Docker Compose is installed by running the following command:

docker compose version

You will also need to install Python. You can download Python from here. During installation, make sure to check the option to add Python to your system's PATH.

Optionally, create a virtual environment. It's good practice to use one to isolate the project's dependencies. To do so, open a terminal and run:

python -m venv venv

To activate the virtual environment:

On Windows:

.\venv\Scripts\activate

On macOS/Linux:

source venv/bin/activate
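If you want to confirm that the virtual environment is active before installing packages, a quick check from Python is one option. This is a small sketch using only the standard library; the helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory,
    while sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```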

Getting Started

Begin by cloning the project from here. Once done, navigate to the folder where you downloaded the files in a terminal window and run:

docker compose up minio

This will start up the MinIO container. You can navigate to ‘http://172.20.0.2:9001’ to take a look at the MinIO console. 

Log in with the username and password minioadmin:minioadmin.

Next, run the following command to create a MinIO bucket called lance:

docker compose up mc

This command performs a series of MinIO Client (mc) commands within a shell. 

Here's a breakdown of each command:

until (/usr/bin/mc config host add minio http://minio:9000 minioadmin minioadmin) do echo '...waiting...' && sleep 1; done;: This command repeatedly attempts to configure a MinIO host named minio with the specified parameters (endpoint, access key, and secret key) until successful. During each attempt, it echoes a waiting message and pauses for 1 second.

/usr/bin/mc rm -r --force minio/lance;: This command forcefully removes (deletes) all contents within the lance bucket in MinIO.

/usr/bin/mc mb minio/lance;: This command creates a new bucket named lance in MinIO.

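/usr/bin/mc policy set public minio/lance;: This command sets the policy of the lance bucket to public, allowing anonymous read and write access. That is convenient for a local tutorial but is not a setting to carry into production.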

exit 0;: This command ensures that the script exits with a status code of 0, indicating successful execution.
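The until loop in the first command is a plain retry pattern: try, and if the attempt fails, print a message, sleep, and try again. The same idea can be sketched in Python as follows. The helper wait_until and the stand-in flaky_connect function are hypothetical, and the optional max_attempts cap is an addition that the shell loop does not have.

```python
import time


def wait_until(action, delay_seconds=1.0, max_attempts=None):
    """Call action() until it stops raising; return the number of attempts used."""
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return attempts
        except Exception:
            # Give up only if a cap was requested (the shell loop retries forever)
            if max_attempts is not None and attempts >= max_attempts:
                raise
            print("...waiting...")
            time.sleep(delay_seconds)


# Stand-in for "MinIO is not ready yet": fails twice, then succeeds
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("MinIO not ready yet")


attempts = wait_until(flaky_connect, delay_seconds=0.01)
print("succeeded after", attempts, "attempts")
```

Retrying like this matters because the mc container can start before the MinIO server is ready to accept connections.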

LanceDB

Unfortunately, LanceDB does not have native S3 support, and as a result, you will have to use something like boto3 to connect to the MinIO container you made. As LanceDB matures we look forward to native S3 support that will make the user experience all the better.

The sample script below will get you started.

Install the required packages using pip. Create a file named requirements.txt with the following content:

lancedb~=0.4.1
boto3~=1.34.9
botocore~=1.34.9

Then run the following command to install the packages:

pip install -r requirements.txt

You will need to change your credentials if your method of creating the MinIO container differs from the one outlined above.

Save the below script to a file, e.g., lancedb_script.py.

import lancedb
import os
import boto3
import botocore
import random

def generate_random_data(num_records):
    data = []
    for _ in range(num_records):
        record = {
            "vector": [random.uniform(0, 10), random.uniform(0, 10)],
            "item": f"item_{random.randint(1, 100)}",
            "price": round(random.uniform(5, 100), 2)
        }
        data.append(record)
    return data

def main():
    # Set credentials and region as environment variables
    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
    os.environ["AWS_ENDPOINT"] = "http://localhost:9000"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

    minio_bucket_name = "lance"

    # Create a boto3 S3 client configured for path-style addressing.
    # Note: this client is not passed to LanceDB below.
    session = boto3.Session()
    s3_client = session.client("s3", config=botocore.config.Config(s3={'addressing_style': 'path'}))

    # Connect to LanceDB with an s3:// URI; the credentials and endpoint
    # come from the environment variables set above
    db_uri = f"s3://{minio_bucket_name}/"
    db = lancedb.connect(db_uri)

    # Create a table with more interesting data
    table = db.create_table("mytable", data=generate_random_data(100))

    # Open the table and perform a search
    result = table.search([5, 5]).limit(5).to_pandas()
    print(result)

if __name__ == "__main__":
    main()

This script creates a Lance table from randomly generated data and writes it to your MinIO bucket. Again, if you didn't use the method in the previous section to create a bucket, you will need to create one before running the script. Remember to change the bucket name in the script to match the name of your MinIO bucket.
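If you expect to change the bucket name or credentials often, one option is to read them from the environment instead of editing the script each time. The sketch below uses only the standard library. The helper minio_settings and the LANCE_BUCKET variable are made up for this example; the default values mirror the ones used in the script above.

```python
import os


def minio_settings(env=None):
    """Collect connection settings, falling back to this tutorial's defaults."""
    env = os.environ if env is None else env
    bucket = env.get("LANCE_BUCKET", "lance")  # LANCE_BUCKET is a hypothetical variable
    return {
        "AWS_ACCESS_KEY_ID": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        "AWS_SECRET_ACCESS_KEY": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        "AWS_ENDPOINT": env.get("AWS_ENDPOINT", "http://localhost:9000"),
        "AWS_DEFAULT_REGION": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "db_uri": f"s3://{bucket}/",
    }


# Passing a dict instead of os.environ makes the behavior easy to try out
print(minio_settings({"LANCE_BUCKET": "vectors"})["db_uri"])
```

The returned db_uri is the value you would pass to lancedb.connect in place of the hard-coded one.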

Finally, the script searches the table in place, without moving it out of MinIO, for the five vectors nearest to [5, 5] and prints the results as a Pandas DataFrame.

The result of the script should look similar to the one below. Remember that the data itself is randomly generated each time. 

                   vector      item  price  _distance
0  [5.1022754, 5.1069164]   item_95  50.94   0.021891
1   [4.209107, 5.2760105]  item_100  69.34   0.701694
2     [5.23562, 4.102992]   item_96  99.86   0.860140
3   [5.7922664, 5.867489]   item_47  56.25   1.380223
4    [4.458882, 3.934825]   item_93   9.90   1.427407
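The _distance values in this sample output are consistent with the squared Euclidean distance between each vector and the query [5, 5]. The short check below recomputes them in plain Python. It is based only on the numbers printed above, so treat it as an observation about this output rather than a statement about how LanceDB is configured in every case.

```python
query = [5.0, 5.0]

# (vector, reported _distance) pairs copied from the sample output above
rows = [
    ([5.1022754, 5.1069164], 0.021891),
    ([4.209107, 5.2760105], 0.701694),
    ([5.23562, 4.102992], 0.860140),
    ([5.7922664, 5.867489], 1.380223),
    ([4.458882, 3.934825], 1.427407),
]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


for vector, reported in rows:
    computed = squared_l2(vector, query)
    print(f"computed={computed:.6f}  reported={reported:.6f}")
```

Because the rows are returned in ascending order of _distance, the first row is the closest match to the query.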

Expand on your Own

There are many ways to build on the foundation offered in this tutorial to create performant, scalable, and future-proofed ML/AI architectures. You have two cutting-edge, open-source building blocks in your arsenal: MinIO object storage and the LanceDB vector database. Consider this your winning ticket to the ML/AI tournament.

Don’t stop here. LanceDB offers a wide range of recipes and tutorials to expand on what you’ve built in this tutorial, including a recently announced Udacity course on Building Generative AI Solutions with Vector Databases. Of particular interest is this recipe to chat with your documents. We are all for breaking down barriers to getting the most from your data.

Please show us what you’re building, and should you need guidance on your noble quest, don’t hesitate to email us at hello@minio.io or join our round table on Slack.
