AIStor Tables: Technical Deep Dive

In our previous blog, we explained why Apache Iceberg is central to enterprise AI. Now we are introducing AIStor Tables, a new way to eliminate the external catalog service bottleneck in Iceberg deployments.
AIStor Tables is a native implementation of the Iceberg Catalog REST API, embedded directly into MinIO AIStor.
In this post, we will explore the Warehouse→Namespace→Table hierarchy that makes data organization intuitive, show how AI teams can make unstructured data discoverable through structured tables, and walk through practical steps for migrating existing deployments.
Warehouses, Namespaces and Tables
Our approach uses a Warehouse → Namespace → Table hierarchy that mirrors how data teams naturally think about organizing information. This three-tier structure directly maps to the database/schema/table mental model that most data professionals already understand.
At the top level, a warehouse serves as the highest organizational unit within AIStor. Think of it as equivalent to a database server or data warehouse instance. Within each warehouse, namespaces function like database schemas, providing logical separation for different projects, environments, or data domains. Finally, tables within each namespace contain the actual data.

This hierarchy eliminates the confusion that often arises with traditional object storage, where teams must mentally translate between bucket names, prefix paths, and their actual data organization needs. Instead of managing objects at paths like s3://my-bucket/analytics/customer/tables/dim_customer/, teams work with intuitive references like analytics_warehouse.customer.dim_customer.
This hierarchy also drives access control: security policies can be applied at any level, down to individual API actions. Grant access to an entire warehouse for broad permissions, lock down specific namespaces for sensitive data, or provide table-level access for fine-grained control.
For example, this policy grants read-only access to all namespaces and tables within a specific warehouse:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3TablesReadOnlyWarehouse",
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTableBucket",
        "s3tables:ListNamespaces",
        "s3tables:GetNamespace",
        "s3tables:ListTables",
        "s3tables:GetTable"
      ],
      "Resource": [
        "arn:aws:s3tables:::bucket/my-warehouse",
        "arn:aws:s3tables:::bucket/my-warehouse/table/*"
      ]
    }
  ]
}
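The same mechanism scales down to a single table. As a sketch (assuming, as with the wildcard above, that tables are addressable by name in the ARN path), a statement granting read access to just the dim_customer table might look like:
{
  "Sid": "S3TablesReadSingleTable",
  "Effect": "Allow",
  "Action": [
    "s3tables:GetTable"
  ],
  "Resource": [
    "arn:aws:s3tables:::bucket/my-warehouse/table/dim_customer"
  ]
}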
In addition to hierarchical permissions, AIStor Tables also introduces table-aware protections that understand the logical structure of your data. Direct object manipulation that could corrupt table metadata is restricted. Users cannot accidentally delete manifest files or create objects that would interfere with table operations. The storage layer actively maintains table integrity while still providing Iceberg REST API compatibility for legitimate operations.
Real-World Scenario: Setting Up Analytics Infrastructure

Let's see how AI development teams building agents can use AIStor Tables to make all their data discoverable by storing pointers to unstructured assets within structured Iceberg tables. Consider AI agents that need access to training videos, product images, and documentation stored across various AIStor buckets. Rather than having agents search through object storage directly, the team uses Iceberg tables as a unified catalog that points to all their unstructured assets.
The team creates their workspace via the native REST catalog API:
import json
import requests

# CATALOG_URL and signed_headers are assumed to be configured elsewhere.
create_payload = {"name": "ai_platform"}
warehouse_url = f"{CATALOG_URL}/v1/warehouses"
response = requests.post(warehouse_url, data=json.dumps(create_payload), headers=signed_headers)
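A follow-up GET against the same endpoint lists existing warehouses, which makes for a convenient sanity check (a minimal sketch; the exact response shape is an assumption and may vary across releases):
# List warehouses to confirm creation; the response schema is an assumption here.
listing = requests.get(warehouse_url, headers=signed_headers)
listing.raise_for_status()
print(listing.json())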
Next, they configure Spark:
config = {
    "spark.sql.catalog.aistor": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.aistor.type": "rest",
    "spark.sql.catalog.aistor.uri": "http://aistor:9000/_iceberg",
    "spark.sql.catalog.aistor.warehouse": "ai_platform",
    # Make "aistor" the default catalog so unqualified table names resolve to it
    "spark.sql.defaultCatalog": "aistor",
    # S3 credentials and endpoint configuration
}
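Wiring the configuration into a session is straightforward. A minimal sketch, assuming pyspark and the Iceberg Spark runtime jar are on the classpath:
from pyspark.sql import SparkSession

# Apply each catalog setting when building the session.
builder = SparkSession.builder.appName("aistor-tables-demo")
for key, value in config.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()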
Now they create a namespace and, within it, a table that serves as their asset catalog:
spark.sql("CREATE NAMESPACE IF NOT EXISTS asset_catalog")
spark.sql("""
CREATE TABLE asset_catalog.media_assets (
asset_id STRING,
asset_type STRING,
file_path STRING,
file_size_bytes BIGINT,
created_timestamp TIMESTAMP,
processing_status STRING,
tags ARRAY<STRING>,
metadata MAP<STRING, STRING>
) USING iceberg
""")
The key innovation: they store pointers to assets located in their general AIStor buckets:
spark.sql("""
INSERT INTO asset_catalog.media_assets VALUES
('img_001', 'image', 's3://training-data-bucket/product_images/camera_001.jpg', 2048576, '2024-03-15T10:30:00', 'processed', array('product', 'camera'), map('resolution', '1920x1080', 'format', 'JPEG')),
('vid_001', 'video', 's3://training-data-bucket/demo_videos/demo_001.mp4', 104857600, '2024-03-15T11:45:00', 'ready', array('training', 'demo'), map('duration_sec', '120', 'codec', 'H.264')),
('doc_001', 'document', 's3://knowledge-base-bucket/manuals/user_guide.pdf', 1048576, '2024-03-15T12:15:00', 'indexed', array('documentation', 'user'), map('pages', '45', 'format', 'PDF'))
""")
Now AI agents can discover all available assets through simple queries, then access the actual files using the returned paths:
# Agent discovers all training videos with their locations
training_assets = spark.sql("""
    SELECT asset_id, file_path, metadata['duration_sec'] as duration
    FROM asset_catalog.media_assets
    WHERE asset_type = 'video'
      AND array_contains(tags, 'training')
      AND processing_status = 'ready'
""").toPandas()

print("Training videos available for AI agents:")
print(training_assets)

# Agent finds all product images by resolution for computer vision tasks
product_images = spark.sql("""
    SELECT file_path, metadata['resolution'] as resolution, file_size_bytes
    FROM asset_catalog.media_assets
    WHERE asset_type = 'image'
      AND array_contains(tags, 'product')
      AND metadata['format'] = 'JPEG'
""").toPandas()

print("Product images with file locations:")
print(product_images)
This approach transforms scattered unstructured data across multiple buckets into a queryable, discoverable catalog. AI agents no longer need to know specific bucket names or navigate complex folder structures. Instead, they query the Iceberg table for assets matching their criteria and receive direct file paths they can immediately access for processing.
The structured table acts as a bridge between AI agents and unstructured data, making everything discoverable through familiar SQL interfaces.
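To close the loop, an agent can fetch the underlying objects with any S3 client. Here is a minimal sketch using boto3; the endpoint URL and ambient credentials are assumptions for illustration:
import boto3
from urllib.parse import urlparse

# Assumed AIStor S3 endpoint; credentials are taken from the environment.
s3 = boto3.client("s3", endpoint_url="http://aistor:9000")

for path in training_assets["file_path"]:
    parsed = urlparse(path)  # s3://bucket/key -> netloc is the bucket name
    obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    data = obj["Body"].read()  # raw bytes, ready for downstream processing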
Migration Path
Moving existing Iceberg implementations to AIStor Tables requires a systematic approach to metadata migration, but the process is straightforward for most deployments.

Metadata Discovery and Recreation
The migration process begins by inventorying your current catalog structure. Using standard Iceberg REST catalog APIs, you'll extract the complete hierarchy of namespaces and tables from your existing system. This includes not just table definitions, but also schema history, partition specifications, and snapshot metadata that defines your table evolution over time.
A migration script connects to your current external catalog service to list all namespaces, then iterates through each namespace to catalog all tables and their complete metadata. This information gets used to recreate the identical structure within AIStor warehouses and namespaces.
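As a minimal sketch of the discovery step, using standard Iceberg REST catalog endpoints (the source URL and auth header are assumptions, and single-level namespaces are assumed for brevity):
import requests

# Assumed source catalog URL and token; any Iceberg REST catalog exposes these routes.
SOURCE = "http://legacy-catalog:8181/v1"
HEADERS = {"Authorization": "Bearer <token>"}

inventory = {}
for ns in requests.get(f"{SOURCE}/namespaces", headers=HEADERS).json()["namespaces"]:
    ns_name = ns[0]  # namespaces arrive as string arrays; assume a single level
    tables = requests.get(f"{SOURCE}/namespaces/{ns_name}/tables", headers=HEADERS).json()
    inventory[ns_name] = [
        # LoadTable returns the full metadata: schema history, partition specs, snapshots.
        requests.get(f"{SOURCE}/namespaces/{ns_name}/tables/{t['name']}", headers=HEADERS).json()
        for t in tables["identifiers"]
    ]
The resulting inventory can then be replayed against the AIStor endpoint to recreate the same structure.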
Flexible Deployment Options
AIStor Tables supports multiple migration strategies depending on your operational requirements. For organizations that prefer a gradual transition, AIStor continues to support traditional Iceberg table storage without AIStor Tables. This means you can use AIStor as your object storage layer while still relying on an external catalog service during the migration period, then switch to AIStor Tables when ready.
Ready to get started?
AIStor Tables brings the Iceberg Catalog API natively into object storage, eliminating the need to manage external catalog services. The result is unified data access where AI agents can discover everything from structured analytics data to unstructured training assets through familiar table interfaces.
With straightforward migration paths and flexible deployment options, teams can transition gradually while maintaining existing workflows.
This feature is currently available in tech preview. If you’d like to discuss details or explore your use case, submit the form below to connect directly with one of our engineers.