A Guide to Decoupled Storage with StarRocks and MinIO

Decoupled storage and compute is a fundamental architectural principle of the modern data stack. This separation allows enterprises to scale their compute resources and storage capacity independently, optimizing both for cost-efficiency and performance. Starting with version 3.0, StarRocks offers a storage-compute separation architecture in which data is persisted in object storage rather than on compute nodes, so storage and compute can each be scaled on their own.

By entrusting a best-in-class object store with the functions it specializes in and leaving query performance to the expertise of the database, this approach maximizes the strengths of each component. That relationship is clearly realized when MinIO serves as the storage layer for StarRocks in decoupled compute mode: good things happen when you combine high-performance analytics with high-performance object storage.

To be considered truly modern, your data stack must be deployable from a single Kubernetes YAML file, multiple times a day. These two Kubernetes-native, open-source projects meet that requirement, putting a finer point on what it means to be built for the cloud rather than for a specific public cloud.

This article provides a step-by-step tutorial for getting started with StarRocks using MinIO as primary storage. It is based on StarRocks' own tutorial, where more configuration details can be found.

StarRocks Explained

StarRocks is a high-performance, massively parallel processing (MPP) OLAP database that delivers data warehouse query performance on the data lakehouse. It boasts features such as a fully vectorized engine, a redesigned cost-based optimizer, and intelligent materialized views, which together allow it to deliver sub-second response times even for complex multi-table joins. StarRocks supports flat, star, and snowflake schemas, and is compatible with the MySQL protocol. When using decoupled storage, you can keep your data in the StarRocks internal format, Apache Iceberg, Apache Hudi, or Delta Lake within MinIO buckets. You can read more about StarRocks at http://starrocks.io.

Advantages of Decoupled Storage and Compute

To address the potential increase in network overhead caused by decoupling, StarRocks uses a local data cache for frequently accessed data to improve query performance. This architecture provides the following advantages: 

  • Cost Savings: Separating compute and storage allows for independent scaling and optimized resource utilization, which can lead to significant cost savings.
  • Flexibility: Decoupling allows for greater operational flexibility, enabling enterprises to select the tools and services best suited to a particular task.
  • Scalability: Object storage and compute can be scaled independently, leading to better agility and ultimately more precise scaling.
  • Performance: Separating storage and compute allows for the optimization of each component for its specific task, leading to improved performance, faster processing times, and better data throughput.
  • Simplified Management: Decoupling storage and compute can lead to simplified management of the data stack, as it allows for easier movement of workloads and better resource allocation. 
  • Resource Isolation: The decoupled architecture enables resource isolation, ensuring that computational resources are matched to specific workloads.

What You’ll Learn

In this tutorial, you'll learn the essentials of running StarRocks and MinIO in Docker containers. You will configure StarRocks for shared data, load two public datasets, and explore the data with SQL commands.

Prerequisites

You’ll use curl to download the Docker Compose file and the datasets, and to load data into StarRocks. Check whether you have curl installed by typing curl or curl.exe at your command prompt. If you don't have it, get curl here.

You’ll also need Docker Compose. You have the option to install the Docker Engine and Docker Compose binaries either separately or together using Docker Desktop. Installing Docker Desktop is the easiest option.

Verify that you have Docker Compose by executing:

docker compose version

You can connect to StarRocks in this tutorial using the MySQL CLI included in the starrocks-fe container, eliminating the need for additional prerequisites. Alternatively, you can enhance your experience by downloading DBeaver or MySQL Workbench.

Getting Started

Open a terminal and run these commands to create a working directory and download the Docker Compose file:

mkdir sr-quickstart
cd sr-quickstart
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/docker-compose.yml

Instruct Docker Compose to launch the containers specified in the docker-compose.yml file in the background using the following command:

docker compose up -d

MinIO

You can use MinIO through the MinIO Console or with mc.  The screenshots and instructions below are for the MinIO Console. 

Navigate to http://localhost:9001/access-keys. The username and password are defined in the Docker Compose file; by default they are minioadmin:minioadmin.

Click on "Create access key +" and then “Create”.

SQL Client

You’ll need to connect with a SQL client to work with the data. One option is DBeaver. Create a new database connection in DBeaver and select the MySQL database driver.

Configure the following fields:

  • Port: 9030
  • Server Host: localhost
  • Username: root

Test the connection, and click Finish if the test succeeds.

Alternatively, if you don't share my love for DBeaver, or you simply feel more comfortable at the command line, you can use the MySQL CLI in the starrocks-fe container. To connect to StarRocks, go to the directory containing your `docker-compose.yml` and run the following command.

docker compose exec starrocks-fe \
mysql -P9030 -h127.0.0.1 -uroot --prompt="StarRocks > "
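
Once connected, you can run a quick statement to confirm the session is working; this simply lists the databases visible on the fresh deployment.

-- Optional sanity check: list the databases visible to the root user
SHOW DATABASES;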

Create a Bucket

Run the following in your SQL Client, making sure that you use the access key and secret that you previously created in the MinIO Console:

CREATE STORAGE VOLUME shared
TYPE = S3
LOCATIONS = ("s3://starrocks/shared/")
PROPERTIES
(
   "enabled" = "true",
   "aws.s3.endpoint" = "http://minio:9000",
   "aws.s3.use_aws_sdk_default_behavior" = "false",
   "aws.s3.enable_ssl" = "false",
   "aws.s3.use_instance_profile" = "false",
   "aws.s3.access_key" = "{YOUR ACCESS KEY}",
   "aws.s3.secret_key"= "{YOUR SECRET KEY}"
);

SET shared AS DEFAULT STORAGE VOLUME;
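
Before moving on, you can confirm that the volume was registered. A minimal check, assuming a recent StarRocks 3.x release that supports these statements:

-- Inspect the storage volume you just created and list all volumes
DESC STORAGE VOLUME shared;
SHOW STORAGE VOLUMES;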

Navigate to http://localhost:9001/buckets to see that the starrocks bucket has been created successfully.

Download the Data

Run the command below in a terminal window to open a Bash shell within the starrocks-fe container, allowing you to interact with the container's file system and execute commands inside it.

docker compose exec starrocks-fe bash

Run these commands to make a quickstart directory inside the container.

mkdir quickstart
cd quickstart

Run these commands to download the two data sets into the folder you just created.

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csv

Create Tables

Run these commands at the StarRocks > prompt in the terminal window where you connected to StarRocks with your SQL client. 

CREATE DATABASE IF NOT EXISTS quickstart;
USE quickstart;

Your terminal should look something like this:

StarRocks > CREATE DATABASE IF NOT EXISTS quickstart;
Query OK, 0 rows affected (0.02 sec)
StarRocks > USE quickstart;
Database changed
StarRocks > 

Create Tables in SQL Client

Back in DBeaver, or the SQL Client of your choice, execute the following commands to create tables for your data.

USE quickstart;
CREATE TABLE IF NOT EXISTS crashdata (
   CRASH_DATE DATETIME,
   BOROUGH STRING,
   ZIP_CODE STRING,
   LATITUDE INT,
   LONGITUDE INT,
   LOCATION STRING,
   ON_STREET_NAME STRING,
   CROSS_STREET_NAME STRING,
   OFF_STREET_NAME STRING,
   CONTRIBUTING_FACTOR_VEHICLE_1 STRING,
   CONTRIBUTING_FACTOR_VEHICLE_2 STRING,
   COLLISION_ID INT,
   VEHICLE_TYPE_CODE_1 STRING,
   VEHICLE_TYPE_CODE_2 STRING
);
CREATE TABLE IF NOT EXISTS weatherdata (
   DATE DATETIME,
   NAME STRING,
   HourlyDewPointTemperature STRING,
   HourlyDryBulbTemperature STRING,
   HourlyPrecipitation STRING,
   HourlyPresentWeatherType STRING,
   HourlyPressureChange STRING,
   HourlyPressureTendency STRING,
   HourlyRelativeHumidity STRING,
   HourlySkyConditions STRING,
   HourlyVisibility STRING,
   HourlyWetBulbTemperature STRING,
   HourlyWindDirection STRING,
   HourlyWindGustSpeed STRING,
   HourlyWindSpeed STRING
);
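
To confirm that both tables exist and the columns look the way you expect, you can list the tables and describe one of them:

-- List the tables in the quickstart database and inspect the crashdata schema
SHOW TABLES;
DESCRIBE crashdata;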

Load the Data

Switch to the terminal in which you downloaded the datasets and execute these curl commands from the shell running in the `starrocks-fe` container. Press ENTER when prompted for a password.

curl --location-trusted -u root             \
    -T ./NYPD_Crash_Data.csv                \
    -H "label:crashdata-0"                  \
    -H "column_separator:,"                 \
    -H "skip_header:1"                      \
    -H "enclose:\""                         \
    -H "max_filter_ratio:1"                 \
    -H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \
    -XPUT http://localhost:8030/api/quickstart/crashdata/_stream_load
curl --location-trusted -u root             \
    -T ./72505394728.csv                    \
    -H "label:weather-0"                    \
    -H "column_separator:,"                 \
    -H "skip_header:1"                      \
    -H "enclose:\""                         \
    -H "max_filter_ratio:1"                 \
    -H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \
    -XPUT http://localhost:8030/api/quickstart/weatherdata/_stream_load

Return to MinIO at http://localhost:9001 to verify that the data has been successfully loaded.
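
A quick sanity check in your SQL client is to count the rows that landed in each table; the exact counts depend on the versions of the public datasets you downloaded.

USE quickstart;
SELECT COUNT(*) FROM crashdata;
SELECT COUNT(*) FROM weatherdata;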

Explore the Data

Return to your SQL client. Let's run some queries against the data we just loaded, starting with one that gives a sense of the relationship between precipitation, crashes, and the day and time of the week.

SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,
       TRUNCATE(AVG(w.HourlyDryBulbTemperature), 1) AS Temp_F,
       MAX(w.HourlyPrecipitation) AS Precipitation,
       DATE_FORMAT(c.CRASH_DATE, '%d %b %Y %H:00') AS Hour
FROM crashdata c
LEFT JOIN weatherdata w
ON DATE_FORMAT(c.CRASH_DATE, '%Y-%m-%d %H:00:00') = DATE_FORMAT(w.DATE, '%Y-%m-%d %H:00:00')
WHERE DAYOFWEEK(c.CRASH_DATE) BETWEEN 2 AND 6
GROUP BY Hour
ORDER BY Crashes DESC
LIMIT 200;
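
If you want to keep exploring, here is a simpler follow-up query that uses only the crash table, counting crashes by borough:

-- Count distinct collisions per borough, busiest boroughs first
SELECT BOROUGH,
       COUNT(DISTINCT COLLISION_ID) AS Crashes
FROM crashdata
GROUP BY BOROUGH
ORDER BY Crashes DESC;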

Cloud-Native Analytics with StarRocks and MinIO

The combination of StarRocks and MinIO empowers enterprises with a flexible, scalable, and cost-efficient data architecture. The separation of compute and storage resources allows for independent scaling, optimizes resource utilization, and leads to significant cost savings. This approach not only enhances performance but also simplifies management and ensures better resource isolation.

As data becomes a cornerstone of decision-making, the ability to analyze it and derive insights quickly and efficiently is paramount. StarRocks, in tandem with MinIO, is a robust solution worthy of residing in the modern data stack.

Any questions? Reach out to us on Slack or talk to us at hello@min.io. We’d be happy to help.
