Introducing Spark-Select for MinIO Data Lakes

Nitish Tiwari on S3, 17 March 2019

When early object storage APIs were developed, they focused on the efficient storage and retrieval of objects. Driven by Amazon’s success with S3, the robust S3 API quickly became the de facto standard for object storage in the cloud.

MinIO, recognizing this, invested heavily in creating the most compliant implementation of the S3 API outside of Amazon. This, in turn, made MinIO the standard in private cloud object storage, as evidenced by the more than 200M Docker pulls to date.

As with any technology, however, the object storage API needs to evolve and adapt to changing user requirements. In this case, those requirements are being driven by emerging big data, analytics and machine learning workflows. One dynamic area of evolution for the S3 API is helping users make the most of their data lakes.

This evolution is important because AI, ML/DL and other analytic approaches are taking center stage in enterprise data strategy today. Such workloads seldom care about an object per se. Instead, they need access to the filtered data that is relevant to a particular job.

This led to the creation of the S3 Select API, which essentially bakes SQL query capabilities right into the object store. MinIO recently rolled out its implementation of the Select API as well. Users can execute Select queries on their objects and retrieve the relevant subset of an object instead of having to download the whole object.

In this post, we’ll talk about Spark, one of the most popular data analytics platforms in the big data ecosystem. Specifically, we’ll take a look at Select support in MinIO and how it complements Spark and similar frameworks. Finally, we’ll take a look at the recently released MinIO Spark-Select and understand how it improves query performance by leveraging SQL support in MinIO.

MinIO S3 Select API support

The typical data flow prior to the release of the Select API would look like this:

Applications download the whole object, using GetObject()
Load the object into local memory.
Start the query process while the object resides in memory.
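The three steps above can be sketched in a few lines of Python. This is only an illustration of client-side filtering, not code from MinIO: the helper name filter_rows_locally, the column names and the tiny sample object are all made up for the example, and the commented-out get_object call assumes an already configured boto3 client.

```python
import csv
import gzip
import io
from typing import Dict, List


def filter_rows_locally(gzipped_csv: bytes, column: str, needle: str) -> List[Dict[str, str]]:
    """Hypothetical helper: filter a CSV object that is already fully in local memory."""
    # Step 2: decompress and parse the entire object locally.
    text = gzip.decompress(gzipped_csv).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    # Step 3: run the "query" over every row and keep only the matches.
    return [row for row in reader if needle in row[column]]


# Step 1 would be a full download of the object, for example (assumed client setup):
#   body = s3.get_object(Bucket="mycsvbucket",
#                        Key="sampledata/TotalPopulation.csv.gz")["Body"].read()
# A tiny in-memory object with made-up values stands in for that download here.
sample = gzip.compress(b"Location,PopTotal\nAlpha,10\nBeta,20\nAlphaville,30\n")
matches = filter_rows_locally(sample, "Location", "Alpha")
print(len(sample), "bytes transferred,", len(matches), "rows kept")
```

Note that every byte of the object crosses the network and is parsed locally, even though only some of the rows are of interest.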

With the S3 Select API, applications can now download a specific subset of an object: only the subset that satisfies the given Select query. This directly translates into efficiency and performance:

Reduced bandwidth requirements
Optimized use of compute resources and memory
With the smaller memory footprint, more jobs can run in parallel on the same compute resources
As jobs finish faster, analysts and domain experts are better utilized

An application can use the S3 Select API through the AWS SDK. Let us see an example of using the MinIO Select API with aws-sdk-python.

To get started, you’ll need a running MinIO server instance and mc configured to talk to it. Then download a sample CSV file and upload it to a bucket on the MinIO server.

$ curl "https://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2017_TotalPopulationBySex.csv" > TotalPopulation.csv

$ gzip TotalPopulation.csv
$ mc mb myminio/mycsvbucket
$ mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/

Then install aws-sdk-python and use it to query the CSV file right on the MinIO server.
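A minimal sketch of such a query with boto3 (the AWS SDK for Python) might look like the following. The endpoint URL, the credential placeholders, the helper names and the Location column in the example query are assumptions for illustration, so adjust them to match your own deployment and data.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for select_object_content on a gzipped CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # The sample object is a gzip-compressed CSV whose first line is a header.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"CSV": {}},
    }


def run_select(client, request):
    """Run the query and return (matching records as text, stats details)."""
    response = client.select_object_content(**request)
    chunks = []
    stats = {}
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]
    # Decode once at the end: a multi-byte character may span two record chunks.
    return b"".join(chunks).decode("utf-8"), stats


def make_client():
    """Create an S3 client pointed at a local MinIO server (assumed endpoint and keys)."""
    import boto3  # imported here so the helpers above work without boto3 installed

    return boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="YOUR-ACCESS-KEY",
        aws_secret_access_key="YOUR-SECRET-KEY",
        region_name="us-east-1",
    )


# Example usage against a running server (the column name is an assumption):
#   request = build_select_request(
#       "mycsvbucket",
#       "sampledata/TotalPopulation.csv.gz",
#       "select * from s3object s where s.Location like '%United States%'",
#   )
#   records, stats = run_select(make_client(), request)
#   print(records)
#   print(stats["BytesScanned"], stats["BytesProcessed"])
```

Only the rows matching the expression travel over the network. The stats event reports how many bytes the server scanned and processed to produce them.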

For details, refer to the MinIO Select API documentation: https://docs.minio.io/docs/minio-select-api-quickstart-guide.html

MinIO Spark-Select

With MinIO Select API support now generally available, any application can leverage this API to offload query jobs to the MinIO server itself.

However, integrating the Select API with an application like Spark, which is already used by thousands of enterprises, would have a tremendous impact on the data science landscape, making Spark jobs faster by an order of magnitude.

Technically, it makes perfect sense for Spark SQL to push down possible queries to MinIO and load only the relevant subset of the object into memory for further analysis. This will make Spark SQL faster, use fewer compute and memory resources, and allow more Spark jobs to run concurrently.

To support this, we recently released the Spark-Select project to integrate the Select API with Spark SQL. The Spark-Select project is available under Apache License V2.0 on:

GitHub (https://github.com/minio/spark-select)
Spark packages (https://spark-packages.org/package/minio/spark-select).

The Spark-Select project works as a Spark data source, implemented via the DataFrame interface. At a very high level, Spark-Select works by converting incoming filters into SQL Select statements. It then sends these queries to MinIO. When MinIO responds with the data subset that matches the Select query, Spark exposes it as a regular DataFrame, available for further operations. As with any DataFrame, this data can then be used by any other Spark library, e.g. Spark MLlib, Spark Streaming and others.
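To make the filter-to-SQL step more concrete, here is a small sketch in plain Python. It is not the Spark-Select implementation: the (column, operator, value) tuple format, the function names and the set of supported operators are hypothetical, chosen only to show how a list of pushed-down filters could be turned into a single Select statement.

```python
from typing import List, Sequence, Tuple, Union

Value = Union[int, float, str]
Filter = Tuple[str, str, Value]  # hypothetical shape: (column, operator, value)

ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">="}


def sql_literal(value: Value) -> str:
    """Render a Python value as a SQL literal, escaping single quotes in strings."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def filters_to_select(columns: Sequence[str], filters: List[Filter]) -> str:
    """Combine the required columns and pushed-down filters into one Select query."""
    projection = ", ".join(f"s.{column}" for column in columns) if columns else "*"
    query = f"select {projection} from S3Object s"
    conditions = []
    for column, operator, value in filters:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        conditions.append(f"s.{column} {operator} {sql_literal(value)}")
    if conditions:
        # All pushed-down filters must hold, so they are joined with "and".
        query += " where " + " and ".join(conditions)
    return query


print(filters_to_select(["name", "age"], [("age", ">", 19)]))
# prints: select s.name, s.age from S3Object s where s.age > 19
```

The resulting string is the kind of expression that would be sent to the object store. Anything that cannot be expressed this way would still have to be evaluated on the Spark side.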

Spark-Select currently supports JSON, CSV and Parquet file formats for query pushdowns. This means the object should be in one of these formats for the pushdown to work.

Spark-Select can be integrated with Spark via spark-shell, pyspark, spark-submit, etc. You can also add it as a Maven dependency, an sbt-spark-package or a jar import.

Let’s see an example of using spark-select with spark-shell.

Start MinIO server and configure mc to interact with this server.
Create a bucket and upload a sample file
$ curl "https://raw.githubusercontent.com/minio/spark-select/master/examples/people.csv" > people.csv
$ mc mb myminio/sjm-airlines
$ mc cp people.csv myminio/sjm-airlines
Download the sample code from spark-select repo
$ curl "https://raw.githubusercontent.com/minio/spark-select/master/examples/csv.scala" > csv.scala
Configure Spark with MinIO. Detailed steps are available in this document: https://github.com/minio/cookbook/blob/master/docs/apache-spark-with-minio.md
While starting Spark, use the --packages flag to add the spark-select package

> $SPARK_HOME

After spark-shell is successfully invoked, execute the csv.scala file

scala> :load csv.scala
Loading examples/csv.scala...
import org.apache.spark.sql._
import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
+-------+---+
|   name|age|
+-------+---+
|Michael| 31|
|   Andy| 30|
| Justin| 19|
+-------+---+

scala>

As you can see, only the rows with age > 19 are returned.

I hope the above example gives an idea of how spark-select can push down queries to the MinIO server and help speed up data analysis pipelines.

We welcome you to check out the project at https://github.com/minio/spark-select.

Summary

The world of object storage isn’t just growing, it is changing at the same time. These changes can be seen in the growing number of analytic and machine learning touchpoints that are appearing in the ecosystem. While this post focuses on SQL Select from a strategic and tactical perspective, there will be more posts in the coming weeks that discuss other analytical frameworks.

The key takeaway is that object storage is rapidly moving past the traditional use cases of disaster recovery and archiving and into more dynamic use cases that emphasize analytics and machine learning. SQL, as the lingua franca of data, is critical to the success of those use cases.

All of this underscores the increased importance that object storage holds in the enterprise, in both the public and private cloud. It also draws a distinction between legacy object storage and cloud native object storage solutions. Ultimately that means Amazon for the public cloud and MinIO for the private cloud.
