Introducing Spark-Select for MinIO Data Lakes

When early object storage APIs were developed, they focused on the efficient storage and retrieval of objects. Following Amazon’s success with S3, the robust S3 API quickly became the de facto standard for object storage in the cloud.

MinIO, recognizing this, invested heavily in creating the most compliant implementation of the S3 API outside of Amazon. This, in turn, made MinIO the standard in private cloud object storage, as evidenced by the more than 200M Docker pulls to date.

As with any technology, however, the object storage API needs to evolve and adapt to changing user requirements. In this case, those requirements are being driven by emerging big data, analytics and machine learning workflows. One dynamic area of evolution for the S3 API is helping users make the most of their data lakes.

This evolution is important because AI, ML/DL and other analytic approaches are taking center stage in enterprise data strategy today. Such workloads seldom care about an object per se; instead, they need access to the filtered data that is relevant to a particular job.

This led to the creation of the S3 Select API, which is essentially SQL query capability baked right into the object store. MinIO recently rolled out its implementation of the Select API as well. Users can execute Select queries on their objects and retrieve only the relevant subset of an object, instead of having to download the whole object.

In this post, we’ll talk about one of the most popular data analytics platforms in the big data ecosystem: Spark. Specifically, we’ll take a look at Select support in MinIO and how it complements Spark and similar frameworks. Finally, we’ll take a look at the recently released MinIO Spark-Select and understand how it improves query performance by leveraging SQL support in MinIO.

MinIO S3 Select API support

The typical data flow prior to the release of the Select API would look like this:

  • Applications download the whole object, using GetObject()
  • Load the object into local memory.
  • Start the query process while the object resides in memory.
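The three steps above can be sketched roughly as follows. This is a minimal illustration, assuming a boto3-style client whose get_object call returns a dict with a readable Body; the function name query_by_full_download and the predicate parameter are hypothetical and exist only for this example.

```python
import csv
import gzip
import io


def query_by_full_download(client, bucket, key, predicate):
    """Pre-Select flow: fetch the whole object, then filter it locally."""
    # 1. Download the whole object (boto3-style GetObject call).
    response = client.get_object(Bucket=bucket, Key=key)

    # 2. Load the object into local memory, decompressing if needed.
    raw = response["Body"].read()
    if key.endswith(".gz"):
        raw = gzip.decompress(raw)

    # 3. Run the query while the whole object resides in memory.
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return [row for row in reader if predicate(row)]
```

Every byte of the object crosses the network and sits in memory here, even when only a handful of rows match the predicate.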

With the S3 Select API, applications can now download a specific subset of an object: only the subset that satisfies the given Select query. This directly translates into efficiency and performance:

  • Reduced bandwidth requirements
  • Optimized use of compute resources and memory
  • A smaller memory footprint, so more jobs can run in parallel on the same compute resources
  • Faster job completion, so the time of analysts and domain experts is better utilized

An application can use the S3 Select API through the AWS SDK. Let us look at an example that calls the MinIO Select API using aws-sdk-python (boto3).

To get started, you’ll need to have a MinIO server instance up and mc configured to talk to this instance. Then, download a sample CSV file and upload it to the relevant bucket on the MinIO server.

$ curl "https://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2017_TotalPopulationBySex.csv" > TotalPopulation.csv
$ gzip TotalPopulation.csv
$ mc mb myminio/mycsvbucket
$ mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/

Then install aws-sdk-python (boto3) and use it to query the CSV file directly on the MinIO server.
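A minimal sketch of such a query with boto3's select_object_content call is shown here. The endpoint URL, the credential placeholders, the helper names (build_select_request, run_select, main) and the Location column used in the query are assumptions for illustration; adjust them to match your server and the header of the uploaded file.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for boto3's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # The sample object is a gzipped CSV whose first line is a header.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"CSV": {}},
    }


def run_select(client, request):
    """Send the query and return the matching records as text."""
    response = client.select_object_content(**request)
    chunks = []
    # The payload is an event stream: row data arrives in 'Records'
    # events; other event types (Stats, End, ...) are ignored here.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


def main():
    import boto3  # imported lazily so the helpers above work without it

    # Placeholder endpoint and credentials for a local MinIO server.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="YOUR-ACCESS-KEY",
        aws_secret_access_key="YOUR-SECRET-KEY",
        region_name="us-east-1",
    )
    request = build_select_request(
        "mycsvbucket",
        "sampledata/TotalPopulation.csv.gz",
        # 'Location' is assumed to be a column in the sample file.
        "select * from s3object s where s.Location like '%United States%'",
    )
    print(run_select(s3, request))
```

Calling main() against a running server should print only the rows that match the expression; the rest of the object never leaves the server.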

Refer to the detailed documentation on the MinIO Select API here: https://docs.minio.io/docs/minio-select-api-quickstart-guide.html

MinIO Spark-Select

With MinIO Select API support now generally available, any application can leverage this API to offload query jobs to the MinIO server itself.

However, integrating the Select API with an application like Spark, which is already used by thousands of enterprises, would have a tremendous impact on the data science landscape, potentially making Spark jobs faster by an order of magnitude.

Technically, it makes perfect sense for Spark SQL to push down eligible queries to MinIO and load only the relevant subset of the object into memory for further analysis. This makes Spark SQL faster, uses fewer compute and memory resources, and allows more Spark jobs to run concurrently.

To support this, we recently released the Spark-Select project to integrate the Select API with Spark SQL. The Spark-Select project is available under Apache License V2.0 on

The Spark-Select project works as a Spark data source, implemented via the DataFrame interface. At a very high level, Spark-Select works by converting incoming filters into SQL Select statements and sending these queries to MinIO. MinIO responds with the subset of data that matches the Select query, and Spark makes it available as a regular DataFrame for further operations. As with any DataFrame, this data can then be used by any other Spark library, e.g. Spark MLlib, Spark Streaming and others.
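To make the filter-to-SQL idea concrete, here is a small, self-contained sketch of such a translation. It is not the Spark-Select implementation: the Filter type, the to_select_sql function and the set of supported operators are all hypothetical, chosen only to show how a projection and a list of pushed-down comparisons could become a single Select statement.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Filter:
    """One pushed-down comparison, e.g. age > 19 (hypothetical type)."""
    column: str
    op: str  # one of =, !=, <, <=, >, >=
    value: Union[int, float, str]


_ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def _literal(value):
    # Quote strings SQL-style, doubling any embedded single quotes.
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_select_sql(columns: List[str], filters: List[Filter]) -> str:
    """Render a projection plus AND-ed filters as one Select statement."""
    projection = ", ".join(f"s.{c}" for c in columns) if columns else "*"
    sql = f"SELECT {projection} FROM S3Object s"
    clauses = []
    for flt in filters:
        if flt.op not in _ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {flt.op}")
        clauses.append(f"s.{flt.column} {flt.op} {_literal(flt.value)}")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql
```

For example, to_select_sql(["name", "age"], [Filter("age", ">", 19)]) yields a statement that selects both columns with a WHERE clause on age, which the object store can evaluate before any data is sent back.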

Spark-Select currently supports the JSON, CSV and Parquet file formats for query pushdown. This means the object must be in one of these formats for the pushdown to work.


Spark-Select can be integrated with Spark via spark-shell, pyspark, spark-submit, etc. You can also add it as a Maven dependency, an sbt-spark-package or a jar import.

Let’s see an example of using spark-select with spark-shell.

> $SPARK_HOME
  • After spark-shell is successfully invoked, execute the csv.scala file
scala> :load csv.scala
Loading examples/csv.scala...
import org.apache.spark.sql._
import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
+-------+---+
|   name|age|
+-------+---+
|Michael| 31|
|   Andy| 30|
| Justin| 19|
+-------+---+
scala>

As you can see, only the rows with age > 19 are returned.
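For readers who prefer pyspark, a comparable session might look like the sketch below. It assumes the spark-select package is already on the Spark classpath, that the S3 endpoint and credentials for the MinIO server are set in the Spark configuration, and that the data source names match those listed in the project README; the bucket path, the helper names and the application name are hypothetical.

```python
# Data source names as listed in the spark-select README (assumed unchanged).
SELECT_FORMATS = {
    "csv": "minioSelectCSV",
    "json": "minioSelectJSON",
    "parquet": "minioSelectParquet",
}


def select_format(file_type):
    """Map a file type to the matching Spark-Select data source name."""
    try:
        return SELECT_FORMATS[file_type.lower()]
    except KeyError:
        raise ValueError(f"no pushdown support for file type: {file_type}")


def show_filtered(path="s3://testbucket/people.csv"):
    # Imported lazily so select_format() can be used without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )

    spark = SparkSession.builder.appName("spark-select sketch").getOrCreate()

    # Same two-column schema as in the spark-shell transcript above.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), False),
    ])

    df = spark.read.format(select_format("csv")).schema(schema).load(path)

    # The filter is a candidate for pushdown to the MinIO server.
    df.filter("age > 19").show()
```

Because the result is an ordinary DataFrame, further transformations can be chained after the filter in the usual way.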

I hope the above example gives an idea of how spark-select can push queries down to the MinIO server and help speed up data analysis pipelines.

We welcome you to check out the project at https://github.com/minio/spark-select.

Summary

The world of object storage isn’t just growing, it is changing at the same time. These changes can be seen in the growing number of analytic and machine learning touchpoints that are appearing in the ecosystem. While this post focuses on SQL Select from a strategic and tactical perspective, there will be more posts in the coming weeks that discuss other analytical frameworks.

The key takeaway is that object storage is rapidly moving past the traditional use cases of disaster recovery and archiving and into more dynamic use cases that emphasize analytics and machine learning. SQL, as the lingua franca of data, is critical to the success of those use cases.

All of this underscores the increased importance that object storage holds in the enterprise, in both the public and the private cloud. It also sharpens the distinction between legacy object storage and cloud native object storage solutions. Ultimately, that means Amazon for the public cloud and MinIO for the private cloud.