Simplify Data Pipelines

Enterprise data lakehouses are a necessity for any company competing in today's data-driven world. For a useful centralized data lakehouse to exist, data must be ingested from disparate sources. Because each source has its own format and protocol, getting this data into and out of the data lakehouse in a consumable form is a basic requirement. If data is a competitive advantage, then the timeliness with which it flows into the data lakehouse from across the organization and its partners is paramount for success.

Of course, the obvious solution to this is to rewrite all the applications to conform to the API of the data store, but we all know that is not practical. That was how the industry thought when enterprises were building data warehouses, but the industry has evolved. Instead, the best practice today is to build data pipelines, composed of multiple modular steps that move, operate on, filter, or enrich data as it moves towards a centralized data lakehouse. When every step is written to use the S3 API, data pipelines evolve from a manually created and maintained mess of code into an elegant, straightforward and automated way of feeding data lakehouses that requires far less maintenance.
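To make the idea of modular steps concrete, here is a minimal sketch in Python. The step names (drop_incomplete, add_region) and the record fields are hypothetical and chosen only for illustration; they are not part of MinIO or of any particular pipeline.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Record = dict
Step = Callable[[Iterable[Record]], Iterator[Record]]


def drop_incomplete(records: Iterable[Record]) -> Iterator[Record]:
    """Filter step: skip records that have no customer_id (hypothetical field)."""
    for record in records:
        if record.get("customer_id") is not None:
            yield record


def add_region(records: Iterable[Record]) -> Iterator[Record]:
    """Enrich step: derive a coarse region from the country code (hypothetical rule)."""
    for record in records:
        region = "EU" if record.get("country") in {"DE", "FR"} else "OTHER"
        yield {**record, "region": region}


def run_pipeline(records: Iterable[Record], steps: list[Step]) -> list[Record]:
    """Feed the output of each step into the next one, in order."""
    return list(reduce(lambda data, step: step(data), steps, records))
```

Because each step only consumes and produces records, steps can be added, removed or reordered without touching the others, which is the property that keeps a pipeline maintainable.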

One solution that has gained traction with enterprise architects is to move away from the “one more layer” approach and instead to build data pipelines using the functionality built into the storage layer. The disorder of transferring files from FTP server to FTP server using scripts that run all over the enterprise becomes logical and orderly when collapsed into a data pipeline built on object storage. Leveraging Object Lambda to invoke scripts that move data through each stage of the pipeline is remarkably compact, streamlined and simple.    

At MinIO, we provide a standard interface, the S3 API, as our native interface, and all cloud-native applications already support this interface. One of the fundamental guiding principles for us at MinIO is that systems must be simple. Simplicity reduces human errors and leads to automation. Having to rely on a complex external data pipeline infrastructure is the antithesis of that principle.

We created the MinIO Batch Framework to simplify building and operating data pipelines and make it easy to transfer data to the enterprise data lakehouse. Then we developed our first built-in pipeline, Batch Replication, providing the ability to move customer-selected data in batches between MinIO deployments. Enterprise adoption of Batch Replication was so rapid that we knew we had delivered something really useful.

Then we addressed FTP and SFTP, among the most common protocols used to move content in the enterprise; FTP is almost as old as the internet itself. As of RELEASE.2023-04-20T17-56-55Z, we've added FTP/SFTP functionality to MinIO Server that you can enable by adding two lines to your MinIO configuration. MinIO Server not only eliminates the need for any external FTP server to proxy and translate FTP traffic to S3, but it also provides a fully distributed, highly scalable FTP server. The same mechanisms that make MinIO great for S3 also make MinIO great for FTP/SFTP: stellar performance, data durability, fault-tolerance and security.
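As an illustration of what this means for an existing FTP-based workflow, the sketch below uses Python's standard ftplib to write a file into a bucket. It assumes that the FTP listener is enabled, that buckets appear as top-level directories, and that a MinIO access key and secret key serve as the FTP username and password. The host, port, bucket and object names in the usage note are placeholders, not values taken from this post.

```python
import io
from ftplib import FTP


def connect(host: str, port: int, access_key: str, secret_key: str) -> FTP:
    """Open an FTP session, logging in with MinIO credentials (assumed mapping)."""
    ftp = FTP()
    ftp.connect(host, port)
    ftp.login(access_key, secret_key)
    return ftp


def upload_to_bucket(ftp, bucket: str, object_name: str, data: bytes) -> str:
    """Store data as <bucket>/<object_name> over an already-connected session.

    Assumes the server exposes each bucket as a top-level directory.
    """
    remote_path = f"{bucket}/{object_name}"
    ftp.storbinary(f"STOR {remote_path}", io.BytesIO(data))
    return remote_path
```

A caller might run connect("localhost", 8021, "ACCESS_KEY", "SECRET_KEY"), pass the session to upload_to_bucket(ftp, "raw-data", "orders/2023-04-20.csv", payload), and then call ftp.quit(). Once stored, the same object should be readable through the S3 API by any S3-compatible client.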

We added these methods for building straightforward data pipelines to make it simpler to build a data lakehouse with MinIO. Pipelines feed data in any format into the data lakehouse, and the S3 API exposes that data to cloud-native analytics and AI/ML workloads. Please see MinIO for Modern Datalakes for examples.

A lot of enterprises are locked into FTP because a good deal of proprietary business software is limited to using FTP for data import/export. Some software archiving and data science applications use FTP to distribute software artifacts or data sets. Many CRM, ERP and supply chain applications use SFTP to transfer data. These applications cannot simply switch from FTP or SFTP because this requires changing existing application internals and workflows. Changing out FTP or SFTP in favor of S3 is neither practical nor feasible. 

With MinIO, enterprises are not forced to make a choice. They can literally use FTP and SFTP to move that data into an S3-like data store. It is the principle of AND not OR. 

Evolving from myriad standalone FTP servers to MinIO is a tremendous enabler for data and development teams. They now have all the MinIO goodness at their disposal: data protection using erasure coding, performance at scale, integrations with existing management tooling, AWS-style IAM and PBAC, and lifecycle management/tiering. The data lakehouse is fed with FTP, and developers are still free to work with their favorite S3-compatible cloud-native tools.

Automate the Tedium of Feeding the Data Lakehouse

MinIO, with SFTP/FTP enabled, overcomes many of the pain points of building and maintaining data pipelines using other FTP servers. If you're using a series of scripts and FTP servers to ETL files around the enterprise, you now have an opportunity to simplify and enhance this process using cloud-native applications and MinIO Lambda Notifications to feed a data lakehouse of your choice (we love open table formats like Apache Iceberg, Delta and Apache Hudi) and access it via the S3 API for reporting, analytics and AI/ML. For example, see Orchestrate Complex Workflows Using Apache Kafka and MinIO and A Complete Workflow for Log File Anomaly Detection with R, H2O and MinIO.

Between S3 and FTP, all enterprise data can make its way to MinIO. Store it in a bucket in its raw format. When the file is written to the bucket, a Lambda Notification triggers scripts to filter, transform, run ML models and move data as objects from bucket to bucket. It's all very neat and clean, and because all the scripts use the S3 API, they are simpler to maintain than scripts written around application APIs, FTP servers that can move and file systems that can change.
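A script triggered this way usually begins by working out which objects arrived and where they should go next. The sketch below assumes the notification payload follows the S3 event record layout (a Records list with s3.bucket.name and a URL-encoded s3.object.key). The routing table and bucket names are hypothetical, and the copy itself is left to whichever S3 client the script uses.

```python
from urllib.parse import unquote_plus

# Hypothetical routing table: file suffix -> destination bucket.
ROUTES = {".csv": "staged-csv", ".json": "staged-json"}


def objects_from_event(event: dict):
    """Yield (bucket, key) for each object-created record in an S3-style event."""
    for record in event.get("Records", []):
        if "ObjectCreated:" not in record.get("eventName", ""):
            continue  # ignore deletes and other event types
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3-style events, so decode them.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def plan_moves(event: dict, routes: dict = ROUTES) -> list:
    """Return (source_bucket, key, destination_bucket) for each routable object."""
    moves = []
    for bucket, key in objects_from_event(event):
        for suffix, destination in routes.items():
            if key.lower().endswith(suffix):
                moves.append((bucket, key, destination))
                break
    return moves
```

Keeping the planning logic separate from the copy makes each stage easy to test without a running server; the script would then perform each planned move through the S3 API.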

Freedom to Innovate

The architecture of using data pipelines to feed enterprise data lakehouses for reporting, analytics and ML will continue to exist for many years to come. Our goal is to remove the manual tedium from the process so developers can free themselves from drudgery and instead innovate and add value. Put MinIO to work and decrease the time and effort required to maintain data pipelines, even the data pipelines you've already built using FTP.  

Download MinIO today and start putting advanced object storage features to work in your data pipelines. Do you have questions about data pipelines? Email us at hello@min.io or ask on our community Slack channel.