Keith Pijanowski - MinIO Blog (Page 2)

Setting Up A Development Machine with MLRun and MinIO

Keith Pijanowski on AI/ML | 10 May 2024

MLOps is to machine learning what DevOps is to traditional software development. Both are a set of practices and principles aimed at improving collaboration between engineering teams (the Dev or ML) and IT operations (Ops) teams. The goal is to streamline the development lifecycle, from planning and development to deployment and operations, using automation. One of the primary benefits of

Improve RAG Performance with Open-Parse Intelligent Chunking

Keith Pijanowski on AI/ML | 24 April 2024

If you are implementing a generative AI solution using Large Language Models (LLMs), you should consider a strategy that uses Retrieval-Augmented Generation (RAG) to build contextually aware prompts for your LLM. An important process that occurs in the preproduction pipeline of a RAG-enabled LLM is the chunking of document text so that only the most relevant sections of a document

The Architect’s Guide: A Modern Datalake Reference Architecture

Keith Pijanowski on Modern Data Lakes | 5 April 2024

An abbreviated version of this post appeared on The New Stack on March 26th, 2024. Businesses aiming to maximize their data assets are adopting scalable, flexible, and unified data storage and analytics approaches. This trend is driven by enterprise architects tasked with crafting infrastructures that align with evolving business demands. A Modern Datalake architecture addresses this need by integrating the

The Full Stack AI Engineer: A Modern-Day Polymath

Keith Pijanowski on AI/ML | 2 April 2024

Anyone who has worked in a team environment knows that every successful team has one go-to person—that special individual who can help you regardless of the nature of your problem. On a traditional software development team, this individual is an expert programmer and is also an expert in one other technology, which could be a database technology like Snowflake

Architect’s Guide to a Reference Architecture for an AI/ML Datalake

Keith Pijanowski on Architect's Guide | 26 March 2024

An abbreviated version of this post appeared on The New Stack on March 19th, 2024. In enterprise artificial intelligence, there are two main types of models: discriminative and generative. Discriminative models are used to classify or predict data, while generative models are used to create new data. Even though Generative AI has dominated the news of late, organizations are still

MinIO Cache: A Distributed DRAM Cache for Ultra-Performance

Keith Pijanowski on AI/ML | 12 March 2024

As the computing world has evolved and the price of DRAM has plummeted, we find that server configurations often come with 500GB or more of DRAM. When you are dealing with larger deployments, even those with ultra-dense NVMe drives, the number of servers multiplied by the DRAM on those servers can quickly add up – often to several TBs. That DRAM

Hungry GPUs Need Fast Object Storage

Keith Pijanowski on AI/ML | 31 January 2024

A chain is as strong as its weakest link - and your AI/ML infrastructure is only as fast as your slowest component. If you train machine learning models with GPUs, then your weak link may be your storage solution. The result is what I call the “Starving GPU Problem.” The Starving GPU problem occurs when your network or your

The Strengths, Weaknesses and Dangers of LLMs

Sidharth Rajaram (@sidharrrrrth), Keith Pijanowski on AI/ML | 25 January 2024

Much has been said lately about the wonders of Large Language Models (LLMs). Most of these accolades are deserved. Ask ChatGPT to describe the General Theory of Relativity and you will get a very good (and accurate) answer. However, at the end of the day ChatGPT is still a computer program (as are all other LLMs) that is blindly executing

Building an S3 Compliant Stock Market Data Lake with MinIO

Keith Pijanowski on Delta Lake | 18 January 2024

In all my previous posts on MinIO, where I had to write code, I used MinIO’s Python SDK, which is documented here. I prefer this SDK because it is easy to use and it provides programmatic access to MinIO’s enterprise features, such as Lifecycle Management, Object Locking, Bucket Notifications, and Site Replication. (I showed how to set up

Distributed Training and Experiment Tracking with Ray Train, MLflow, and MinIO

Keith Pijanowski on AI/ML | 28 December 2023

Over the past few months, I have written about a number of different technologies (Ray Data, Ray Train, and MLflow). I thought it would make sense to pull them all together and deliver an easy-to-understand recipe for distributed data preprocessing and distributed training using a production-ready MLOps tool for tracking and model serving. This post integrates the code I presented

Distributed Training with Ray Train and MinIO

Keith Pijanowski on AI/ML | 20 December 2023

Most machine learning projects start off as a single-threaded proof of concept where each task is completed before the next task can begin. The single-threaded ML pipeline depicted below is an example. However, at some point, you will outgrow the pipeline shown above. This may be caused by datasets that no longer fit into the memory of a single process.

The Foundation of Your AI Infrastructure: A Modern Datalake

Keith Pijanowski on AI/ML | 12 December 2023

Amid the fervor to adopt AI is a critical and often overlooked truth - the success of any AI initiative is intrinsically tied to the quality, reliability and performance of the underlying data infrastructure. If you don't have the proper foundation, you are limited in what you can build and therefore what you can achieve. Your data infrastructure

Distributed Data Processing with Ray Data and MinIO

Keith Pijanowski on AI/ML | 4 December 2023

Distributed data processing is a key component of an efficient end-to-end distributed machine-learning training pipeline. This is true if you are building a basic neural network for statistical predictions where distributed training could mean each experiment runs in 10 minutes vs. an hour. It is also true if you are training or fine-tuning a Large Language Model (LLM) where

Generative AI for the Enterprise

Keith Pijanowski on AI/ML | 8 November 2023

Generative AI represents the latest technique an enterprise can employ to unlock the data trapped within its boundaries. The easiest way to conceptualize what is possible with Generative AI is to imagine a customized Large Language Model - similar to the one powering ChatGPT - running inside your firewall. Now, this custom LLM is not the same as the

Integrating MinIO with Hugging Face Datasets

Keith Pijanowski on AI/ML | 23 October 2023

Hugging Face's DatasetDict class is a part of the Datasets library and is designed to make working with datasets destined for any model found on the Hugging Face Hub efficient. As the name implies, the DatasetDict class is a dictionary of datasets. The best way to understand objects created from this class is to look at a quick

Fine-Tuning Large Language Models with Hugging Face and MinIO

Keith Pijanowski on AI/ML | 2 October 2023

In a previous post, I presented feature extraction, which is a technique for utilizing pre-trained Large Language Models (LLMs) to solve a custom problem without having to retrain the model. Feature extraction is one of two ways to use the knowledge a model already has for a task that is different from what the model was originally trained to

Feature Extraction with Large Language Models, Hugging Face and MinIO

Keith Pijanowski on AI/ML | 26 September 2023

In this post, I am going to present a technique that every engineer should know for utilizing open source large models. Specifically, I will show how to perform feature extraction. Feature extraction is one of two ways to use the knowledge a model already has for a task that is different from what the model was originally trained to

The Disruptive Nature of Data Lakehouses

Keith Pijanowski on Apache Iceberg | 12 September 2023

In 1997, Clayton Christensen, in his book The Innovator’s Dilemma, identified a pattern of innovation that tracked the capabilities, cost, and adoption by market segment between an incumbent and a new entrant. He labeled this pattern “Disruptive Innovation.” Not every successful product is disruptive - even if it causes well-established businesses to lose market share or even fail

Building a Data Lakehouse using Apache Iceberg and MinIO

Keith Pijanowski on AI/ML | 31 August 2023

In a previous post, I provided an introduction to Apache Iceberg and showed how it uses MinIO for storage. I also showed how to set up a development machine. To do this, I used Docker Compose to install an Apache Spark container as the processing engine, a REST catalog, and MinIO for storage. I concluded with a very simple

A Developer’s Introduction to Apache Iceberg using MinIO

Keith Pijanowski on AI/ML | 24 August 2023

Open Table Formats (OTFs) are a phenomenon in the data analytics world that has been gaining momentum recently. The promise of OTFs is as a solution that leverages distributed computing and distributed object stores to provide capabilities that exceed what is possible with a Data Warehouse. The open aspect of these formats gives organizations options when it comes to
