MinIO and Apache Tika: A Pattern for Text Extraction
Tl;dr: In this post, we will use MinIO Bucket Notifications and Apache Tika, for document text extraction, which is at the heart of critical downstream tasks like Large Language Model (LLM) training and Retrieval Augmented Generation (RAG). The Premise Let’s say that I want to construct a dataset of text that I can then use to fine-tune an
Read more...