Microblink: Repatriating Compute and Storage with MinIO
Microblink is an AI company specializing in image detection. They got their start in the identity space with products like BlinkID, BlinkID Verify, and BlinkCard. More recently, their image detection capabilities have led to products that can process other types of images. For example, product detection can be performed on receipts, whereby product descriptions on a receipt are used to look up product SKUs and other details. Below is a full list and description of Microblink’s products.
BlinkID: Scan and extract data from identity documents such as driver's licenses and passports. Identity data is returned in JSON format.
BlinkID Verify: Confirm the validity of an identity document. Scanned identity documents can be checked for barcode authenticity, face photo tampering, photocopying, data inconsistencies, and screens (screen detection determines whether the identity document is really just an image displayed on another screen).
BlinkCard: Scan and extract data from credit cards. It also checks credit cards for fraud by detecting the use of a screen (the user does not physically possess the credit card), the presence of hands on the card (the user is actually holding the card), and a photocopy of the card.
BlinkReceipt: This product converts an image of a receipt into text. It can then take product descriptions or codes and look up SKUs and SKU-level data.
BlinkShelf: Product recognition. Users can scan grocery shelves using a mobile device and detect products, including each product’s universal product code (UPC).
At the heart of each of these products is an AI model trained for the appropriate image recognition task. However, many of these products go beyond simple image classification (fraudulent or valid) and turn images into structured data. Creating AI models capable of this level of image recognition requires hundreds of hours of training and multiple passes through the training data. It is also a best practice to conduct multiple experiments to test different model architectures and hyperparameter options.
Let’s take a look at how Microblink’s data infrastructure has changed over the years to handle more demanding AI workloads that require increasing volumes of data.
In the Beginning
When Microblink first started, around 2012, their infrastructure was a collection of servers with data residing on the file systems of those servers. If a dataset was needed by workloads on more than one server, a copy of the dataset was created for each workload. Unfortunately, there was never a single source of truth. Imagine a dataset that gets updated, either with new data or because an engineer finds a better set of features for it (a practice commonly known as feature engineering). When these updates occurred, there was no way to know for sure who needed the update. Even if there had been a record of who needed the update, a copy of the dataset would still have needed to be transmitted over the network, which is inefficient.
Every AI engineering team starts this way, and having multiple sources of truth is a common problem when manually managing data using file systems. Microblink decided to try a cloud vendor to solve their data problems.
The Move to the Cloud
The next version of Microblink’s AI data infrastructure utilized Google Cloud Platform (GCP). The goal was to solve the “multiple sources of truth” problem created when data had to be copied from one server to another. However, as the data grew, GCP became a source of friction because GCP cloud storage could not keep up with the demands of model training. Additionally, cloud GPUs are expensive, often unavailable on demand, and cannot be scaled on demand. You have to provision big machines that are often underutilized, incurring significant costs.
Microblink decided to move compute on-premise.
Repatriating Compute
To get around this problem, Microblink decided to set up synchronization between GCP and their on-premise data center. The idea was that GCP would remain the single source of truth, housing the master copy of all data, while on-premise servers would provide quicker access to the data for training. When a dataset was needed for training, synchronization would move it on-premise for high-speed access during model training. This initially provided quicker access to the data during training experiments, and it also reduced costs because the compute needed for training now occurred on-premise, and compute in the cloud is expensive.
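As a rough illustration of this pattern, the sketch below pulls any objects that are missing locally out of a GCS bucket using the google-cloud-storage client. The bucket name, prefix, and local mount point are hypothetical, and this is a minimal one-way copy rather than Microblink's actual synchronization tooling.

```python
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def sync_prefix_to_local(bucket_name: str, prefix: str, dest_dir: str) -> None:
    """Download any objects under `prefix` that are missing or stale locally."""
    client = storage.Client()
    dest = Path(dest_dir)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        target = dest / blob.name
        # Naive freshness check; a production sync would also compare checksums.
        if target.exists() and target.stat().st_size == blob.size:
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))


# Hypothetical bucket, prefix, and mount point.
sync_prefix_to_local("training-data", "blinkid/", "/mnt/datasets")
```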
Unfortunately, synchronizing data between the cloud and on-premise servers introduced a new set of problems. Synchronization was slow, and it kept breaking down. So once again, Microblink was not really satisfied with how they were handling their data, and that data was growing quickly.
At about this same time, Microblink’s on-premise data center was being equipped with Kubernetes clusters. Microblink decided it was time to build something robust and solve their data problems once and for all.
“Once we had Kubernetes in our data center, I told my team, ‘This is prime time to build something robust and great.’” Filip Suste, Engineering Manager - Platform Teams, Microblink
MinIO for Data Repatriation
Microblink decided to implement a cloud-native object storage solution that could serve as both the master copy of all their data and a platform for serving data quickly during model training. They experimented with Ceph initially, but it proved too hard to maintain. They then turned to MinIO. MinIO was much easier for Microblink to set up and maintain, but the biggest benefit they got from MinIO was improved performance. They currently store 75 TB of identity data and identity documents from around the world, growing at 8 TB per year. The data is composed of small, low-resolution images and small documents, which results in a very large number of objects. The speed and bandwidth that MinIO provides during model training allows them to run more experiments per day. It’s no secret that the more training experiments you can run, the faster new ideas can be proven, and the faster value can be delivered once an idea is proven viable. Microblink currently runs about 30 experiments per day.
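To make the access pattern concrete, here is a minimal sketch of streaming training images straight from a MinIO bucket with the official Python SDK. The endpoint, bucket name, prefix, and credentials are placeholders for illustration, not Microblink’s actual configuration.

```python
from minio import Minio  # pip install minio

# Placeholder endpoint and credentials for illustration only.
client = Minio(
    "minio.internal:9000",
    access_key="TRAINING_ACCESS_KEY",
    secret_key="TRAINING_SECRET_KEY",
    secure=False,
)

# Iterate over a hypothetical training prefix and stream each image.
for obj in client.list_objects("identity-documents", prefix="blinkid/train/", recursive=True):
    response = client.get_object("identity-documents", obj.object_name)
    try:
        image_bytes = response.read()  # hand the bytes off to the data loader / decoder
    finally:
        response.close()
        response.release_conn()
```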
“A pleasant surprise for us was the performance increase we noticed from MinIO after a network upgrade.”
Cost savings were also an important benefit. Using MinIO to house all data meant that cloud storage costs were greatly reduced. Additionally, synchronization with the cloud was no longer needed, so ingress and egress charges were also greatly reduced. The bottom line was a 62% cost savings.
Today, Microblink uses GCP for data intake only. Once new data is processed and sent to MinIO, it is removed from cloud storage.
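A hedged sketch of what such an intake flow could look like: pull a newly processed object out of a GCS intake bucket, land it in MinIO, then delete the cloud copy. The bucket names, prefix, and credentials below are assumptions for illustration.

```python
import tempfile
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage
from minio import Minio           # pip install minio

gcs = storage.Client()
minio_client = Minio(
    "minio.internal:9000",
    access_key="INTAKE_ACCESS_KEY",
    secret_key="INTAKE_SECRET_KEY",
    secure=False,
)

with tempfile.TemporaryDirectory() as staging:
    for blob in gcs.list_blobs("intake-bucket", prefix="processed/"):
        local_path = Path(staging) / Path(blob.name).name
        blob.download_to_filename(str(local_path))
        # Land the object in on-premise object storage...
        minio_client.fput_object("identity-documents", blob.name, str(local_path))
        # ...then remove it from cloud storage, keeping GCP for intake only.
        blob.delete()
```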
The Future
Roughly half of the 75 TB of data that Microblink stores is structured data pulled from the various images their customers scan. To improve the storage of this structured data and allow their products to do more with it, Microblink is going to build a Modern Datalake (also known as a Data Lakehouse). A Modern Datalake is one-half data lake (for unstructured data like images) and one-half data warehouse (for structured data), with both built on MinIO. Modern Datalakes are made possible by open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, put forth by Netflix, Uber, and Databricks, respectively. Once Microblink has their Modern Datalake in place, they will have a complete platform for all their data. They will be able to analyze their structured data using advanced capabilities from the Data Warehouse side of their Modern Datalake, and they will use their Data Lake for high-speed model training.
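To illustrate how an open table format pairs with MinIO, here is a hedged sketch of querying an Apache Iceberg table whose data files live in a MinIO bucket, using pyiceberg against a REST catalog. The catalog URI, S3 endpoint, credentials, and table name are all assumptions for illustration, not Microblink’s actual setup.

```python
from pyiceberg.catalog import load_catalog  # pip install "pyiceberg[s3fs,pyarrow]"

# All connection details below are hypothetical.
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://iceberg-rest.internal:8181",
        "s3.endpoint": "http://minio.internal:9000",
        "s3.access-key-id": "LAKEHOUSE_ACCESS_KEY",
        "s3.secret-access-key": "LAKEHOUSE_SECRET_KEY",
    },
)

# Structured data extracted from scanned documents, stored as an Iceberg table.
table = catalog.load_table("identity.extracted_fields")
df = table.scan(row_filter="document_type == 'passport'").to_pandas()
print(df.head())
```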