I was recently working with a customer on a performance optimization issue when I became aware of an NVMe phenomenon I had never come across, despite spending decades as a chief technology officer and chief data officer. As I dug into it, I learned that it is a bit of a shared secret among the NVMe crowd. The implications are significant, however, so I thought I would take some time to detail them in this post.
The phenomenon has to do with NVMe write amplification and disk capacity. Because of it, even well-meaning and independently verified SSD performance benchmarks can ultimately be misleading, and may leave the enterprise architecture team debugging performance bottlenecks for seemingly ordinary workloads.
At the heart of the matter are the different types of SSDs available. They are not all equal. As a result, careful consideration must be given to the usage and criticality of a given workload when deciding which SSD to deploy. What seems like an economic decision can easily result in a service disruption.
Pertinent information regarding NVMe SSD
SSDs may be attached to the motherboard via a Serial Advanced Technology Attachment (SATA) cable or a PCI Express (PCIe) interface. NVMe stands for Non-Volatile Memory Express: "non-volatile" means the data is retained even when power is removed, and NVMe itself is the protocol used to access that storage directly over the PCIe bus on the motherboard. NVMe drives are substantially faster than their SATA counterparts.
The two key components of an NVMe SSD are the SSD controller and the NAND flash memory used to store data. It is important to note that NAND flash memory has a finite life expectancy: cells wear out after a certain number of writes, and drives exhibit naturally occurring error rates.
The SSD controller allows for multi-channel (parallel) access to the NAND flash memory, enabling fast access to large amounts of data. However, each NAND flash die has its own raw bit error rate (RBER). The SSD controller is responsible for correcting these naturally occurring bit errors, employing many techniques to ensure that data received from the host is checked for integrity as it is written to, and read back from, the NAND storage. Software mechanisms, such as the bit rot protection and erasure coding employed by MinIO, further ensure that data remains protected even in the face of hardware failures.
All NAND flash storage degrades in its ability to reliably store bits over time. Repeated program/erase (P/E) cycles wear out storage cells, producing bad blocks that the SSD controller must remove from circulation. When bad blocks are detected, the controller replaces them with spare blocks held in reserve on the drive, a practice known as Over-Provisioning (OP). Every NVMe SSD reserves some percentage of its NAND flash cells for this purpose and draws on them as bad blocks need to be replaced.
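Over-provisioning is usually quoted as the spare NAND relative to the capacity the drive exposes to the host. A minimal sketch of that arithmetic, using purely illustrative capacities rather than any particular vendor's figures:

```python
def over_provisioning_pct(raw_gb: float, usable_gb: float) -> float:
    """Over-provisioning as commonly quoted: spare NAND capacity
    expressed as a percentage of the user-visible capacity."""
    return (raw_gb - usable_gb) / usable_gb * 100

# A hypothetical drive built from 1024 GB of raw NAND that exposes
# 960 GB to the host reserves roughly 6.7% as spare blocks.
print(round(over_provisioning_pct(1024, 960), 1))  # -> 6.7
```

Enterprise drives aimed at write-heavy workloads simply push this ratio much higher, which is why two drives with identical raw NAND can ship with very different usable capacities.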
The naturally occurring phenomenon of RBER is further compounded by how writes are handled on NVMe. While a magnetic hard drive (HDD) can simply overwrite a sector in place, NAND flash cannot: a block that already holds data must be erased before new data can be written to it. Writes to a brand-new NVMe drive land on pre-erased blocks, so no controller housekeeping slows them down. Once data is already present in a block, however, the old data must be erased before the new data can be written.
NAND is written in pages but erased only in whole blocks, and each block contains many pages. The result is known as a Write Amplification (WA) problem, and write amplification makes predictable performance on an NVMe drive extremely challenging.
In simple terms, modifying data in place on an NVMe drive generates far more background read/write activity than the same write to an empty drive. This is also known as the read/modify/write cycle.
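The read/modify/write cost can be put in rough numbers. The sketch below assumes an illustrative geometry of 256 pages per block; real page and block sizes vary by drive and are not taken from any specific product:

```python
PAGES_PER_BLOCK = 256  # illustrative geometry, varies by drive

def write_amplification(host_pages: int, valid_pages_copied: int) -> float:
    """Write amplification factor (WAF): total pages physically written
    to NAND divided by the pages the host actually asked to write.
    A read/modify/write must carry the block's other valid pages along."""
    nand_pages = host_pages + valid_pages_copied
    return nand_pages / host_pages

# Worst case: rewriting a single page in a block whose other
# 255 pages still hold valid data that must be copied forward.
print(write_amplification(1, PAGES_PER_BLOCK - 1))  # -> 256.0

# Best case: writing a page into a pre-erased block copies nothing.
print(write_amplification(1, 0))  # -> 1.0
```

The gap between those two numbers is exactly why a drive that benchmarks beautifully when empty can crawl once it fills up.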
To combat these performance issues, drive manufacturers use a combination of garbage collection and drive capacity over-provisioning to keep NAND blocks free and available for subsequent write activity. Operating systems have also introduced "trim" functionality, which tells the drive which blocks the filesystem no longer needs, to assist with free block availability.
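Trim matters because garbage collection must relocate every page it still believes is valid before it can erase a victim block. A toy model, assuming a hypothetical controller that treats trimmed pages as dead:

```python
def gc_copy_cost(valid_pages: int, trimmed_pages: int) -> int:
    """Pages the controller must relocate before erasing a victim block.
    Pages the host has trimmed are known-dead and need no copying."""
    return max(valid_pages - trimmed_pages, 0)

# Without trim, filesystem deletions are invisible to the SSD: pages
# holding "deleted" data still look valid and get copied during GC.
print(gc_copy_cost(valid_pages=200, trimmed_pages=0))    # -> 200

# With trim, the same deletions mark 150 of those pages as dead,
# so garbage collection copies far less and amplifies far fewer writes.
print(gc_copy_cost(valid_pages=200, trimmed_pages=150))  # -> 50
```

This is a deliberately simplified model, but it captures why deleting files without trim, as in the customer story below, does not by itself give the drive its performance back.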
We recommend that our NVMe users become more familiar with the concepts of wear leveling, sequential writes and random writes on an NVMe SSD to further understand the topic of write amplification.
Real World Example of Write Amplification
Most analytics workloads for AI/ML use cases require random access to data from across a repository so it can be fed into machine learning models. S3-based object storage is the most common way to access machine learning data in the modern organization, and most modern enterprises use S3 to access data stored in hybrid cloud environments, run complex queries and drive intelligence.
Recently, I was called in to work with a mature deployment that had started to experience performance issues. The client had been running MinIO on NVMe for a couple of years and had not made any material changes to the configuration. It took us quite some time, working with both hardware and software partners, to determine the root cause: the aforementioned write amplification.
What we discovered was that these problems only began to manifest once the drives approached maximum capacity. At that point the client started to experience service outages, which can take many forms, for example analytics queries timing out or web requests not completing within reasonable timeframes.
Specifically, the NVMe drives at or above roughly 80% capacity usage proved to be the culprits. In many cases the enterprise architecture teams deleted data to try to reclaim free space, and after this manual disk usage management, certain services began to time out.
What we found was that drive performance (regardless of hardware vendor) dropped from multiple GB/s to only a couple of hundred MB/s.
After removing MinIO from the equation entirely and reproducing the problem with basic tools such as "dd", we were able to confirm that write amplification on NVMe drives has a very significant impact on deployments when careful capacity analysis is not performed and followed in production.
Confirming our Findings
At this point, we began contacting all of our NVMe partners to discuss this at greater length. All of them were aware of the issue and were quite forthcoming, sharing data and walking us through the problem. We then talked to other MinIO customers on NVMe whose deployments had been running long enough to be affected. They too had seen the associated performance issues, reported with increasing frequency, but did not know the cause.
Here are our recommendations, in conjunction with our HW partners:
- Purchase enterprise-class drives that take advantage of 25% or more over-provisioning. Most enterprise NVMe hardware manufacturers offer a special class of drive for this purpose. Don't try to save money in the short term if you are running mission-critical workloads; buy the right equipment.
- If your workload is write once, read many, you should be fine, but try not to exceed 90% drive utilization. Many things can sting you at that point, including the underlying OS, hardware and application performance.
- If your workload requires random writes, take advantage of the "trim" functionality offered by most modern operating systems, and make sure your hardware offers appropriate over-provisioning and aggressive garbage collection.
- Talk to MinIO, we are here to help.
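As a simple guardrail around the capacity figures above, a monitoring check can flag mount points crossing those thresholds before drives degrade. A minimal sketch using only Python's standard library; the 80%/90% thresholds mirror the figures discussed in this post and should be tuned per fleet:

```python
import shutil

def utilization_pct(path: str) -> float:
    """Percentage of the filesystem backing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def check_drive(path: str, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Classify a mount point against illustrative capacity thresholds."""
    pct = utilization_pct(path)
    if pct >= crit_at:
        return "critical"
    if pct >= warn_at:
        return "warning"
    return "ok"

# Example: check the mount point backing an NVMe drive path.
print(check_drive("/"))
```

Running a check like this per drive path, rather than per server, matters here: write amplification bites drive by drive, and a single nearly full NVMe device can drag down an otherwise healthy cluster.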
We hope you found this post enlightening. The process of discovery certainly was for us, and we are smarter for it.