Time to First Byte and Streaming Media
It wasn’t so long ago that the only streaming media you received at home came over a cable service through a Set-Top Box (STB) or over the air through an antenna. What most folks don’t realize is that both work on the same principle: the content is continually streamed, and the antenna tunes to frequencies over the air while the STB tunes to frequencies over the coaxial cable.
Then the Internet caught up, and now the majority of content is streamed over the Internet. While the exact percentage of Internet traffic consumed by streaming varies from report to report, they all agree on two things: streaming accounts for the majority of Internet traffic, and that share is increasing year over year.
What is Time to First Byte (TTFB) and Why Does it Matter for Internet Video?
Unlike streaming over the air or over cable, which are constant broadcasts to devices authorized to receive the content, Internet video requires the user’s device to request the content; only after the handshake and authorization are complete does the video start to stream.
Time to first byte is the time it takes for the first byte of a video stream to be available on the user’s device. In more practical terms, it is the time it takes for someone to hit play on a streaming video and for the video to actually start playing.
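To make the metric concrete, here is a minimal sketch in Go that measures TTFB for a single HTTP GET using the standard net/http/httptrace package; the URL is a placeholder, not a real asset:
```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	// Placeholder URL; point this at any video manifest or segment.
	url := "https://example.com/video/segment-0001.ts"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	var firstByte time.Time
	trace := &httptrace.ClientTrace{
		// Fired when the first byte of the response arrives.
		GotFirstResponseByte: func() { firstByte = time.Now() },
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Printf("TTFB: %v\n", firstByte.Sub(start))
}
```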
When there was no competition, users were willing to wait 10-15 seconds for content to start streaming over the Internet. Now that the novelty has worn off and the market is rife with competition, a delay of more than a few seconds (or any delay at all) can mean the difference between keeping and losing a customer.
Challenges Faced on the Server Side
Streaming providers face several significant challenges. While edge caches and CDNs improve the user experience by caching the most popular content, they cannot hold everything, and the rest still has to be fetched from the origin servers.
They also have to handle multiple use cases: long-lived video-on-demand (VOD) assets (e.g., movies and shows), short-lived VOD assets (e.g., clips of live sporting events), and DVR assets (videos that users choose to “store”).
Traditional SAN and NAS solutions are great for transactional data like databases, sharing files, etc., but are not designed for streaming content. Between the protocols and the many layers of software that are required to provide scale and high availability, they are the wrong choice for streaming data.
Since many object stores were adapted from the vendor’s file system architecture (rather than a true object storage approach), they require external metadata databases. This is problematic for a number of reasons, which are detailed in this post. One of them is that it rules out strict consistency: because the object and metadata are written separately, the system is prone to data corruption.
The separate metadata store also causes huge performance and scalability issues because of the extra lookup call overhead and the need to scale the object store and metadata store separately.
Content-addressed storage (CAS) and legacy object storage are designed for archival/secondary storage use cases. Prominent examples of these types of storage use Cassandra as their metadata store, which hurts performance and consistency. We have written about this in the past.
The scalability issues become even more acute when there is a high churn in the assets (i.e., videos are added and deleted very often).
MinIO - The Shortest Path to Data
At MinIO, we have built our object store from the ground up to tackle these issues.
Let’s go through the architectural decisions that were made to ensure applications have the shortest possible path to the data they are trying to access.
Go and Go Assembly
MinIO runs as a single process and uses lightweight green threads (called Goroutines) designed for massively concurrent operations.
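As an illustration of the goroutine model (a generic sketch, not MinIO’s internals), a single process can fan out thousands of concurrent tasks with negligible overhead:
```go
package main

import (
	"fmt"
	"sync"
)

// Simulated per-request work.
func serve(id int) {
	if id%1000 == 0 {
		fmt.Println("served request", id)
	}
}

func main() {
	var wg sync.WaitGroup
	// Launch 10,000 goroutines from one process; the Go runtime
	// multiplexes them onto a small pool of OS threads.
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			serve(id)
		}(i)
	}
	wg.Wait()
}
```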
SIMD Optimizations
Modern CPUs have more than enough capability to eliminate the need for external accelerators or custom ASICs. One of their most important features is the ability to execute Single Instruction, Multiple Data (SIMD) instructions, and MinIO takes advantage of these capabilities on Intel, AMD, ARM and PowerPC CPUs.
The result is that operations that are typically CPU-intensive, such as erasure coding, encryption and checksum verification, are now fast and effectively free. There is no need for external accelerators such as GPUs because the bottleneck is no longer the compute layer but the I/O subsystem (network, drives, and system bus).
Not a Gateway to a Traditional File System
MinIO is purpose-built to be an object store. It is not just a gateway or an API proxy to a standard file system like a SAN or NAS.
Most importantly, MinIO utilizes direct-attached storage, i.e., drives attached directly to the servers. There is no additional network or software layer involved in accessing the video assets, and the local PCIe bus is faster than any network path to the drives.
Simple
There are no unnecessary or additional software layers in the MinIO software stack.
The microservices architecture with a native S3 API interface is, at its core, a web server with handlers. This simplicity lends itself to a robust and highly performant platform.
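As a rough illustration of the handler pattern (generic Go, not MinIO’s code), a streaming handler is only a few lines on top of the standard library:
```go
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// One handler per route; net/http serves each connection on its own goroutine.
	http.HandleFunc("/video/", func(w http.ResponseWriter, r *http.Request) {
		// In a real object server the bytes would stream from a drive;
		// a string reader stands in for the object data here.
		io.Copy(w, strings.NewReader("object bytes..."))
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```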
Strict Consistency with No External Metadata Store
As described above, an external metadata store is one of the biggest obstacles to a highly performant and scalable object store.
MinIO is strictly consistent with both the object and the metadata because they are written together as a single operation. Not only is there a guarantee that an object and its metadata have been written when a 200 OK is returned, but there is also no longer the bottleneck of a separate metadata store.
Writing metadata with the object eliminates any potential issues caused by the high churn of assets where high numbers of deletes create big challenges with metadata stores.
MinIO writes metadata in a compressed msgpack format to reduce storage space and improve retrieval speed.
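As a rough sketch of the idea (the ObjectMeta struct and file names are hypothetical, and MinIO’s actual on-disk format is more involved), metadata can be serialized compactly with msgpack and persisted in the same write path as the object data:
```go
package main

import (
	"fmt"
	"os"

	"github.com/vmihailenco/msgpack/v5"
)

// ObjectMeta is a hypothetical metadata record used only for illustration.
type ObjectMeta struct {
	Name    string            `msgpack:"name"`
	Size    int64             `msgpack:"size"`
	ETag    string            `msgpack:"etag"`
	Headers map[string]string `msgpack:"headers"`
}

func main() {
	meta := ObjectMeta{
		Name:    "movies/title.mp4",
		Size:    1 << 30,
		ETag:    "d41d8cd98f00b204e9800998ecf8427e",
		Headers: map[string]string{"Content-Type": "video/mp4"},
	}

	// Serialize the metadata compactly with msgpack...
	buf, err := msgpack.Marshal(&meta)
	if err != nil {
		panic(err)
	}
	fmt.Printf("metadata is %d bytes of msgpack\n", len(buf))

	// ...and persist it in the same write path as the object data,
	// so there is no separate metadata database to keep consistent.
	if err := os.WriteFile("title.meta", buf, 0o644); err != nil {
		panic(err)
	}
}
```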
No Global Locks
One of the most challenging aspects of building a truly distributed system is locking: almost all such systems rely on a global lock.
MinIO has completely eliminated the need for global locks by ensuring that locks are granular and at a per-object level. This way an operation on one object has no effect on any other object(s). Our versioning mechanism ensures that there is no resource contention and therefore no need for a global lock.
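The sketch below illustrates per-object locking with a map of mutexes keyed by object name; it is an illustration of the concept, not MinIO’s lock implementation:
```go
package main

import "sync"

// objectLocker hands out one mutex per object name, so operations on
// different objects never contend. Illustration only, not MinIO's code.
type objectLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newObjectLocker() *objectLocker {
	return &objectLocker{locks: make(map[string]*sync.RWMutex)}
}

// lockFor returns the lock dedicated to a single object.
func (l *objectLocker) lockFor(object string) *sync.RWMutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.locks[object]; !ok {
		l.locks[object] = &sync.RWMutex{}
	}
	return l.locks[object]
}

func main() {
	locker := newObjectLocker()

	// A write to bucket/a and a read of bucket/b proceed independently;
	// there is no global lock for them to fight over.
	a := locker.lockFor("bucket/a")
	b := locker.lockFor("bucket/b")

	a.Lock()
	defer a.Unlock()

	b.RLock()
	defer b.RUnlock()
}
```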
DirectIO
Kernel caching takes a huge toll on latency and memory management. MinIO bypasses the kernel cache to optimize for streaming performance as well as for resiliency and durability.
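On Linux, bypassing the page cache means opening files with the O_DIRECT flag. The sketch below shows the flag in use; the path is a placeholder, and real direct I/O code must also align its buffers to the drive’s sector size:
```go
//go:build linux

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical path to a locally attached drive.
	// O_DIRECT tells the kernel to skip the page cache entirely.
	f, err := os.OpenFile("/mnt/drive1/bucket/object.part", os.O_RDONLY|unix.O_DIRECT, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Direct I/O requires the buffer size and offset to match the drive's
	// sector alignment (4 KiB here); production code also aligns the
	// buffer's memory address.
	buf := make([]byte, 4096)
	n, err := f.Read(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("read", n, "bytes without touching the page cache")
}
```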
Erasure Coding
The final MinIO design pattern that enables low TTFB at scale is Erasure Coding. Together with the stateless nature of the nodes, this not only provides high tolerance for drive and node failures at a reasonable cost but, since the metadata is written with every shard of the object, also allows the entire object to be fetched efficiently starting from any drive.
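As a sketch of how Reed-Solomon erasure coding behaves, using the github.com/klauspost/reedsolomon library (the 8 data + 4 parity layout is assumed here for illustration; MinIO chooses the layout based on the erasure set):
```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 8 data shards + 4 parity shards (layout assumed for illustration).
	enc, err := reedsolomon.New(8, 4)
	if err != nil {
		panic(err)
	}

	object := bytes.Repeat([]byte("video segment data "), 1024)

	// Split the object into 8 data shards, then compute 4 parity shards.
	shards, err := enc.Split(object)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate losing 4 shards (e.g. failed drives); the data is still recoverable.
	shards[0], shards[3], shards[9], shards[11] = nil, nil, nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}

	ok, err := enc.Verify(shards)
	fmt.Println("reconstructed and verified:", ok, err)
}
```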
Optimizations for Streaming Workloads
The previous section details MinIO’s advantages for any workload that requires an object store. MinIO also has optimizations specifically for streaming workloads.
Native Streaming Protocol
HTTP(S) is designed for direct streaming to an end user. MinIO has native support for the S3 interface, and the HTTP protocol lends itself perfectly to streaming video.
Multipart Uploads
Videos are typically large files. MinIO allows for these video files to be broken up into a set of parts that can then be uploaded. This dramatically improves throughput because of the ability to upload parts in parallel. Also, smaller parts allow for quick recovery from network issues, the ability to pause and resume the uploading of parts over time, and more.
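A minimal sketch with the minio-go SDK; the endpoint, credentials, bucket, and file path are placeholders, and FPutObject drives the multipart upload under the hood:
```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Endpoint and credentials are placeholders.
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// FPutObject splits the file into parts and drives the S3 multipart
	// upload; parts are sent in parallel and can be retried individually.
	info, err := client.FPutObject(context.Background(),
		"videos", "movies/title.mp4", "/data/title.mp4",
		minio.PutObjectOptions{
			ContentType: "video/mp4",
			PartSize:    16 * 1024 * 1024, // 16 MiB parts
		})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("uploaded %s (%d bytes)", info.Key, info.Size)
}
```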
Streaming Range-gets
To get fine-grained control over the video stream - for example, to continue from a specific point in the video or to serve a clip - the entire object must not have to be loaded into memory before operations can be performed on it. Such inefficiency would greatly increase TTFB and memory usage. MinIO supports range-gets, where a partial object is available as a stream. Applications can specify the offset, the required length, and even the specific version of the object.
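A minimal range-get sketch with the minio-go SDK (endpoint, credentials, and object names are placeholders), streaming roughly 10 MiB starting at a 64 MiB offset:
```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Endpoint, credentials, bucket and object names are placeholders.
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Request bytes 64 MiB through ~74 MiB of the object as a stream.
	opts := minio.GetObjectOptions{}
	start := int64(64) << 20
	end := start + (10 << 20) - 1
	if err := opts.SetRange(start, end); err != nil {
		log.Fatal(err)
	}
	// opts.VersionID = "..." // optionally pin a specific object version

	obj, err := client.GetObject(context.Background(), "videos", "movies/title.mp4", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Close()

	// Only the requested range is streamed; the rest of the object is never read.
	if _, err := io.Copy(os.Stdout, obj); err != nil {
		log.Fatal(err)
	}
}
```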
Streaming Block Support
To ensure that objects can be read efficiently while streaming, MinIO adds checksums at each block, not just at the object level. This eliminates the need to load the entire object before starting to stream, which also significantly reduces memory utilization.
The checksum is calculated using the HighwayHash algorithm, which is much more efficient than the commonly used CRC32. What’s more, CRC32 is prone to hash collisions and is therefore not reliable enough to protect against silent data corruption. MinIO has optimized HighwayHash to achieve near-memory-speed hashing while retaining its hashing strength.
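A minimal sketch of block-level hashing with the github.com/minio/highwayhash package; the zeroed 32-byte key and the 1 MiB block size are assumptions for illustration:
```go
package main

import (
	"encoding/hex"
	"fmt"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash is keyed; a zeroed 256-bit key is used here for illustration.
	key := make([]byte, 32)

	// One block of object data (1 MiB block size assumed for illustration).
	block := make([]byte, 1<<20)

	h, err := highwayhash.New(key)
	if err != nil {
		panic(err)
	}
	h.Write(block)

	// Because each block carries its own checksum, a stream can be verified
	// block by block without first reading the whole object.
	fmt.Println("block checksum:", hex.EncodeToString(h.Sum(nil)))
}
```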
Object shards are placed deterministically as part of the erasure coding mechanism. This allows for read parallelism that is not possible in a RAID or NAS setup. Again, since every shard carries the full metadata, the entire object can be fetched starting from any drive.
Benchmark Results
We conducted a series of tests to showcase the consistency of performance for streaming workloads.
Test Configuration
The test was done on an 8-node cluster. Please note that these are older systems and the NVMe drives are rated at only about 2 GB/s for sequential reads. Newer NVMe drives deliver from roughly 3.5 GB/s on PCIe 3.0 to 8 GB/s on PCIe 4.0, and PCIe 5.0 drives reach about 12 GB/s but are price-prohibitive at the moment. The implication is that these numbers will only get better.
Each node comprised:
- Memory: 384GB
- Network: 100Gbps
- Drives: 10 x 1 TB NVMe
Result #1: TTFB is very low even at high concurrency rates
The first test we ran measured the effect of high concurrency rates on TTFB. A 1 MiB object size was chosen because that is what streaming providers use to allow for better CDN caching.
The 75th and 90th percentile results, in milliseconds:
| Object Size | Concurrency | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|----------------------|----------------------|
| 1 MiB | 160 | 26 | 35 |
| 1 MiB | 240 | 35 | 46 |
| 1 MiB | 320 | 45 | 56 |
| 1 MiB | 480 | 72 | 90 |
| 1 MiB | 640 | 98 | 118 |
| 1 MiB | 720 | 105 | 126 |
| 1 MiB | 800 | 118 | 142 |
| 1 MiB | 960 | 137 | 163 |
| 1 MiB | 1600 | 205 | 241 |
| 1 MiB | 2400 | 289 | 339 |
| 1 MiB | 4000 | 449 | 525 |
MinIO provides extremely consistent performance under concurrent load. At 4000 concurrent requests, the TTFB remains under 0.5 seconds for the 75th percentile.
Result #2: TTFB for offset reads is independent of object sizes for a given concurrency
This test evaluates the effect of reading from different offsets of objects of different sizes. Here larger files were used to ensure that the seek values were high enough to be significant.
The 75th and 90th percentile results, in milliseconds:
| Concurrency | Object Size | Offset | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|--------|----------------------|----------------------|
| 160 | 64 MiB | 10B -> 100KiB | 14 | 20 |
| 160 | 64 MiB | 100KiB -> 1MiB | 14 | 19 |
| 160 | 64 MiB | 1MiB -> 10MiB | 13 | 19 |
| 160 | 64 MiB | 10MiB -> 64MiB | 15 | 21 |
| 160 | 100 MiB | 10B -> 100KiB | 13 | 20 |
| 160 | 100 MiB | 100KiB -> 10MiB | 13 | 19 |
| 160 | 100 MiB | 10MiB -> 100MiB | 14 | 20 |
The results show that, for a fixed concurrency, the TTFB for offset reads is consistent across object sizes.
Result #3: TTFB for offset reads is independent of offset for a given concurrency
The next test measured the effect of reading from different offsets in an object at increasing concurrency levels. Here, too, larger files were used to ensure that the seek values were high enough to be significant.
The 75th and 90th percentile results, in milliseconds:
| Object Size | Concurrency | Offset | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|--------|----------------------|----------------------|
| 64 MiB | 160 | 10B -> 100KiB | 14 | 20 |
| 64 MiB | 160 | 100KiB -> 1MiB | 14 | 19 |
| 64 MiB | 160 | 1MiB -> 10MiB | 13 | 19 |
| 64 MiB | 160 | 10MiB -> 64MiB | 15 | 21 |
| 64 MiB | 640 | 10B -> 100KiB | 131 | 158 |
| 64 MiB | 640 | 100KiB -> 1MiB | 124 | 144 |
| 64 MiB | 640 | 1MiB -> 10MiB | 124 | 144 |
| 64 MiB | 640 | 10MiB -> 64MiB | 125 | 146 |
| 64 MiB | 800 | 10B -> 100KiB | 174 | 216 |
| 64 MiB | 800 | 100KiB -> 1MiB | 169 | 204 |
| 64 MiB | 800 | 1MiB -> 10MiB | 170 | 206 |
The results show that, for a given concurrency, the TTFB is independent of the offset in the object being read.
Don’t just take our word for it - benchmark it yourself.
With the majority of Internet traffic being streaming, TTFB has emerged as a significant measure of performance. It’s no secret that a slow TTFB degrades customer experience and can lead to significant churn. As this blog post demonstrates, objects served from MinIO enjoy a low TTFB regardless of size, offset and concurrency.
Come talk to us (hello@min.io) or download MinIO and try it yourself. You’ll discover why some of the largest streaming companies in the world standardize their infrastructure on MinIO.