Time to First Byte and Streaming Media

It wasn’t so long ago that the only streaming media you received at home was over a cable service using a Set Top Box (STB) or over the air through an antenna. What most folks don’t realize is that both work on the same principle: the content is continually streamed, and the antenna tunes to frequencies over the air while the STB tunes to frequencies on the coaxial cable.

And then the Internet caught up: now the majority of content is streamed over the Internet. While the exact percentage of Internet traffic consumed by streaming varies from report to report, they all agree on two things: streaming content accounts for the majority of traffic on the Internet, and that share is increasing year over year.

What is Time to First Byte (TTFB) and Why Does it Matter for Internet Video?

Streaming over the air or over cable is a constant broadcast to devices that are authorized to receive the content. Internet video is different: the user’s device must request the content, and only after the handshake and authorization are complete does the video start to stream.

Time to first byte is the time it takes for the first byte of a video stream to be available on the user’s device. In more practical terms, it is the time it takes for someone to hit play on a streaming video and for the video to actually start playing.
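As a quick illustration, TTFB can be measured directly from a client using Go’s standard net/http/httptrace hooks. This is a minimal sketch; the URL is a placeholder for a real video segment.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	// Placeholder URL for a video segment; substitute your own stream.
	req, err := http.NewRequest("GET", "https://example.com/video/segment-1.ts", nil)
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	trace := &httptrace.ClientTrace{
		// Fires when the first byte of the response arrives at the client.
		GotFirstResponseByte: func() {
			fmt.Println("TTFB:", time.Since(start))
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the rest of the stream
}
```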

When there was no competition, users were more than willing to wait for up to 10-15 seconds for the content to start streaming over the Internet. Now that the novelty has worn off and the market is rife with competition, a delay of more than a few seconds (or even any delay) could mean the difference between keeping and losing a customer.

Challenges Faced on the Server Side

Streaming providers face several significant challenges. While edge caches and CDNs improve the user experience by caching the most popular content, they cannot hold everything, and there is still the issue of getting the rest of the content from the origin servers.

They also have to deal with multiple different use cases: long-lived video-on-demand (VOD) assets (e.g., movies and shows), short-lived VOD assets (e.g., clips of live sporting events), and DVR assets (videos that users decide to “store”).

Traditional SAN and NAS solutions are great for transactional data like databases, sharing files, etc., but are not designed for streaming content. Between the protocols and the many layers of software that are required to provide scale and high availability, they are the wrong choice for streaming data.

Since many object stores were adapted from the vendor’s file system architecture (rather than being designed as true object storage), they require external metadata databases. This is problematic for a number of reasons, which are detailed in this post. One of them is that it eliminates the possibility of strict consistency: because the object and its metadata are written separately, the system is prone to data corruption.

The separate metadata store also causes huge performance and scalability issues because of the extra lookup call overhead and the need to scale the object store and metadata store separately.

Content-addressed storage (CAS) and legacy object storage are designed for archival/secondary storage use cases. Prominent examples of these types of storage use Cassandra as their metadata store, which hurts performance and consistency. We have written about this in the past.

The scalability issues become even more acute when there is a high churn in the assets (i.e., videos are added and deleted very often).

MinIO - The Shortest Path to Data

At MinIO, we have built our object store from the ground up to tackle these issues.

Let’s go through the architectural decisions that give applications the shortest possible path to the data they are trying to access.

Go and Go Assembly

MinIO is written in Go and runs as a single process built on lightweight green threads (Goroutines), designed for massively concurrent operations. Performance-critical routines are further hand-optimized in Go assembly.
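As a rough illustration (not MinIO code), the sketch below fans out tens of thousands of concurrent handlers as Goroutines, something that would be prohibitively expensive with one OS thread per request:

```go
package main

import (
	"fmt"
	"sync"
)

// handle simulates serving one streaming request; in a real server each
// client connection would be served by its own goroutine.
func handle(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	_ = id // ... read object bytes and write them to the client ...
}

func main() {
	var wg sync.WaitGroup
	// Goroutines start with only a few KB of stack, so tens of thousands
	// of concurrent handlers are practical within a single process.
	for i := 0; i < 50000; i++ {
		wg.Add(1)
		go handle(i, &wg)
	}
	wg.Wait()
	fmt.Println("all requests handled")
}
```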

SIMD Optimizations

Modern CPUs have more than enough capability to eliminate the need for external accelerators or custom ASICs. One of their most important features is the ability to execute Single Instruction, Multiple Data (SIMD) instructions. MinIO takes advantage of these capabilities on Intel, AMD, ARM and PowerPC CPUs.

The result is that operations that are typically CPU-intensive like Erasure Coding, Encryption, Checksum verification, etc. are now fast and free. There is no need for external accelerators, for example GPUs, because the bottleneck is no longer at the compute layer, but is dependent on the I/O subsystem (network, drive, and system bus).

Not a Gateway to a Traditional File System

MinIO is purpose-built to be an object store. It is not just a gateway or an API proxy to a standard file system like a SAN or NAS.

Most importantly, MinIO utilizes direct-attached storage, i.e., drives attached directly to the server. There is no additional network or software layer involved in accessing the video assets. The local PCIe bus is faster than any network access to drives.

Simple

There are no unnecessary or additional software layers in the MinIO software stack.

The microservices architecture with a native S3 API interface is simply a web server with handlers. This simplicity lends itself to a robust and highly performant platform.
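Conceptually, that looks something like the sketch below, a plain HTTP server with a handler that streams object bytes. This is an illustration only, not MinIO’s actual handler code.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// GET /<bucket>/<object> — conceptually, an S3 GetObject is just an
	// HTTP handler that streams bytes back to the caller.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// ... look up the object for r.URL.Path and stream it to w ...
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9000", mux))
}
```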

Strict Consistency with No External Metadata Store

As described above, an external metadata store is one of the biggest bottlenecks to a highly performant and scalable object store.

MinIO is strictly consistent with both the object and the metadata because they are written together as a single operation. Not only is there a guarantee that an object and metadata are written when a 200 OK is returned, but there is also no longer the bottleneck of a separate metadata store.

Writing metadata with the object also eliminates the issues caused by high asset churn, where large numbers of deletes create big challenges for external metadata stores.

MinIO writes metadata in a compressed msgpack format to reduce storage space and improve retrieval speed.
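As an illustration of the idea (using a generic msgpack library and a made-up metadata struct, not MinIO’s internal format), packing metadata as msgpack keeps it compact and fast to decode:

```go
package main

import (
	"fmt"
	"log"

	"github.com/vmihailenco/msgpack/v5"
)

// ObjectMeta is a hypothetical metadata record, not MinIO's actual schema.
type ObjectMeta struct {
	Size     int64             `msgpack:"size"`
	ETag     string            `msgpack:"etag"`
	ModTime  int64             `msgpack:"modTime"`
	UserTags map[string]string `msgpack:"userTags"`
}

func main() {
	meta := ObjectMeta{
		Size: 1 << 20, ETag: "abc123", ModTime: 1700000000,
		UserTags: map[string]string{"title": "clip-42"},
	}

	// Encode: the binary msgpack form is far smaller than equivalent JSON.
	b, err := msgpack.Marshal(&meta)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("encoded bytes:", len(b))

	// Decode back.
	var out ObjectMeta
	if err := msgpack.Unmarshal(b, &out); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", out)
}
```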

No Global Locks

One of the most challenging parts of creating a truly distributed system is coordination: almost all such systems end up requiring a global lock.

MinIO has completely eliminated the need for global locks by ensuring that locks are granular and at a per-object level. This way an operation on one object has no effect on any other object(s). Our versioning mechanism ensures that there is no resource contention and therefore no need for a global lock.
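A minimal sketch of the idea, not MinIO’s actual locking implementation: keep one lock per object key so that operations on different objects never contend.

```go
package main

import "sync"

// keyLocker hands out one mutex per object key, so writes to different
// objects proceed in parallel and only same-key operations serialize.
type keyLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyLocker() *keyLocker {
	return &keyLocker{locks: make(map[string]*sync.Mutex)}
}

// lock returns the per-key mutex, locked; callers must Unlock it when done.
func (k *keyLocker) lock(key string) *sync.Mutex {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()
	l.Lock()
	return l
}

func main() {
	locker := newKeyLocker()
	var wg sync.WaitGroup
	for _, key := range []string{"bucket/a.ts", "bucket/b.ts", "bucket/a.ts"} {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			l := locker.lock(key)
			defer l.Unlock()
			// ... mutate only this object's data and metadata ...
		}(key)
	}
	wg.Wait()
}
```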

DirectIO

Kernel caching takes a huge toll on latency and memory management. MinIO bypasses kernel caching to optimize for both streaming performance as well as resiliency and durability.
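For illustration, here is a minimal sketch of a page-cache-bypassing read on Linux using the third-party ncw/directio package; the file path is a placeholder and MinIO’s own I/O path is more involved.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/ncw/directio"
)

func main() {
	// O_DIRECT requires block-aligned buffers and offsets; the helper
	// package opens the file with O_DIRECT and allocates aligned buffers.
	f, err := directio.OpenFile("/mnt/drive1/segment-1.ts", os.O_RDONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	block := directio.AlignedBlock(directio.BlockSize)
	n, err := io.ReadFull(f, block)
	if err != nil && err != io.ErrUnexpectedEOF {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes without going through the kernel page cache\n", n)
}
```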

Erasure Coding

The final MinIO design pattern that enables low TTFB at scale is Erasure Coding. Along with the stateless nature of the nodes, not only does this provide high tolerance for drive and node failure at a reasonable cost, but, since the metadata is written with every shard of the object, the entire object can be efficiently fetched from any drive.
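As a minimal sketch of how erasure coding behaves, the example below uses the SIMD-accelerated klauspost/reedsolomon library with an illustrative 8 data + 4 parity layout: the object survives the loss of any four shards and can be rebuilt from the rest.

```go
package main

import (
	"bytes"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// Example layout only: 8 data shards + 4 parity shards tolerates the
	// loss of any 4 shards (drives) while the object stays readable.
	enc, err := reedsolomon.New(8, 4)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("video-bytes "), 100000)

	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two drives, then rebuild the missing shards.
	shards[0], shards[5] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("all shards verified:", ok)
}
```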

Optimizations for Streaming Workloads

The previous section details MinIO’s advantages for any workloads that require an object store. MinIO also has optimizations specifically for streaming workloads.

Native Streaming Protocol

HTTP(S) is well suited to streaming directly to an end user. MinIO natively supports the S3 interface, and the HTTP protocol lends itself perfectly to streaming video.

Multipart Uploads

Videos are typically large files. MinIO allows for these video files to be broken up into a set of parts that can then be uploaded. This dramatically improves throughput because of the ability to upload parts in parallel. Also, smaller parts allow for quick recovery from network issues, the ability to pause and resume the uploading of parts over time, and more.
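For example, using the minio-go SDK (the endpoint, credentials, bucket and file names below are placeholders), setting a part size causes a large upload to be performed as a multipart upload under the hood:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	file, err := os.Open("movie.mp4")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	stat, err := file.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// With a part size set, the SDK performs a multipart upload under the
	// hood, so a failed part can be retried without resending the whole file.
	info, err := client.PutObject(context.Background(), "videos", "movie.mp4",
		file, stat.Size(), minio.PutObjectOptions{
			ContentType: "video/mp4",
			PartSize:    64 << 20, // 64 MiB parts
		})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded", info.Key, "size", info.Size)
}
```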

Streaming Range-gets

In order to get fine-grained control over the video stream - for example, to continue from a specific point in the video or to provide a clip - it is important that the entire object does not have to be loaded into memory before the operation is performed. Such inefficiencies would greatly increase the TTFB and the amount of memory used. MinIO supports range-gets, where a partial object is delivered as a stream. Applications can provide the offset, the required length and even the specific version of the object.
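A minimal sketch of a range-get with the minio-go SDK (the endpoint, credentials, bucket and object names are placeholders); only the requested byte range is streamed back:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Ask for bytes 10 MiB through 20 MiB of the object; only this range
	// is streamed, never the whole object.
	opts := minio.GetObjectOptions{}
	if err := opts.SetRange(10<<20, 20<<20); err != nil {
		log.Fatal(err)
	}
	// opts.VersionID = "..." // optionally pin a specific object version

	obj, err := client.GetObject(context.Background(), "videos", "movie.mp4", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Close()

	if _, err := io.Copy(os.Stdout, obj); err != nil {
		log.Fatal(err)
	}
}
```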

Streaming Block Support

To ensure that objects can be read efficiently while streaming, MinIO adds checksums at each block and not simply at the object level. This eliminates the need to load the entire object before starting to stream. This also significantly reduces memory utilization since the entire object does not have to be loaded into memory.

The checksum is calculated using the HighwayHash algorithm, which is much more efficient than the commonly used CRC32 method. What’s more, CRC32 is prone to hash collisions and is therefore not reliable enough to protect against silent data corruption. MinIO has optimized HighwayHash to achieve near memory-speed hashing while retaining its hashing strength.
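Here is a rough sketch of per-block checksumming with the minio/highwayhash package; the zero key and 64 KiB block size are illustrative choices, not MinIO’s internal parameters.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 32-byte key; a fixed zero key is used here
	// purely for illustration.
	key := make([]byte, 32)

	data := bytes.Repeat([]byte("streaming-block "), 4096)
	const blockSize = 64 << 10 // hash in 64 KiB blocks, not the whole object

	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		h, err := highwayhash.New(key)
		if err != nil {
			log.Fatal(err)
		}
		h.Write(data[off:end])
		// Each block carries its own checksum, so a corrupt block is
		// detected without reading or hashing the rest of the object.
		fmt.Printf("block@%d %s\n", off, hex.EncodeToString(h.Sum(nil)))
	}
}
```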

Object shards are placed deterministically as part of the Erasure Coding mechanism. This allows for read parallelism that is not possible in a RAID or NAS setup. Again, since every shard carries the complete metadata, the entire object can be fetched starting from any drive.

Benchmark Results

We conducted a series of tests to showcase the consistency of performance for streaming workloads.

Test Configuration

The test was done on an 8-node cluster. Please note that these are older systems and the NVMe drives are only rated at about 2 GB/s sequential reads. Newer NVMe drives deliver between 3.5 GB/s on PCIe 3.0 and 8 GB/s on PCIe 4.0, while PCIe 5.0 drives reach roughly 12 GB/s but are cost prohibitive at the moment. The implication is that these numbers will only get better.

Each node comprised:

  • Memory: 384GB
  • Network: 100Gbps
  • 10 x 1TB NVMe drives

Result #1: TTFB is very low even at high concurrency rates

The first test we ran measured the effect of high concurrency rates on TTFB. A 1 MiB object size was chosen because that is what streaming providers use to allow for better CDN caching.

The results for the 75th and 90th percentiles are in milliseconds:

| Object Size | Concurrency | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|----------------------|----------------------|
| 1 MiB       | 160         | 26                   | 35                   |
| 1 MiB       | 240         | 35                   | 46                   |
| 1 MiB       | 320         | 45                   | 56                   |
| 1 MiB       | 480         | 72                   | 90                   |
| 1 MiB       | 640         | 98                   | 118                  |
| 1 MiB       | 720         | 105                  | 126                  |
| 1 MiB       | 800         | 118                  | 142                  |
| 1 MiB       | 960         | 137                  | 163                  |
| 1 MiB       | 1600        | 205                  | 241                  |
| 1 MiB       | 2400        | 289                  | 339                  |
| 1 MiB       | 4000        | 449                  | 525                  |

MinIO provides extremely consistent performance under concurrent load. At 4000 concurrent requests, the TTFB remains under 0.5 seconds for the 75th percentile.

Result #2: TTFB for offset reads is independent of object sizes for a given concurrency

This test evaluates the effect of reading from different offsets of objects of different sizes. Here larger files were used to ensure that the seek values were high enough to be significant.

The results for the 75th and 90th percentiles are in milliseconds:

| Concurrency | Object Size | Offset           | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|------------------|----------------------|----------------------|
| 160         | 64 MiB      | 10B -> 100KiB    | 14                   | 20                   |
| 160         | 64 MiB      | 100KiB -> 1MiB   | 14                   | 19                   |
| 160         | 64 MiB      | 1MiB -> 10MiB    | 13                   | 19                   |
| 160         | 64 MiB      | 10MiB -> 64MiB   | 15                   | 21                   |
| 160         | 100 MiB     | 10B -> 100KiB    | 13                   | 20                   |
| 160         | 100 MiB     | 100KiB -> 10MiB  | 13                   | 19                   |
| 160         | 100 MiB     | 10MiB -> 100MiB  | 14                   | 20                   |

The results show that, for a fixed concurrency, the TTFB for offset reads is consistent across object sizes.

Result #3: TTFB for offset reads is consistent for a given concurrency

The next test measured the effect of reading from different offsets in an object at increasing concurrency levels. Here too, larger files were used to ensure that the seek values were high enough to be significant.

The results for the 75th and 90th percentiles are in milliseconds:

| Object Size | Concurrency | Offset          | 75th Percentile (ms) | 90th Percentile (ms) |
|-------------|-------------|-----------------|----------------------|----------------------|
| 64 MiB      | 160         | 10B -> 100KiB   | 14                   | 20                   |
| 64 MiB      | 160         | 100KiB -> 1MiB  | 14                   | 19                   |
| 64 MiB      | 160         | 1MiB -> 10MiB   | 13                   | 19                   |
| 64 MiB      | 160         | 10MiB -> 64MiB  | 15                   | 21                   |
| 64 MiB      | 640         | 10B -> 100KiB   | 131                  | 158                  |
| 64 MiB      | 640         | 100KiB -> 1MiB  | 124                  | 144                  |
| 64 MiB      | 640         | 1MiB -> 10MiB   | 124                  | 144                  |
| 64 MiB      | 640         | 10MiB -> 64MiB  | 125                  | 146                  |
| 64 MiB      | 800         | 10B -> 100KiB   | 174                  | 216                  |
| 64 MiB      | 800         | 100KiB -> 1MiB  | 169                  | 204                  |
| 64 MiB      | 800         | 1MiB -> 10MiB   | 170                  | 206                  |

The results show that, for a given concurrency, the TTFB is independent of the offset within the object being read.

Don’t just take our word for it - benchmark it yourself.

With the majority of Internet traffic being streaming, TTFB has emerged as a significant measure of performance. It’s no secret that a slow TTFB degrades customer experience and can lead to significant churn. As this blog post demonstrates, objects served from MinIO enjoy a low TTFB regardless of size, offset and concurrency.
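For a quick, do-it-yourself measurement before running a full benchmark suite, the sketch below (the endpoint, credentials, bucket and object names are placeholders) fires concurrent GETs at an S3-compatible endpoint via the minio-go SDK and reports approximate TTFB percentiles:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sort"
	"sync"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	const concurrency = 160
	latencies := make([]time.Duration, concurrency)
	var wg sync.WaitGroup

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := time.Now()
			obj, err := client.GetObject(context.Background(),
				"my-bucket", "segment-1.ts", minio.GetObjectOptions{})
			if err != nil {
				log.Fatal(err)
			}
			defer obj.Close()
			buf := make([]byte, 1) // read a single byte to record TTFB
			if _, err := obj.Read(buf); err != nil {
				log.Fatal(err)
			}
			latencies[i] = time.Since(start)
		}(i)
	}
	wg.Wait()

	// Approximate percentiles over the sorted samples.
	sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
	fmt.Println("p75:", latencies[concurrency*75/100])
	fmt.Println("p90:", latencies[concurrency*90/100])
}
```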

Come talk to us (hello@min.io) or download MinIO and try it yourself. You’ll discover why some of the largest streaming companies in the world standardize their infrastructure on MinIO.
