Depending on your threat model, combining compression and encryption can introduce security issues. If you want to compress data before encrypting it, you should ensure that you don’t leak information through a compression-ratio side channel. If you’re not sure whether you’re leaking information through compression, do not compress your plaintext data before encrypting it.
As an S3-compatible object store, Minio offers you an S3 API for storing your data. Beyond that, we try to provide the features you would expect from an enterprise-grade storage system, such as secure data encryption and data compression. However, combining features can sometimes go horribly wrong in unexpected ways. In this post, I’m going to describe when and why combining compression and encryption is not a good idea, in which cases it may still be okay, and why we decided not to compress encrypted data.
First, we have to define what our threat model looks like and what we expect from encryption. Our basic model looks like this:
Here a client uploads and downloads objects to and from the minio server through the SSE-C S3 API over a TLS/HTTPS connection. The server takes the client-provided key and the object and stores an encrypted object on its storage backend. Now the attacker enters the stage. For now, we assume that our attacker can observe the network traffic between the client and the server as well as between the server and the storage backend (if any) and has full access to the object data at the storage backend. We call this attacker passive because she does not interact with the client or influence any data sent from the client to the server. However, the attacker has no access to the client’s or the server’s internal memory and knows neither the plaintext object nor the SSE-C client key.
Now we specify what we expect from encryption. First, as long as you don’t know the encryption key you should not be able to learn anything from an encrypted object except its size and the algorithm used to encrypt it. This means in particular that an encrypted object does not leak any information about its plaintext. Second, we expect from a proper encryption scheme that an attacker cannot modify the encrypted object without us detecting this during decryption. We’re achieving this by using modern authenticated encryption schemes.
After defining our assumptions about the overall system, I’m going to show two different types of attacks that use the information leaked through a compression-ratio side channel. Both attacks enable the attacker to learn something about the encrypted objects which she wouldn’t be able to learn without compression, and both violate our security assumptions.
Detecting unequal but equal-sized objects
The first attack uses compression to determine whether two objects of the same size contain different plaintext data. For this, we assume that the server compresses all objects with a deterministic compression algorithm: given the same input data, the compression always produces the same compressed output, which is the case for all compression algorithms I’m aware of.
Now imagine the client uploads two objects to the server. Both objects have the same size, for example 5 MB. The attacker who observes the network traffic cannot learn the content of either object because of TLS, but she can approximate or even determine the (plaintext) object size by analyzing the traffic. The server then compresses and encrypts each object and stores the compressed and encrypted data at the storage backend. Now there are two possible cases:
- Both compressed & encrypted objects have the same size. This means that their plaintext data was equal or that the compression of two different plaintext objects produced compressed data of the same size. So we cannot make any statement about the objects — they are either equal or different, the attacker does not know.
- The sizes of the two compressed & encrypted objects differ. Then the attacker knows immediately that the plaintexts of the two objects were different. If they were equal, then the compression algorithm would compress them to data of the same size, since we know that the plaintext sizes were equal and the compression is deterministic. So by contradiction, the object plaintexts must be different.
But this already breaks our security. An attacker is able to learn something about (some) encrypted objects, namely that their plaintexts differ, without breaking the encryption scheme. To do so, the attacker uses an additional piece of information: the difference between the plaintext size and the compressed size of an object. This is the information leaked through the compression-ratio side channel we created by using compression.
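The two cases above can be illustrated with a minimal sketch. It is not Minio code: Python’s zlib stands in for the server’s compressor, and the encryption step is modelled only by a hypothetical constant overhead (TAG_OVERHEAD) added to the compressed length, on the assumption that the authenticated cipher preserves length up to a constant. The function names are made up for illustration.

```python
import zlib

# Assumption: the cipher adds a constant number of bytes (e.g. an auth tag)
# and otherwise preserves length, so the stored size reveals the compressed size.
TAG_OVERHEAD = 16  # hypothetical value


def stored_size(plaintext: bytes) -> int:
    """Size of the compressed-then-encrypted object at the storage backend."""
    return len(zlib.compress(plaintext)) + TAG_OVERHEAD


def provably_different(size_a: int, size_b: int, stored_a: int, stored_b: int) -> bool:
    """Passive attacker's test.

    True  -> equal plaintext sizes but different stored sizes:
             the plaintexts certainly differ (deterministic compression).
    False -> the attacker cannot tell.
    """
    return size_a == size_b and stored_a != stored_b


# Two objects of the same plaintext size with very different redundancy.
obj_a = b"A" * 4096
obj_b = bytes(range(256)) * 16  # also 4096 bytes

# The attacker only sees sizes: plaintext sizes from traffic analysis,
# stored sizes from the backend.
leak = provably_different(len(obj_a), len(obj_b), stored_size(obj_a), stored_size(obj_b))
print("attacker can prove the objects differ:", leak)
```

If the stored sizes happen to match, the function returns False, which corresponds to the first case: the attacker learns nothing.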
So we can see already that compression before encryption weakens the security in our model. However, someone might argue that this attack has hardly any impact for real-world scenarios. Distinguishing whether some encrypted objects are equal or not wouldn’t make any difference most of the time, right?
Well, so far we only considered a passive attacker who only observes network traffic and does not interact with the S3 client or the minio server. But by slightly changing our setup we introduce an active attacker who can run a more powerful attack.
In this setup, the S3 client provides a service to users, including our attacker, who may be a malicious user. The client takes some input from the user, combines the user’s data with some data which is only known by the client (not the users) and stores the combined data as a compressed and encrypted object at the server. So we changed our model such that the attacker has control over some parts of the object’s plaintext by interacting with the S3 client. Of course, this is just a minimal example. Many variations of this demo setup are possible which reflect real-world use cases more closely. The important point here is that the attacker can influence some parts of compressed and encrypted objects in addition to her monitoring capabilities.
Full/partial plaintext recovery
The second attack demonstrates how an active attacker can recover some or even the entire plaintext of an encrypted object by exploiting the compression-ratio side channel. So the attacker’s goal is to learn something about the part of the plaintext object that she does not control.
For simplicity let’s assume that users, and also our attacker, can upload an arbitrary amount of data to the S3 client, which appends this data to its own 64-byte string and uploads the combined data as an object to the server. Further, we assume that the client’s data consists of the same byte repeated, for example 64 times 0xA0. However, our attacker has no information about the client’s data in advance. So if the server compresses the object, for example using the snappy algorithm, and then encrypts it, the attacker can launch the following attack to recover the entire 64-byte client string:
- The attacker uploads no (empty) data to the S3 client. This causes the client to create a compressed and encrypted object containing only the client’s data. In our example, the client sends 64 x 0xA0 as the object payload to the server over a secure TLS connection. Then the server compresses that data to 6 bytes using snappy and encrypts these 6 bytes before writing them to the storage backend.
- Through network traffic analysis the attacker learns that the object’s plaintext is 64 bytes long, even though she has no clue about its content because of TLS. Further, she can see from the encrypted object that the ciphertext is 6 bytes long. Now the attacker does some experiments with snappy and concludes that if snappy can compress a 64-byte string to 6 bytes, then the string must consist of the same byte repeated 64 times. So the compression already revealed that the client data contains 64 times the same byte. However, our attacker does not know which one.
- Now the attacker uses the S3 client service again and uploads 1 byte of data, for example 0x00. Again the client appends this data to its own and lets the server create another compressed and encrypted object. Our attacker will observe that the object payload is now 65 bytes long and that the object at the backend contains 7 encrypted bytes. She can now conclude that snappy was not able to compress her additional byte (0x00) into the client’s data; otherwise, the object would only contain 6 encrypted bytes. She repeats this third step while incrementing her byte (next she uses 0x01, 0x02, …) until she sees an object which contains 6 encrypted bytes at the backend. As soon as she observes such an object, she knows that snappy was able to compress her byte into the client’s data and, consequently, that her byte is equal to all bytes in the client’s 64-byte string.
That way the attacker recovered the entire content of the object without breaking the encryption scheme, only by using the information leaked through the compression-ratio side channel. Of course, this attack only works that nicely because I’ve made some simplifying assumptions, such as the client’s data consisting of a single repeated byte, in particular 0xA0. Nevertheless, this attack works in many practical scenarios too, but our attacker may have to make more attempts, use more sophisticated methods, and may only recover parts of the plaintext. For example, the CRIME attack against TLS is an advanced version of this attack and can be used to steal HTTPS cookies if TLS compression is turned on.
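The byte-guessing loop of this attack can be sketched in a few lines. This is a simplified model under stated assumptions, not the real system: zlib replaces snappy (so the exact byte counts differ from the 6 and 7 bytes quoted above), encryption is modelled as a hypothetical constant overhead on the compressed length, and SECRET, backend_size and recover_repeated_byte are illustrative names.

```python
import zlib

# Assumption: encryption preserves length up to a constant overhead.
TAG_OVERHEAD = 16  # hypothetical value

# The client's data, unknown to the attacker: 64 times the same byte.
SECRET = bytes([0xA0]) * 64


def backend_size(user_data: bytes) -> int:
    """What the attacker observes at the backend for one upload.

    The client appends the user's data to its own string; the server
    compresses the result and then encrypts it.
    """
    return len(zlib.compress(SECRET + user_data)) + TAG_OVERHEAD


def recover_repeated_byte() -> int:
    """Try every possible byte and keep the one that compresses best.

    A guess equal to the secret byte simply extends the existing run,
    so it yields the smallest stored object.
    """
    sizes = {guess: backend_size(bytes([guess])) for guess in range(256)}
    return min(sizes, key=sizes.get)


recovered = recover_repeated_byte()
print(f"recovered byte: 0x{recovered:02X}")
```

Note that the attacker in this sketch never uses the absolute sizes. She only compares stored sizes across her own guesses, which is why the same idea carries over to other compressors.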
Can we fix this?
To eliminate this kind of attack completely, we have to disable compression for encrypted content. More precisely, we must not compress plaintext data before encrypting it. Compressing the data after encrypting it would be secure but also not very effective. A secure encryption scheme produces ciphertexts which are indistinguishable from a truly random bit string, so a compressor finds no redundancy in the data to exploit. So no compression algorithm should be able to compress the output of a secure encryption scheme significantly.
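A quick experiment shows why compressing after encryption gains nothing. As an assumption for this sketch, os.urandom output stands in for the ciphertext of a secure encryption scheme (both should look random to a compressor), and zlib stands in for whatever compressor would be used; compression_gain is an illustrative helper.

```python
import os
import zlib


def compression_gain(data: bytes) -> int:
    """Bytes saved by compression; zero or negative means no benefit."""
    return len(data) - len(zlib.compress(data))


# Highly redundant plaintext compresses very well.
redundant_plaintext = b"minio" * 1000

# Stand-in for ciphertext: random bytes of the same length (assumption).
ciphertext_like = os.urandom(len(redundant_plaintext))

print("gain on plaintext:      ", compression_gain(redundant_plaintext))
print("gain on ciphertext-like:", compression_gain(ciphertext_like))
```

On the random input the compressed output is typically a few bytes larger than the input because of the container overhead.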
So is compression + encryption completely broken?
Well, it depends on your threat model. For a passive adversary, it is sufficient to let the S3 client, rather than the minio server, compress the object data. Then the passive attacker cannot observe the original plaintext size through network analysis and cannot take advantage of a compression-ratio side channel, because there is none. In other words, the compression ratio (the relation between the original data size and the compressed data size) must not be revealed to the attacker.
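The difference between the two deployments can be made concrete by writing down what a passive observer sees in each. The following sketch is a model only: zlib stands in for the compressor, TLS and encryption overheads are ignored, and passive_view is a hypothetical helper, not part of any Minio or S3 API.

```python
import zlib


def passive_view(plaintext: bytes, compress_at: str) -> tuple:
    """Return (size on the wire, size at the backend) as seen by a passive attacker.

    compress_at="server": the client sends the plaintext-sized payload and the
                          server compresses it, so both sizes are visible.
    compress_at="client": the client compresses first, so only the compressed
                          size ever leaves the client.
    """
    compressed = zlib.compress(plaintext)
    if compress_at == "server":
        return len(plaintext), len(compressed)
    if compress_at == "client":
        return len(compressed), len(compressed)
    raise ValueError("compress_at must be 'server' or 'client'")


data = b"A" * 4096
print("server-side compression:", passive_view(data, "server"))
print("client-side compression:", passive_view(data, "client"))
```

With server-side compression the two numbers differ and their relation is the compression ratio. With client-side compression the observer sees the same number twice and never learns the original plaintext size.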
For an active attacker, the situation is a bit more complicated. Since she has some control over the data before it is compressed and encrypted, she may still be able to learn something about the remaining plaintext even though she cannot observe the original plaintext size. For example, she can always observe how her input influences the overall compressed size. By varying the part of the data she controls, she may be able to extract some information about the unknown part. So it is not possible to completely prevent compression-ratio attacks in the active adversary model. They might not be effective enough to give an attacker a significant advantage, but determining this requires an analysis of the specific scenario.
How Minio will handle compression and encryption
We at Minio are trying our best to offer you strong security guarantees for data availability/integrity using erasure coding as well as confidentiality and authenticity using authenticated encryption. Therefore we are not going to compress any data which should be encrypted at the minio server. Compressing encrypted data just doesn’t make sense, and compressing plaintext data before encrypting it would compromise your security. So if you want to compress encrypted objects, you have to do it at your S3 client before uploading them to the server.
We hope that this post is not just useful for our users but also helps other teams & projects to find the right approach for their scenarios. Of course, feedback or questions are always welcome!