Naming encrypted files in the cloud, without leaking info about their content

Question

I'm working on a photo sharing app that stores files in the cloud, encrypted, and lets you share those files with others. To do that, as suggested in answers to this previous question:

files are encrypted locally with a (random) symmetric key, then uploaded
then this first key is encrypted for each recipient using a shared key, derived from the recipient's public key via a key exchange protocol (Diffie-Hellman). Those encrypted recipient keys are stored in the cloud alongside the file.

So far, I've named the stored files after their hash (the plaintext's hash). That lets anyone quickly check whether they've already uploaded/downloaded a file. But it also lets an attacker quickly check whether you've uploaded a specific forbidden/sensitive file.

Is there a way to name files / choose their location that would solve this problem?
And still satisfy these properties:

Given an uploader's public key, two identical files will upload to the same name/location (deduplication).
The uploader can quickly find a previously uploaded file (given the bytes of the file) and avoid re-uploading.
Any recipient can also quickly find a file they have access to, and quickly check if they've already downloaded it (to avoid re-downloading).
Given the bytes of a file to check, an attacker cannot quickly check if an uploader has uploaded this specific file.

p.s.

I'm working with existing cloud storage, without any centrally managed users database or server hosted by me, if possible. Essentially I'm treating this cloud storage as completely public storage, accessible by anyone.

Here's what I attempted:

using the ciphertext's hash as filename (without keeping records of previous uploads)
In that case, since encryption keys are random, the ciphertext hash changes each time the same file is encrypted. So I lose the ability to quickly check if a file is already uploaded (unless I keep track of previous uploads).

Here's what I didn't attempt yet:

using the ciphertext's hash as filename (and keeping records of previous uploads). This could work, but it looks like the recipients would have to keep records of the downloaded files too (for the property "the recipient can quickly check if they've already downloaded a file"). It gets trickier when a user loses these records and needs to rebuild the index: it seems they'd have to re-download every file and match it against their local files.
using the ciphertext's hash as filename, and using deterministic keys instead of random ones (keys could be derived from the file bytes and the uploader's pubkey, for instance). Would this be viable, given that the keys are not random anymore? Also, the recipient would have to keep records of downloaded files, right?

wrt `But that also lets an attacker quickly check if you've uploaded a specific forbidden/sensible file.` - You might want to consider designing the system such that it does not disclose to the user whether or not the file exists, unless authorization to access this file has been granted to this user's public key, and this user proves that he is in possession of the private key associated with this public key. — mti2935, Oct 16 '21 at 15:07
I've edited the question to mention I'm working with public cloud storage, with no user authorization server if possible. Does that exclude your suggestion? Or is it still possible and I simply don't see how? — Nicolas Marshall, Oct 16 '21 at 16:41
Thank you for clarifying the question. WRT, `I've stored files named according to their hash` - are you naming the files according to the hash of the plaintext, or according to the hash of the ciphertext? Also, see https://security.stackexchange.com/questions/256033/authenticationless-end-to-end-encrypted-server for some things to consider when making the encrypted files publicly available. — mti2935, Oct 16 '21 at 18:57
I meant the plaintext hash, I've edited the question to be clearer. I first tried with the ciphertext hash, only to realise the uploads were named differently every time (for the same source file). And that I would need to keep local records of uploaded files for this to work. — Nicolas Marshall, Oct 23 '21 at 18:10
Also, thanks for the link. The first storage I'm using is S3-compatible, so has some form of access control (managed by a company, not me). But is not fine grained with my own rules per user. I was basically planning to share access keys to a given photo album between multiple users. Not very secure. And I'd love to make this also work on unprotected ftp servers in the future. — Nicolas Marshall, Oct 23 '21 at 18:14
Thanks for the clarification. In that case, how does the server get the plaintext hash, if all it sees is the ciphertext? Does the system rely on the user to provide the correct plaintext hash when the encrypted file is uploaded? — mti2935, Oct 23 '21 at 18:19
yes. The filename is chosen client-side with S3, and for now, the program sets it to be the plaintext's hash. — Nicolas Marshall, Oct 23 '21 at 18:22

mti2935 · Accepted Answer · 2021-10-24T01:15:10.960

Thanks for taking the time to work out some of these details in the comments, and for editing the question to append the additional information. It seems that ideally, the server should not know any information about the plaintext file - including the hash of the plaintext, the original filename, etc.

You might want to consider a scheme where the encrypted file is stored in the S3 bucket under some random filename (e.g. randomfilename.enc). Alongside this file is another file (randomfilename-metadata.enc), perhaps in JSON format, containing metadata about the file (including the plaintext hash, the original filename, etc.), which is also encrypted using the same symmetric key that was used to encrypt randomfilename.enc. This way, anyone who has the symmetric key (including the user who uploaded the file, and all other users he/she has shared the file with) can decrypt randomfilename-metadata.enc, get the hash of the plaintext file, and check if they've already downloaded or uploaded the file. But users who do not have the symmetric key will not be privy to the hash of the plaintext file.
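As a rough illustration of the layout described above, here is a minimal Python sketch. The function names (`prepare_upload`, `find_existing`), the metadata field names, and the `encrypt`/`decrypt` callables are all assumptions made for the example; the cipher itself is deliberately left to the caller and would in practice be an authenticated cipher from a vetted library. The bucket is modelled as a plain dictionary of name to bytes.

```python
import hashlib
import json
import secrets
from typing import Callable, Dict, Optional

# Hypothetical signature: cipher(key, data) -> bytes.
# Assumption: the caller supplies a real authenticated cipher here.
Cipher = Callable[[bytes, bytes], bytes]

METADATA_SUFFIX = "-metadata.enc"


def prepare_upload(key: bytes, plaintext: bytes, original_name: str,
                   encrypt: Cipher) -> Dict[str, bytes]:
    """Build the two objects to store: <random>.enc and <random>-metadata.enc."""
    stem = secrets.token_hex(16)  # random 128-bit name, unrelated to the content
    metadata = {
        "plaintext_sha256": hashlib.sha256(plaintext).hexdigest(),
        "original_name": original_name,
    }
    return {
        f"{stem}.enc": encrypt(key, plaintext),
        f"{stem}{METADATA_SUFFIX}": encrypt(key, json.dumps(metadata).encode()),
    }


def find_existing(key: bytes, plaintext: bytes, bucket: Dict[str, bytes],
                  decrypt: Cipher) -> Optional[str]:
    """Return the stored name of `plaintext` if a metadata file readable
    with `key` lists its hash, otherwise None."""
    wanted = hashlib.sha256(plaintext).hexdigest()
    for name, blob in bucket.items():
        if not name.endswith(METADATA_SUFFIX):
            continue
        try:
            metadata = json.loads(decrypt(key, blob))
        except Exception:
            continue  # encrypted under another key, or corrupted
        if isinstance(metadata, dict) and metadata.get("plaintext_sha256") == wanted:
            return name[: -len(METADATA_SUFFIX)] + ".enc"
    return None
```

A user holding several file keys would call `find_existing` once per key, or cache the decrypted metadata locally. Someone without the key only sees random names and ciphertext.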

This seems like it would do all I need. Thank you, I'll try implementing that — Nicolas Marshall, Dec 10 '21 at 11:17

Manish Adhikari · Answer 2 · 2021-10-26T03:57:47.760

If I understand correctly, you need the following:

Looking only at the content of the plaintext, the uploader can tell whether the encrypted form of the file has been uploaded.
A key is shared with recipients so that they can compare the uploaded content with the plaintext they have, to check whether they already have the file.
An attacker with access to the plaintext cannot tell whether any of the uploaded files matches it.

Here is one way that I think can meet the above three requirements. Start with one master key MK. Each person owns a master key and it's their job to protect it; encrypting it with a password-derived key (using a good PBKDF) is one way. Each uploaded file has an ID (FID). The ID can be pretty meaningless if you want privacy, and it does not need to be random or unpredictable; it only needs to be unique for each file.

Derive a file key FK from the master key using a KDF; HKDF-expand will do. So if FID1, FID2, etc. are file IDs, then FK1 = HKDF-expand(MK, FID1), FK2 = HKDF-expand(MK, FID2), etc. This file key is what will be shared with recipients.

Now similarly derive a few keys from each file key FK. Just how many depends on your algorithm, but we will look at two: a file identification key FIDK = HKDF-expand(FK, "fidk") and an encryption key EK = HKDF-expand(FK, "ek"), or something like that. Assuming that HKDF-expand is a good pseudorandom function, you can simply use FIDK xor H(contents) as your file identifier (FIDR). Don't worry, a simple xor with H(file contents) is safe to use with a PRF. I am using a simple xor because you can then hash the file contents just once and compare the result against many file identifiers.

Now on to details. I suggest you use an AEAD algorithm like AES-GCM or ChaCha20-Poly1305. Your FID and FIDR are uploaded in plain text along with the content, and you can use them as your associated data. If you want to use modes like CBC, then the IV needs to be unpredictable and you will need a separate MAC key, so derive those from FK as well. For AES-GCM or ChaCha20-Poly1305, however, it is enough that the IV is unique for each use of a single key, and no separate MAC is required. The recipient can use the FK shared with them to calculate the FIDR for each file and compare it to the uploaded one. The uploader can derive each FK from the master key and the FID. An attacker with no access to the key has nothing to compare to.
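The key derivation and identifier check described in this answer could be sketched roughly as follows, using only the Python standard library. HKDF-expand is written out by hand following RFC 5869 with HMAC-SHA256, and SHA-256 is assumed for H; the helper names (`file_key`, `file_identifier`, `already_uploaded`) and the `published` mapping of FID to FIDR are hypothetical. The master key is assumed to be 32 uniformly random bytes. Encryption of the content itself is not shown.

```python
import hashlib
import hmac
from typing import Dict, Optional


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869) with HMAC-SHA256."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def file_key(master_key: bytes, fid: bytes) -> bytes:
    """FK = HKDF-expand(MK, FID); this is the key shared with recipients."""
    return hkdf_expand(master_key, fid)


def encryption_key(fk: bytes) -> bytes:
    """EK = HKDF-expand(FK, "ek")."""
    return hkdf_expand(fk, b"ek")


def file_identifier(fk: bytes, contents_hash: bytes) -> bytes:
    """FIDR = FIDK xor H(contents), with FIDK = HKDF-expand(FK, "fidk")."""
    fidk = hkdf_expand(fk, b"fidk", len(contents_hash))
    return bytes(a ^ b for a, b in zip(fidk, contents_hash))


def already_uploaded(master_key: bytes, contents: bytes,
                     published: Dict[bytes, bytes]) -> Optional[bytes]:
    """Uploader-side check: hash the file once, then test every published FIDR."""
    h = hashlib.sha256(contents).digest()
    for fid, fidr in published.items():
        candidate = file_identifier(file_key(master_key, fid), h)
        if hmac.compare_digest(candidate, fidr):
            return fid
    return None
```

A recipient would run the same comparison with each FK they were given instead of deriving it from a master key. Without FK, the published FIDR looks like random bytes.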

This does not prevent tampering with the data by those with whom the key has been shared. That needs to be enforced some other way, such as uploading to a distributed ledger like a blockchain, or enforcement by the file hosting server. There is the option of using digital signatures, but long-term digital signatures raise a deniability issue. One workaround is to use one signature key pair per file and to sign the per-file signature public key with a long-term (PKI-backed) signature private key. If some issue arises, publishing the per-file signature private key to the recipient can help with deniability. Moreover, this way we can distinguish between read-only access, in which just the FK is provided, and read/write access, in which the per-file signature private key is also provided. EDIT: If you do that, you might want to sign the FID, the file signature public key, and a commitment to the file key together; the commitment is there in case someone tries to cheat by moving an ID from one file to another. If you sign the FIDR, you will lose your plausible deniability.

score 0 · Answer 3 · answered Oct 24 '21 at 03:03

You can't get both cross-uploader deduplication and decentralization. The only difference between an uploader trying to determine if some other uploader has uploaded a file and an attacker trying to determine if someone has uploaded a file is motivation.

You can get the rest of your desired properties, though:

The uploader picks a random filename. Given a long enough filename and a good random-number generator, this will prevent anyone from guessing the name, and will prevent collisions from other uploaders.
The uploader generates a random symmetric key and encrypts the file with it.
The uploader computes the hash of the plaintext file, and stores this, the symmetric key, and the random filename locally. This prevents re-uploading (at least, from the same device).
The uploader uploads the file to shared storage.
The uploader encrypts the filename, the hash, and the symmetric key using an appropriate per-recipient key and sends the result to the recipient.

A recipient can easily check if they've got a file by comparing the hash to the list of hashes they've already received. An attacker, on the other hand, can't check shared storage to see if a given file is present, and can't get file hashes without compromising either the uploader's computer or a recipient's computer.
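The local bookkeeping in these steps might look like the following sketch. `UploadRecord` and `LocalIndex` are hypothetical names, SHA-256 is an assumed choice of hash, and the actual encryption, upload, and per-recipient delivery are left out; only the random naming and the local hash index are shown.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UploadRecord:
    remote_name: str     # random name used in shared storage
    file_key: bytes      # random symmetric key for this file
    plaintext_hash: str  # kept locally or sent encrypted, never published


class LocalIndex:
    """Local record of known files, keyed by plaintext hash."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, UploadRecord] = {}

    def has(self, plaintext_hash: str) -> bool:
        return plaintext_hash in self._by_hash

    def lookup(self, plaintext: bytes) -> Optional[UploadRecord]:
        return self._by_hash.get(hashlib.sha256(plaintext).hexdigest())

    def register_upload(self, plaintext: bytes) -> UploadRecord:
        """Return the existing record, or create a random name and key."""
        existing = self.lookup(plaintext)
        if existing is not None:
            return existing  # already uploaded from this device
        record = UploadRecord(
            remote_name=secrets.token_urlsafe(32),
            file_key=secrets.token_bytes(32),
            plaintext_hash=hashlib.sha256(plaintext).hexdigest(),
        )
        self._by_hash[record.plaintext_hash] = record
        return record

    def register_received(self, record: UploadRecord) -> None:
        """Recipient side: store a record decrypted from the uploader's message."""
        self._by_hash[record.plaintext_hash] = record
```

The uploader would send the three fields of `UploadRecord` to each recipient, encrypted under the per-recipient key. A recipient calls `has()` on the received hash before deciding whether to download.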
