Does knowing how an encrypted file changed make it vulnerable?

Question

Scenario

The following bash commands create an empty file test.txt, encrypt it using a default algorithm to test1.gpg, then append the line new line to the original file and encrypt it again to test2.gpg. Each of the gpg commands prompts the user to enter an encryption key, which is not printed to the terminal.

$ touch test.txt
$ gpg --output test1.gpg -c test.txt
$ echo "new line" >> test.txt
$ gpg --output test2.gpg -c test.txt
$ ls
test1.gpg  test2.gpg  test.txt

Suppose that an attacker has obtained the encrypted files test1.gpg and test2.gpg, and also has obtained access to my shell history—that is, they know exactly how the decrypted versions of these files differ, namely, by the addition of the line new line.

Furthermore, suppose that the attacker knows (or guesses) that I entered the same encryption key at the prompt both times, but does not know the value of the key itself.

Is this information enough to enable the attacker to recover test.txt in its entirety?

Scope of this question

I am interested if encryption algorithms in general are designed to anticipate this sort of attack, but if there are differences among the various encryption algorithms out there, you may note that in your answer.

Alternatively, you may respond with respect to the default algorithm invoked by the shell commands above, which is described in man gpg as follows:

       -c     Encrypt with a symmetric cipher using a passphrase. The  default  symmetric  cipher
              used  is AES-128, but may be chosen with the --cipher-algo option. ...

Why I think this question has merit

Many cloud storage services offer some form of encryption, but there are also many common file-manipulation tasks, such as as updating a date, that could enable an attacker to guess exactly what changes were made.

Follow-up questions

Does the vulnerability change depending on the type of edit, e.g. appending versus prepending versus modifying content in the middle of the file versus something else?
What if many sample edits are available—say, 10, 10 thousand, or a million?

Additional info

$ gpg --version
gpg (GnuPG) 2.2.19
libgcrypt 1.8.5
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Home: ~/.gnupg
Supported algorithms:
Pubkey: RSA, ELG, DSA, ECDH, ECDSA, EDDSA
Cipher: IDEA, 3DES, CAST5, BLOWFISH, AES, AES192, AES256, TWOFISH,
        CAMELLIA128, CAMELLIA192, CAMELLIA256
Hash: SHA1, RIPEMD160, SHA256, SHA384, SHA512, SHA224
Compression: Uncompressed, ZIP, ZLIB, BZIP2

the devil is in the details! can you please edit your post to add information from `gpg --version`, as this will likely be relevant to any answer that talks to the default behaviour of this tool ; also, are your requirements for file-modification strictly "append only" (ie. `>>`)? (depending on the class and behaviour of cipher this would likely have an impact as well) — brynk, Jun 27 '22 at 03:54
Done. As noted in the "scope of this question" section, what I am really looking for is a general summary of when this type of attack is viable (to include types of edits, encryption algorithm, number of edits available, etc.). However, if such an exhaustive response is impractical, I am also willing to accept one that deals with the exact setup described. — Max, Jun 27 '22 at 04:15
What you describe is named by cryptographers as the "IND-CPA" property. Resistance to such attacks is a requirement for modern ciphers. More information here: https://en.wikipedia.org/wiki/Ciphertext_indistinguishability — A. Hersean, Jun 27 '22 at 09:50
Read about the Kerckhoffs's principle https://en.wikipedia.org/wiki/Kerckhoffs%27s_principle. The security of a crytptosystem must only rely on the key. The fact that you used a same key ti encrypt two different texts should not be enough to break a secure cryptosystem. — Maf, Jun 27 '22 at 22:07

score 3 · Accepted Answer · answered Jun 27 '22 at 22:32

The short answer is no, and this is a fundamental property of (non-broken) encryption schemes. Intuitively speaking, if you don't know the decryption key, the only information you can get from a ciphertext is the length of the plaintext. The formal statement of this property is semantic security.

In your scenario, anyone can tell that test.txt is empty, since the ciphertext exposes the length of the plaintext. However, as soon as test.txt is non-empty, there is no way to tell its contents from the ciphertext, even if the attacker knows the ciphertext for related plaintexts. In fact, suppose that the attacker can choose ciphertexts and get any number of ciphertexts decrypted apart from test1.gpg itself (a chosen-ciphertext attack). Even then, the attacker won't be able to know anything about the plaintext of test1.gpg other than its length.

With PGP/GnuPG, you do need to be careful about compression. The encryption itself is secure, but the sequence compression+encryption is not, because the compressibility of the message, and therefore the length of what is encrypted, depends on the contents.

Most encryption schemes that are used in practice are semantically secure. However, there is one exception: disk encryption usually uses ciphers that are not semantically secure, for performance and storage space reasons¹. (XTS is popular.) This is generally ok because the typical threat model for disk encryption does not include chosen-ciphertext attacks, only chosen-plaintext attacks (if even that). Disk encryption is only intended to protect against theft of the storage media, and once the media is stolen it doesn't get used anymore. If an attacker manages to make copies of multiple versions of the ciphertext of a disk sector, they may be able to obtain partial information about its contents.

¹ _{Since the attack model excludes chosen-ciphertext attacks, disk encryption is usually unauthenticated, which saves the space to store authentication tags and the need to check authentication tags when reading back. And since the attack model assumes the attacker only sees a single version of the ciphertext, disk encryption usually reuses the same IV throughout the history of a sector, instead of storing a new IV each time.}

You write that "In your scenario, anyone can tell that `test.txt` is empty," whereas @CBHacking's answer seems to imply that AES-128 pads the input to achieve a round number of blocks. (Note that `cat test1.gpg` produces nonempty output.) If I am understanding correctly, I think the strictly accurate statement is "Anyone can tell that `test.txt` is shorter than 128 bits" (which, of course, is basically nothing). — Max, Jun 27 '22 at 23:56
@Max When using the commands from the OP question, on my computer (gpg 2.2.27) `test1.gpg` is 78 bytes long, while `test2.gpg` is 87 bytes longs, and the string `"A\n"` (2 bytes) gets encrypted to 80 bytes. So GPG leaks the length and Gilles is correct, is is possible to deduce that the first `test.txt` file is empty. — A. Hersean, Jun 28 '22 at 08:47
In this case AES-CFB is used by GPG without padding, avoiding all potential padding attacks. GPG also leaks the name of the encrypted file without having to decrypt the GPG archive. — A. Hersean, Jun 28 '22 at 08:54

score 2 · Answer 2 · answered Jun 27 '22 at 08:03

This question is plausibly better suited to crypto.SE, but there is of course overlap. The short answer is almost certainly "yes, modern ciphers are designed to be secure against all known-plaintext attacks, including modification". In general, a "crib" (a chunk of known text at a known position in the message) isn't enough to meaningfully impact the security of modern ciphers; even if all but one byte of the message is known that last byte should be unguessable. That's true whether or not you know a prior state of the message. However, I admit I don't know of any studies specifically aimed at either proving that modification is safe, or reducing it to a known-safe operation.

It's also worth noting that "encryption algorithms in general" encompasses a lot of things that are not necessarily secure against cribs; cryptography people just always tell people to never use the insecure constructions. People don't always listen (or bother to check with a cryptography person at all) though, of course. For example, any block cipher in ECB mode is potentially vulnerable in a situation like this; if the attacker knows the plaintext of any given block, and any other given block has the same ciphertext, then the attacker knows the plaintext of that other block as well. This is one of the reasons you should never use ECB, but it does exist nonetheless and crops up occasionally in production code.

Also, even with a secure cipher/mode/etc., there's some information leaking here anyhow. For example, encryption doesn't hide the message length very well (sometimes doesn't hide it at all, with e.g. stream ciphers), so knowing that you added 9 bytes (counting the newline) to the file would allow them to learn things about the original length (assuming they can't just immediately tell that it's empty, which sometimes they could). For example, with a 128-bit block (like AES and some other ciphers use), this would not change the message length for the most common padding scheme - it would still be one block - so the attacker would know the original file was at most 6 bytes long (if the new message was a total of 16 bytes or more, there would be a second block even if it's all padding). With a 64-bit block (used by all forms of DES and some other ciphers), it would change the file from one block to two blocks, once again revealing that the original was at most 6 bytes long (which is slightly more info than they knew before, which was that the original was at most 7 bytes long).

brynk · Answer 3 · 2022-06-27T11:41:08.730

Knowing how an encrypted file changed doesn't give up much in deciphering the encrypted data, assuming that modern cryptographic protocols and strong key material are in use. (I'll ignore the obvious leaking of plaintext data in the command-line history!)

As long as the key in use is randomised with high degree of uncertainty (aka high-entropy) the encrypted output produced will be as well. Subsequent messages encrypted with the same key generally add some additional entropy to the "starting position", which leads to different output from the very first byte of encrypted data. This can be called the initialilsation vector (aka IV) or nonce depending on the context, and it must be unique for each discrete key+message pair.

In the event that the IV or nonce is re-used, then there is the potential for outputs to be the same up to the point that the plaintext data starts to change.
In a block cipher that uses some form of chaining, all subsequent data will differ because the earlier blocks feed into the later ones (see ^6).
This is a greater weakness for a streaming cipher or counter-mode block cipher because the XOR of the plaintexts is equal to the XOR of the ciphertexts (see ^4 and ^7).

Having said that, as long as these IV/ nonce values remain unique, even a million re-uses of the key likely won't have an impact on its discoverability, as long as the maximum data limits are adhered to (many, many gigabytes). Be sure to check the limits of the cipher.

WITH RESPECT TO gpg AND ITS DEFAULTS

Your command as listed is reasonably secure because the ciphertext will differ every time the file is re-encrypted (a block of random data, plus two bytes, is prepended each time the output is generated).

In the following modified command I've increased the output space of the hash digest through the use of --Xdigest-algo=SHA512 at all points where this is relevant. (This may break compatibility with other implementations or older versions.)

You could consider embedding a recipient signature in the output, which allows file verification without decryption. I've also disabled compression with --compress-algo=none in this example, if that risk is of a concern (see discussion in ^1).

gpg --sign --output test.gpg -c --compress-algo=none --s2k-mode=3 --s2k-count=65011712 --s2k-digest-algo=SHA512 --digest-algo=SHA512 --cert-digest-algo=SHA512 test.txt

NOTE ON S2K PASSWORD-BASED KDF

Even with the same password, a new random 8-byte salt is generated on each invocation when using --s2k-mode=3 (discussed subsequently) which has the effect of changing the encryption key.

However, the way gpg derives the encryption key from the password is somewhat dated by modern standards, see for eg. Argon2 pwd kdfs. In the modified command I've configured the password-based key derivation function to be as strong as possible allowed in this version via --s2k-X flags (this choice should be reviewed on newer versions). Since this kdf isn't very hard by modern standards your typed password should be as strong as humanly possible (see ^8).

Another option might be to use another mechanism to produce a password string from a harder kdf (eg. using openssl or cryptsetup) then provide this via stdin with the --passphrase-fd=0 option.

MORE READING

Is it safe for GPG to compress all messages prior to encryption by default Marthenal and friends'13 discuss compression as a potential source of vulnerability.
GnuPG symmetric encryption yields different output rdanitz and friends'15 give insight into why the output is always different.
rfc4880 OpenPGP Message Format, linked to Section 9.2 Symmetric-Key Algorithms, and also Section 13.9 OpenPGP CFB Mode
if the IV didn't change as well then the XOR of the two plaintext versions of the same block would be leaked (via the XOR of the two ciphertexts MechMK1 '20).
My very shallow overview (2020) of GoCryptFS: https://security.stackexchange.com/a/245556
https://en.wikipedia.org/wiki/Block_cipher
https://en.wikipedia.org/wiki/Stream_cipher
This answer discusses forcing S2K Leek'16 with the up-shot being that it rests on the strength of the input password, also see the open hashcat issue to add support for GPG/PGP private key

This answer is overly complex and adds a lot of information irrelevant to the question. — A. Hersean, Jun 27 '22 at 10:00
The answer is long, but I don't find it terribly complex, although I recognize that there are portions that (as a novice) I can only understand at a surface level. — Max, Jun 27 '22 at 10:07