10

This is not for passwords. I understand that MD5 and SHA-512, etc... are insecure because they can have collisions.

However, is it still possible to have a collision if the string length is less than the hash size (i.e. MD5 is a 32 character hash)? So if the string is less than 32 characters, is it possible to still have collisions?

For example, to authenticate, I was thinking it would check length of password and that it matches say the MD5/SHA password hash.

So take the MD5 hash for the input: somestring.

The MD5 hash is: 1f129c42de5e4f043cbd88ff6360486f

The input string is 10 digits long. So if I wanted to check it was valid, my parameters would be (pseudo code):

if MD5(“somestring”) = 1f129c42de5e4f043cbd88ff6360486f AND string length = 10, then authenticate string. Else it is not authenticated.

However, this only works if no other string that is 10 characters long equals the MD5 hash of: 1f129c42de5e4f043cbd88ff6360486f

Can shorter strings of the same length collide with MD5 or SHA?

If yes, would they still collide if they were only alpha numeric? What could an an example of this be?

Seth Knorr
  • 109
  • 1
  • 4
  • 18
    *"md5 is a 32 character hash"* - No. MD5 is a 128 bit hash, i.e. __16__ bytes binary. A common representation of these 128 bit is as 32 hexadecimal characters. But it could also be represented as a 24 character base64 string and there are surely other representations possible. – Steffen Ullrich Mar 08 '21 at 05:56
  • 22
    "I understand that md5 and sha512, etc... are insecure because they can have collisions." – No, that is not why they are insecure. And all hashes always have collisions, that is trivially true for every hash. – Jörg W Mittag Mar 08 '21 at 15:27
  • @pipe: You are right. I've misread the question. – Steffen Ullrich Mar 09 '21 at 06:03
  • 1
    Forget about checking the length. But: use a random salt, and use a better (slower) algorithm. See [In 2018, what is the recommended hash to store passwords: bcrypt, scrypt, Argon2?](https://security.stackexchange.com/questions/193351/in-2018-what-is-the-recommended-hash-to-store-passwords-bcrypt-scrypt-argon2) for the current recommendations (Argon2, scrypt) and [How to securely hash passwords?](https://security.stackexchange.com/questions/211/how-to-securely-hash-passwords) for a very thorough and detailed description (with the older recommendations of crypt and PBKDF2). – jcaron Mar 09 '21 at 17:25
  • 3
    @jcaron The first line is literally "this is not for passwords". –  Mar 09 '21 at 22:34
  • 2
    @MechMK1 The problem is that line is contradicted by the rest of the question. E.g.: "For example, to authenticate, I was thinking it would check length of password and that it matches say the md5/sha password hash." – Jon Bentley Mar 10 '21 at 12:15
  • Ange Albertine compiled interesting findings about this topic for a presentation ([slides](https://speakerdeck.com/ange/kill-md5-scs), [a video](https://www.youtube.com/watch?v=JXazRQ0APpI)) including examples to yield collisions for clearly different files, or collisions despite hash functions used a salt. – Buttonwood Mar 11 '21 at 15:31

4 Answers4

37

Short answer: collisions don't matter for password verification. The length of the input is irrelevant.

I understand that md5 and sha512, etc... are insecure because they can have collisions.

No, this is wrong. MD5 and SHA-1 are insecure because it is possible in practice to find collisions.

SHA-512 and the other SHA2 variants (SHA-256, SHA-384, etc.) have collisions. We know this by applying a very simple mathematical theorem, the pigeonhole principle. A SHA-512 hash is a 64-byte string. There are strictly more strings of 0 through 64 bytes than strings of exactly 64 bytes. Therefore there exist two distinct strings of at most 64 bytes that have the same SHA-512 hash. This is a non-constructive proof. We have absolutely no clue how to actually find two such strings. SHA-512 is secure because cryptographers have tried to find two such strings for years and haven't gotten any closer. This is the standard for a cryptographic algorithm being “secure”.

This is not for passwords.

Indeed. For passwords, collisions are irrelevant. If there are multiple valid passwords, it doesn't matter. What matters is that an attacker cannot find any of the valid passwords. SHA2 is as insecure as MD5 and SHA-1 for password hashing: password hashing must be slow and salted. It's an unfortunate historical accident that the word “password hashing” includes “hash”: the two have very different properties.

However, is it still possible to have a collision if the string length is less that the hash size. Ie. md5 is a 32 character hash. So if the string is less than 32 characters, is it possible to still have collisions?

Maybe. We don't know. We know that there are collisions among strings that are at most the size of the hash, by the pigeonhole principle, as I wrote above. We don't know whether there are collisions among strings that are exactly the size of the hash: it's possible that all of them map to a different image. Mathematically, that would mean that the hash, with an n-bit output, is a permutation on strings of length n bits. This is extremely unlikely since the proportion of functions on n-bit strings that are permutation is extremely small, but we don't know for sure.

I don't know, without doing more math than I care to at this time, whether it's likely that there are collisions amongst strings that are strictly less than the size of the hash. I won't do the math because it doesn't matter.

By the way, MD5 is actually a 16-byte hash. It's 32 characters long if you write those bytes in hexadecimal, but most 32-character strings are not the hexadecimal representation of anything, so counting 32-character strings is irrelevant.

However, this only works if no other string that is 10 characters long equals the md5 hash of: 1f129c42de5e4f043cbd88ff6360486f

Wrong. This only works if it is impossible to find another string that is 10 characters long and has the same hash. If another string exists but nobody can find it, that's good enough.

MD5, SHA-1 and SHA2 all use the Merkle-Damgård construction. They work by processing the input one block at a time. For each input block, they apply a compression function (a different one for each MD or SHA variant) to the current internal state and the input block to produce the next internal state. At the end, they append some padding to the last partial block and apply the compression function one more time (or more depending on the exact length of the input). The techniques for finding collisions for MD5 and SHA-1 attack the compression function: they are based on finding block content B1 and B2 such that two distinct internal states S1 and S2 map to the same resulting internal state S'. The techniques allow very little control over the content of B1 and B2, so they don't allow creating collisions where the input is less than one block long, perhaps minus a few bits at the end which can be brute-forced to look like the valid start of padding. (In fact, as far as I know, it takes two blocks of input to reach a collision.) So we don't know how to find collisions on strings that are shorter than the block size. Both MD5 and SHA-1 have a 64-byte block, so there is no way to find a collision where one of the strings is shorter than about 60 bytes.

Of course, even if the current attack techniques don't allow finding a collision on short input, it would be stupid to use a known broken hash function. Attacks only ever get better.

Gilles 'SO- stop being evil'
  • 51,415
  • 13
  • 121
  • 180
  • 1
    According to this answer two chinese researches found a 64byte collision in 2013 - so it is not unlikely there are even shorter possible collisions. If you string can contain unicode special characters a 10 character string could as well contain 40+ bytes (especially if an attacker can control the contents) - besides being theoretically possible, this starts looking almost feasible. – Falco Mar 09 '21 at 09:50
10

None of those hash functions were designed to have this property. You are asking for an injective secure hash-function f(A)->A and this is a mathematically very difficult task. And nobody expects/needs this property.

Anyway, to be 100% correct, since nobody has ever found such sha512 collision, the answer to your question should rather be, we don't know. But it's like if you asked me if it has ever rained in Seattle. Actually, I'm sure it rains there a lot. But I've never been there, so from the mathematician's perspective, I would have to say that I don't know.

And it's the same with those short collisions. I'm sure, they exist like I'm sure it rains in Seattle.

smrt28
  • 875
  • 6
  • 12
5

MD5 produces a 128-bit hash value. Thanks to the birthday paradox, that means we can expect to find a collision after searching on the order of 2^64 input values. Which means there's a good chance of a collision existing between two 8-byte inputs. If the inputs are restricted to printable ASCII strings, it's going to take about 10 characters of input before we expect to start to see collisions.

So it's certainly possible for collisions to exist between strings shorter than 32 characters. In fact, we expect a huge number of them to exist. It's even possible (though unlikely) there exist collisions between 9-or-less-character ASCII strings.

(I don't know of any specific examples of short collisions. AIUI there was a brute-force search for MD5 collisions running for a while, but it was cancelled when non-brute-force methods beat them to the finish line.)

Also: I'm not really clear on your intended application here. You say you're not using this for passwords, but you seem to be talking about "authenticating" short text strings... which is likely to have many of the same security constraints as password hashing. Depending on what you're actually doing, a password hash (e.g. Argon2) might be the best fit.

Gordon Davisson
  • 2,601
  • 1
  • 17
  • 13
2

Most of the answers here address the actual question that was asked:

Can shorter strings of the same length collide with md5 or sha?

And the answer is "absolutely they can" and even more precicely it is mathematically guaranteed that there're collisions if the input size is from 0 to N where N is the hash output size.

However, it should also be mentioned that if you save/persist the plaintext password length into database/storage-medium where the hash is, you're actually making bruteforce attack easier. So instead of needing to enumerate all the password sizes up to a certain length they can just focus on the right length from the beginning. They could also group them by the length so they can attack all the same length passwords at the same time.

So from security perspective saving the plaintext password length and then comparing also that in addition to the hash, actually decreases the total security.

Puzzled
  • 121
  • 2
  • I think the question is whether there exist collisions with lengths shorter than N. – Paŭlo Ebermann Mar 09 '21 at 23:22
  • @PaŭloEbermann that's true, and it has been already answered. However, since this is "security" exchange, it is also good to address the security implications of the larger proposed scenario. Name that it's not a good idea to store the plaintext password length into database. And since it's not a good idea, it's also not a good idea to compare the plaintext password length in addition to the hash. This also includes other forms of authenticating data, but it would be ok to check length for transmission errors. – Puzzled Mar 10 '21 at 14:34