19

Let's take MD5 for example:

It outputs a 128-bit hash. Does it make sense (in theory) to choose an input (password) which is itself longer than 128 bits?

Does it increase the probability of a collision in any way?

I know that MD5 is broken, so what about more modern algorithms like bcrypt or scrypt?

ComFreek

3 Answers

41

If you consider a set of potential passwords of size P, with a hash function with N possible output values, then the probability that there exists at least one collision in this set is quite low when P is less than the square root of N, and quite high beyond. See the birthday problem. As the Wikipedia page says, that probability is approximately equal to 1 - e^(-P^2/(2·N)).

With figures: if you use passwords consisting of 10 characters (uppercase letters, lowercase letters and digits), then P = 62^10 = 839299365868340224; with a 128-bit hash function, N = 2^128. The formula above means that there may exist at least one colliding pair among all these passwords with probability close to 0.1%. On the other hand, if you add one character to your passwords (i.e. the potential passwords have length 11, not 10), then the probability that there exists at least one collision rises to 98.1%.
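If you want to double-check those figures, here is a quick Python sketch that plugs the numbers into the approximation above; it is purely illustrative:

```python
# Quick check of the birthday-bound figures above, using the
# approximation p ≈ 1 - e^(-P^2 / (2N)).
import math

def collision_probability(P: int, N: int) -> float:
    """Approximate probability that at least one collision exists among
    P uniformly chosen passwords hashed to N possible output values."""
    return 1.0 - math.exp(-(P * P) / (2.0 * N))

N = 2 ** 128  # output space of a 128-bit hash such as MD5
print(collision_probability(62 ** 10, N))  # ~0.001 (about 0.1%)
print(collision_probability(62 ** 11, N))  # ~0.981 (about 98.1%)
```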

Now, all of this is about the probability that a collision exists, not the probability of actually hitting one.

Collisions are not relevant to password hashing. Password hashing relies on preimage resistance: given the hash, how hard it is to guess a corresponding password. Note that I said "a", not "the": for the attacker, it does not matter whether he finds the same password the user chose; he just wants a password which grants access, and any password which matches the hash output will do the trick.

Note that while MD5 is "broken" for collisions, it is not so for preimages (well, for preimages it is "slightly dented", but not significantly for the purposes of this question).

There are two ways to break preimage resistance:

  1. Guess the password. This means trying all potential passwords until a/the right one is found. If there are P possible passwords with uniform probability, then this has an average cost of P/2: the user chose one of those passwords, and the attacker will need, on average, to try half of them before hitting that exact password. (A minimal sketch of this attack follows the list.)

  2. Get lucky. Try passwords (random, consecutive... it does not matter) until a matching hash value is found. This has average cost N/2.
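As a concrete illustration of attack 1, here is a minimal sketch of a dictionary guess against an unsalted MD5 digest; the wordlist and target below are made-up examples:

```python
# Minimal sketch of attack 1 ("guess the password") against an unsalted
# MD5 digest. The wordlist and target below are made-up examples.
import hashlib

def guess_password(target_hex, candidates):
    """Hash each candidate and stop at the first one matching the target.
    Any match grants access; it need not be the user's original choice."""
    for pw in candidates:
        if hashlib.md5(pw.encode()).hexdigest() == target_hex:
            return pw
    return None

wordlist = ["letmein", "hunter2", "correcthorse"]   # assumed attacker wordlist
target = hashlib.md5(b"hunter2").hexdigest()        # stand-in stolen hash
print(guess_password(target, wordlist))             # -> hunter2
```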

The password hashing strength will be no more than the lower of the two. In that sense, using a set of possible passwords which is larger than the output space of the hash function (e.g. P > 2^128 for a 128-bit hash function) does not bring additional security, because beyond that point the "get lucky" attack becomes a better bargain for the attacker than the "guess the password" attack, and the "get lucky" attack does not depend on how the user actually chooses his password.

Please note that I say "size of the set of passwords" and NOT "password length". All the analysis above is based on how many password values could have been chosen, with uniform probability. If you use only 200-letter passwords, but you may only choose ten thousand of them (e.g. because each "password" is a sentence from your favourite book, and the attacker knows that book), then the size of the set of potential passwords is 10000, not 62^200.

In practice, P is bounded by the user's brain (the user has to remember the password) and is invariably lower than N. A "very strong" password comes from a selection process which uses a P of 2^80 or more; that is sufficient for security, and yet far below the 2^128 of MD5 or the 2^192 of bcrypt. But it is unrealistic to expect average users to choose very strong passwords. Instead, we must cope with weak passwords, with P around 2^30 or so (meaning: try one billion possible passwords, and you'll have broken the passwords of half your users). Mitigation measures are then slow hashing (make each guess expensive) and salts (don't let the attacker attack several passwords in parallel at reduced cost). See this answer.
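As a minimal sketch of those two mitigations, using hashlib.scrypt from the Python standard library; the cost parameters are illustrative assumptions, not a tuning recommendation:

```python
# Sketch of the two mitigations: a per-user random salt and a deliberately
# slow, memory-hard hash (scrypt). The cost parameters are illustrative only.
import hashlib, hmac, os

def hash_password(password, salt=None):
    """Return (salt, tag); a fresh random salt per user stops the attacker
    from attacking many users' passwords in parallel at reduced cost."""
    salt = salt if salt is not None else os.urandom(16)
    tag = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, tag

def verify(password, salt, tag):
    """Recompute the tag with the stored salt and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], tag)

salt, tag = hash_password("correct horse battery staple")
print(verify("correct horse battery staple", salt, tag))  # True
print(verify("Tr0ub4dor&3", salt, tag))                   # False
```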

Tom Leek
  • Damnit. Every time I post a half-reasonable answer to a crypto question, you come along and blow it out of the water. Excellent work, sir. You git. – Polynomial Sep 10 '13 at 14:21
  • Excellent answer, many thanks! BTW, I already came across other answers from "bears"; it must be that they produce high-quality answers :) – ComFreek Sep 10 '13 at 14:46
  • @ComFreek http://meta.security.stackexchange.com/a/884/10211 – Sep 10 '13 at 15:08
  • +1 for the quality of this (and many other) posts! I started to read it and thought "I'm sure this is one of the bears!"... Keep up the good work, and thanks for taking the time to distill knowledge in a clear, concise, global, and elegant way! – Olivier Dulac Sep 11 '13 at 09:07
7

Hashing reduces an infinite space, i.e. possible data inputs, to a finite space, i.e. possible hashes. Therefore there will always be collisions.

Technically, if you restrict your input set M to fewer than 2^h messages, where h is the size of your hash output in bits, then you decrease your chance of a collision. In fact, as len(m) tends to h bits, the probability of a collision when exhaustively hashing all values in the set M tends to 1.

That being said, given a large enough value of h, exhaustively hashing all M is highly impractical - for SHA256 you'd need to perform 2^255 operations before hitting a 50% chance of collision with a pre-chosen value.
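To see the scale effect, here is a toy sketch with a hash deliberately truncated to 16 bits; a collision turns up after only a few hundred inputs, as the birthday bound predicts, while the full 256-bit output stays far out of reach of such a search:

```python
# Toy illustration: truncate SHA-256 to 16 bits and brute-force a collision.
# With only 2^16 possible outputs, one appears after a few hundred inputs
# (birthday bound ~ 2^8); the full 256-bit output is far beyond such a search.
import hashlib

def tiny_hash(data: bytes) -> bytes:
    """First 2 bytes (16 bits) of SHA-256: a deliberately weakened toy hash."""
    return hashlib.sha256(data).digest()[:2]

seen = {}
for i in range(100_000):
    msg = str(i).encode()
    h = tiny_hash(msg)
    if h in seen:
        print(f"collision: {seen[h]!r} and {msg!r} both hash to {h.hex()}")
        break
    seen[h] = msg
```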

The important thing to remember is that, for a string longer than h bits, your security is never less than that of an h-bit message, assuming there are no specific vulnerabilities in the hash that cause multi-block messages to be less secure.

So to be blunt: yes, a message shorter than your hash output length does, statistically, have a lower chance of collision, but the number of operations required to find that collision scales with length.

Polynomial
0

Yes, it absolutely makes sense to pick passwords longer than the hash output size. Why?

  1. Password hashing algorithms are designed to produce output that is computationally indistinguishable from uniform randomness. From the point of view of an attacker who steals your password database, the password tags look like random byte sequences, none more likely than any other.
  2. Plaintext passwords are very much non-uniform—some byte sequences are massively more likely to be a password than some others.

What this tells you is that the entropy of a typical n-byte password is much, much lower than the entropy that can be encoded into an n-byte password tag. In simpler terms, plaintext passwords are very predictable, and so a real-life database of plaintext n-byte passwords can easily be compressed to fewer than n bytes per password. So in a sense, the n-byte passwords are not using up all the capacity that the n-byte hash provides.
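As a rough illustration of that gap (the numbers describe hypothetical selection processes, reusing the book-sentence example from the accepted answer):

```python
# The entropy of a password comes from how it was chosen, not from its
# length in bytes. Both selection processes below are hypothetical.
import math

book_sentences = 10_000        # 200-character sentences from one known book
random_10_chars = 62 ** 10     # 10 random characters from [A-Za-z0-9]

print(math.log2(book_sentences))   # ~13.3 bits despite the 200-character length
print(math.log2(random_10_chars))  # ~59.5 bits despite being much shorter
```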

Luis Casillas
  • Yes, but if 1234 and bLit*2Nps!E&6EqlbKGYS1B4LjgU$IG9парольtUZfV%n6i908Hy58*^&zMljiBX2 hash to the same value, does it really help to have picked the big one? – Brad Werth Sep 16 '16 at 06:09
  • That "if" has a really unlikely antecedent. It not only requires you to stumble into a long, seemingly entropic password that collides with a tiny, unentropic one, it requires you to do so for *salt value that you do not choose*. – Luis Casillas Sep 16 '16 at 07:11