19

We all know that we're supposed to take a fairly slow hashing algorithm, salt the password, and run the hash for many iterations. Let's say that I'm following almost everything except for one rule, and I have a static salt. Something like this:

password = 'yaypuppies' + some_static_salt

1000.times do
    password = amazing_hash(password)
end

And now password is a great hashed and salted thing. All is well with the world.

But what if we ran it a whole lot more iterations?

3000000000000000000.times do # 3 quintillion
    password = amazing_hash(password)
end

Would, in theory, many passwords collide? I.e. would this happen?

pass1 -> lkajsdlkajslkjda > 23oiuolekeq > n,mznxc,mnzxc > common_thing > 987123oijd > liasjdlkajsd > 09893oiehd > 09uasodij
pass2 -> loiuoklncas > 9830984cjlas > ioasjdknckauyieuh > common_thing > 987123oijd > liasjdlkajsd > 09893oiehd > 09uasodij

And both passwords end up hashed to 09uasodij?

With a non-randomized-per-password salt, does the chance of a collision go up with every iteration added?

Undo
  • 450
  • 5
  • 14
  • 4
    Hash collisions in modern algorithms are not considered an issue when it comes to password hashing. – Lucas Kauffman Jun 24 '14 at 07:20
  • 1
    You might be interested in how rainbow tables work, as the concept that they are based on is somewhat similar to your misunderstanding of the chains. – PlasmaHH Jun 24 '14 at 12:17
  • This question seems to be asking if it's possible for two different strings in the output space of a hash algorithm to hash to the same value. I would suspect that for most algorithms, this is at least close to true. The is h(x) = h(y) is possible for different x, y if and only if not both x and y are in the output space of h. – Cruncher Jun 24 '14 at 15:39
  • @LucasKauffman yes. Because nobody hashes their passwords a bijillion times. What if we increased that number significantly. What if we (theoretically) went through graham's number of iterations? If any 2 hashes(that is string in the output space of the algorithm) can hash to the same value, then it's very possible that this would result in all identical hashes at the end. – Cruncher Jun 24 '14 at 16:02
  • @Cruncher the point of secure hashing algorithms is that the output is large enough to make it unfeasible to reach the point. By the time we would need to start hashing several billion times a new algorithm will already have been devised. That's the whole point of NIST's Secure Hashing Algorithms program. We aren't using MD5 or MD4 to hash our passwords anymore. You can still argue that you could implement MD5 into PBKDF2, but there would be no reason considering the vast amount of alternatives. Hence collisions aren't considered an issue, because if it is you chose the wrong algorithm. – Lucas Kauffman Jun 24 '14 at 18:24
  • @LucasKauffman This is a very theoretical question. Obviously impractical – Cruncher Jun 24 '14 at 18:57
  • I guess it would be better to move it to crypto then – Lucas Kauffman Jun 24 '14 at 19:04
  • The salt could be used like this: `while i < 3000000000000000000 //3 quintillion hash = amazing_hash(hash . salt . password) end`, that way if the hash of two different passwords collide at some point, the hash for the next iteration will be different – Lie Ryan Jun 25 '14 at 10:17

8 Answers8

23

When iterating a hash function, space reduction occurs, but not down to a single point. For a randomly chosen function (that your "amazing_hash" is supposed to approach), with a n-bit output, you may expect to ultimately reach a cycle of size 2n/2 or so, i.e. still big enough if you use a decent output size (say, n = 256).

See this answer for more detailed explanations. I reproduce here the scheme from that answer, because it is a good eye-catcher:

The "rho" diagram for an iterated hash function

Of course, a "static salt" is not a salt; it just means that you are using a custom hash function. The salt is meant to deter parallel attacks: when the attacker tries to crack 10 passwords, it costs him 10 times the cost of cracking one. With a "static salt", cracking 10 passwords costs no more than cracking 1, i.e. a total failure of the salting.

Salts are not about avoiding collisions, notably because collisions are not a problem for password hashing. It is preimage resistance that you should worry about.

Tom Leek
  • 170,038
  • 29
  • 342
  • 480
  • 7
    Strictly speaking, "static salt" is not quite a *total* failure of the salt. The attacker at least cannot attack your site in parallel with others that use different static salts, or apply a rainbow table off the shelf. – Steve Jessop Jun 24 '14 at 18:46
  • I think a bigger issue is that even if one starts with e.g. a full set of all 2^128 different 128-bit vector and passes them repeatedly through random mappings, the number of collisions will fall off rapidly with the number of different vectors. For example, by the time the number of vectors has fallen to 2^120, each iteration will only lose about 1/256 of them. By the time it has fallen to 2^100, each iteration will only lose about 1/268,000,000. Theoretically, if one used a different random mapping at each iteration, one might "eventually" map all starting vectors to a single value... – supercat Jun 24 '14 at 20:24
  • ...but the rate at which collisions occur would drop off so markedly that by the time one was down to 2^64 values one would only be losing about one possible value (out of 2^64) per iteration, and one wouldn't have to go much beyond that before it started taking on average over a million iterations for each value lost. – supercat Jun 24 '14 at 20:33
10

I don't think so, just because the hash would almost certainly reach "common_thing" at different points. One password might has to "common_thing" at step 10,000, and another at step 100,000. The chains will track each other in parallel, but they won't necessarily be at the same point when the algorithm ends.

Whether the cycles are large or small, the probability is still low. If there are many small cycles, a value is less likely to end up in any particular one of them; if there are a few large cycles, the value is less likely to end the chain at the same point.

As other people have said, the reason you don't use a static salt is to keep attackers from creating rainbow tables. I'm not sure that the static salt has any effect on the number of collisions over time, other than that of course identical values will hash to the same value.

Take all this with a grain of salt, though; I'm a crypto enthusiast, but not an expert. I'd love to hear more about cycles in hash algorithms if other people are more informed in this area.

Neil Fitzgerald
  • 374
  • 1
  • 2
  • 10
  • The suggestion from your post seems to suggest that multiple hashes could eventually reach a "common_think". It would be good to substantiate that with real-world data. – Omer Iqbal Jun 24 '14 at 03:08
  • 1
    I'm not sure what you mean. Suppose you have some chain where h(x) = y, h(y) = z. Then obviously both x and y will hash to z if you repeat h the appropriate number of times. Alternatively, if you mean, can there exist distinct x and y such that h(x) = h(y), that's also a yes. That's a collision, and all hash functions have them, because they're reducing a large space of strings to a smaller one. – Neil Fitzgerald Jun 24 '14 at 03:13
  • Of course. However, the question is how likely are you to end up in such a situation when you were starting with two passwords with two different hashes. SHA256 has 10^77 possibilities and I am struggling how two distinct passwords with two distinct hashes will collide in 10,000 or 100,000 iterations. That would be quite a weak hash function, wouldn't it be? – Omer Iqbal Jun 24 '14 at 03:18
  • 1
    @OmerIqbal They're arbitrary numbers to illustrate a point. Even if they both become "common_thing" at some point in their iterations, they're (likely - I'm not even really an enthusiast, let alone an expert) not going to produce the same result unless it happened at the same iteration in both cases. – Anthony Grist Jun 24 '14 at 09:17
  • I've always believed it to be true that `t` will have more bits of identifiable information than `h(t)`, but I realise that I don't know why I believe that and can't immediately find any literature on the subject. However, if `information(t) >= information(h(t))` in all cases, then eventually you would reach 0 bits of unique information. I am not, however, a cryptographer, I just know enough about the field to know I don't know anything about it. – Phoshi Jun 24 '14 at 09:20
  • 1
    `the hash would reach "common_thing" at different points.` how do you know this? Are you claiming that it's impossible for 2 different hashes to hash to the same value? – Cruncher Jun 24 '14 at 15:42
  • @Cruncher Example: `h(a) => b`, `h(b) => c`, `h(c) => a`; `h(1) => a`, `h(2) => b`; they both hash to the same value but at different iterations. – Izkata Jun 24 '14 at 17:07
  • @Cruncher Sorry, I meant that would probably be the case, not definitely. Izkata's case is more likely than them racing the same value at the same point, although of course it's possible. – Neil Fitzgerald Jun 24 '14 at 17:38
  • @Izkata I get that. I know that it's possible, and obviously more likely. But is it impossible for `h(a)=>b h(b)=>c h(c)=>d` and `h(e)=>f h(f)=>g h(g)=>d`if yes/no, how do you know? That is `h(h(h(a))) = h(h(h(e)))` but `h(a) != h(e)` – Cruncher Jun 24 '14 at 17:52
  • Okay, I see what you're asking! Sorry for the confusion. Well, maybe and maybe not. After the first use of h, you're reducing the input space to the output space. It's possible that every value in the hash space maps to a different value in the hash space, and in that case, you actually can't have this kind of collision. – Neil Fitzgerald Jun 24 '14 at 18:00
  • @NeilFitzgerald Right, that's what I meant by 2 different hashes hashing to the same value(perhaps confusing as I used hash both as a verb, and as a noun to mention something that is in the output space of a hash algorithm). `It's possible that every value in the hash space maps to a different value in the hash space` This is what I'm curious about. Whether or not to get collisions you **have** to venture outside of the output space. Probably algorithm specific, but it seems like it would be a nice property to have. – Cruncher Jun 24 '14 at 18:03
  • 1
    In fact, if you do have such a collision, that means there is some value y in the output space such that there is no x in the output space such that h(x) = y. (Sorry, that's a mouthful.) Which doesn't necessarily prove anything; none of the standard cryptographic properties of hashes (preimage resistance, collision resistance) would confirm or deny that. – Neil Fitzgerald Jun 24 '14 at 18:03
  • Yeah, I guess I don't know about that. I agree that it's interesting, but I don't know enough about the implementation of individual hash functions to answer that one way or the other. Sorry! I'd like to know too. – Neil Fitzgerald Jun 24 '14 at 18:04
  • @NeilFitzgerald hashes which can only be hashed to from outside of the output space sound interesting too! – Cruncher Jun 24 '14 at 18:05
4

The issue related to the static salt, is not that there are increased chances of collision (there is not). Running the same algorithm repeatedly will not lead to an increase in collisions (with a good hashing function) just so long as all passwords have the same number of iterations.

The real issue is a game of odds.

If an attacker knows the code used to generate the hashed password (the mechanism used to inject the salt, and the number of iterations through the hash loop), then the attacker can reproduce your algorithm. Assume that the attacker also knows your actual hashed result. Can they work out your actual password?

With the algorithm, the attacker can then feed a dictionary of common passwords in to the algorithm, and look for matches to your hashed result. Admittedly, it may take a long time, but, eventually the attacker could guess your password.

The thing is, that you may have a 'hard to guess' password.... but, what about everyone else in the system. Your hashed password is just one of many. If the site is 'stack exchange', then there are thousands of users. If they all use the same salt, then, a dictionary attack like this can check for the match against all the users' hashed passwords. If they get a match, then they have also guessed that user's password. It's a numbers game. If there are 10,000 users in a system, then the odds of finding an easy password are improved by 10,000, likely much more than that.

Now, if you use a unique salt for each user, getting a matching hash for a user is useless unless that user also has the same salt you have in your algorithm.

In other words, with a unique salt, you can only attack one user account at a time. With a static salt, you can attack them all at once.... and the chances of a hit are much greater.

rolfl
  • 171
  • 1
  • 4
3

In theory, yes, that will definitely happen, for sufficient definitions of "a whole lot" and "many". In practice, no, that will never happen. The point of a secure cryptographic hash function is that "a whole lot" and "many" are ridiculously large numbers that cannot be reached. If you can reach them, either there's a severe attack against the algorithm, and you shouldn't use it, or it's not big enough, and you shouldn't use it.

For example, take MD5. A hash is 128 bits long. Ideally, on average you should find one collision with 264 tries, or about 18 quintillion. This would be very expensive, but possible to do with enough hardware. However, MD5 has suffered catastrophically from cryptanalysis, and there are attacks that can find collisions in moments.

On the other hand, there's SHA-256. It's 256 bits -- it's not possible to calculate 2128 hashes, and there aren't significant attacks against it.

So this is not a concern if you're using a decent hash like SHA-256 or SHA-512. It also shouldn't be a concern even if you're not. I can't think of why a user would try to create a password that intentionally causes a collision, and it's not practical to use a significant number of iterations, unless you want it to take decades of calculation for your users to log in. (This is aside from the considerations mentioned in the other answers.)

Essentially, my question is: With a non-randomized-per-password salt, does the chance of a collision go up with every iteration added?

Yes, it does, from "so close to zero it will never happen" to "still so close to zero it will never happen".

svick
  • 231
  • 1
  • 2
  • 7
Matt Nordhoff
  • 3,448
  • 1
  • 21
  • 16
  • “I can't think of why a user would try to create a password that intentionally causes a collision, …” A user wouldn’t. The point is that, if Fred’s password is “confused.cow.carburetor.paperclip”, and if “1234567” hashes to the same value as “confused.cow.carburetor.paperclip”, then an attacker can login to Fred’s account with the password “1234567”. – Scott - Слава Україні Jun 24 '14 at 16:02
2

As far as the theory goes, each output of the function amazing_hash has a fixed size, and is mapped to another output of the hash function, and another, and so on.

So, leaving aside the very first input, you have a function from a finite set to itself. The function may or may not be bijective, but it's not a required property of a hash function that it is. The domain of the function necessarily is divided up into:

  • one or more cycles, plus
  • zero or more "tails", that is to say sequences that lead into one of the cycles or into another tail. When two tails join consider it arbitrary which one is leading into the other one, but I'm going to use the number of tails later, so it has to be defined that way :-) Define also the "end" of a tail to be the point where it joins another tail or a cycle.

Every point, when iterated, is either part of a cycle to begin with or else it follows a tail, and the tails that tail joins, until it joins a cycle. That's a necessary property of a function from a finite set to itself. A path cannot run forever without entering a cycle, because there are only finitely many values and so eventually it must repeat one. You can therefore imagine the function visually as a lot of circles, with branches sticking out of the sides of them. The branches all lead into the circles.

With one iteration (that is, one further hash after the initial hash), how many collisions are there? Well, it's related to the number of tails, since the end of a tail is a place where two non-equal values have the same hash. Each join point implies there is a number of colliding values equal to the number of structures joining at that point. Every tail ends in a join, so if we define "tails" and "collisions" carefully then the number of collisions is just the number of tails.

After two iterations, how many collisions are there? It's the number of tails (since once two values are collided they stay collided), plus the number of new collisions caused by the extra iteration. The extra iteration causes a collision if "both sides" of a join point are at least 2 nodes long. So where two tails join they must both be at least 2 nodes long, and where a tail joins a cycle it must be at least a 2-cycle.

After n iterations, further collisions are generated by tails at least n nodes long, and n-cycles.

In the extreme case, a hash function that is bijective has no tails. This is theorem one of finite functions: every permutation partitions its domain into cycles. Then it should be easy to see that no matter how many iterations you do, there are no collisions (other than those caused by the initial hash, of course). Each point just moves around its cycle. By moving every point an equal number of steps around a cycle, they're still all in different positions.

Otherwise, to begin with the more iterations you do, the more collisions get generated as the tails merge into the cycles. However, there is an upper limit to this process, because every tail and every cycle has a finite length. Eventually you will cause no more collisions when you do more iterations. That won't happen until you've reached the length of the longest tail in your function.

This is all in theory: in practice the longest tail might still be rather larger than you have time to iterate. If so you would keep increasing the number of collisions for as long as you can practically run.

However, the number of collisions that each iteration introduces is still very small compared to the hash space, so small that it's incredibly unlikely you'll encounter a collision this way. How do we know this? Because if it wasn't, then Floyd's cycle-finding algorithm would be an effective means to find collisions in the hash function. The hash function would not be "amazing" per the assumptions of the question, it would be known to be broken :-)

Steve Jessop
  • 2,018
  • 11
  • 14
1

Would, in theory, many passwords collide?

In your example, what you are looking at is how easy is it for two inputs to get the same output, known as collisions. This is an important area in cryptography, is used to evaluate the strength of the algorithms, and varies for every algorithm.

The number of iterations does not matter, even in your example, because all that is interesting is the following:

hash(n,mznxc,mnzxc)     > common_thing 
hash(ioasjdknckauyieuh) > common_thing 

Since you are taking the output of a hash and then pushing it back as the input, the input and output are the same size (except for the very first input).

Algorithms, such as MD5, are known to have been shown some collision vulnerabilities. MD5 collisions are also known to have been exploited in the Flame virus, even though when MD5 was invented as a secure replacement of MD4. And thus, cryptography relies on review and research of a large number of cryptographers to figure out which algorithms have not yet shown any weaknesses.

Thus, at any point in time, you have to look at which hash functions do not have any known vulnerabilities, and design your system such that in the future, you can rollover the hash function (i.e. be crypto-agile).

With a non-randomized-per-password salt, does the chance of a collision go up with every iteration added?

Non-randomized salts do not solve this problem. They solve the issue of rainbow tables and password collisions (i.e. if two users have the same salt, same password and same number of iterations, you get the same hash.) From a design perspective, you must assume that the salt and number of iterations are publicly known (even if you manage to hide them).

Given that many users share the same passwords ("123456", "password", "abcdefgh", etc.), with non-randomized salts and iterations, predicting passwords becomes much easier using frequency analysis due to same hashes resulting from same passwords.

Omer Iqbal
  • 584
  • 2
  • 10
1

So, put simply: what are the odds of input A and input B hashing to the same thing after N iterations? (The salt doesn’t change anything about this.) Because H(A) and H(B) should be uniformly randomly distributed for a good hash function, this is about the same as the odds of H(A) and H(B) not colliding times the odds of H(A) and H(B) not colliding after N − 1 more rounds. For SHA-256 with 2256 possible outputs (ideally), that’s 1 − ((2256 − 1)/2256)N &approx; 2.59 × 10−59 for 3 quintillion iterations.

That’s not very likely.

You can also estimate the probability of going into the loop that other answers have mentioned with the birthday problem approximation, though, as other answers have also mentioned, that won’t cause a collision unless the two inputs are synchronized in this loop.

For SHA-256 and 3 × 1018 iterations again, that’s 1 − e−(3 × 1018)2/(2 × 2256) &approx; 3.89 × 10−41.

Also not very likely.

Ry-
  • 254
  • 1
  • 10
0

No.

It's all about entropy.

Given a "non-broken" cryptographic hash function, it produces N bits of pseudorandom output for any input of length M, which is nothing but extracting N bits of entropy out of those M bits.1
Of course this only works in a reasonable way if M >= N, since you can hardly extract N bits of entropy if the input doesn't contain that much at all.

The likelihood of collisions is described by the well-known birthday paradoxon (which ironically does not at all work with actual birthdays, since these are very unevenly distributed!).
The likelihood for users choosing identical passwords is much, much higher than that. Or put differently, the entropy contained in a user password (even a relatively good one) is abysmal.

Salt adds entropy to the input. Which means that the first iteration with a static (I assume that "static" still means "per user"!) actually reduces the likelihood of a collision as compared to the plain password.

Now what happens on the second, third, and ith iteration? The hash function takes as input the output of the previous round, which contains N bits of entropy (which already includes the entropy in the static salt, so appending the salt again still leaves it at N), and it outputs N bits of entropy.
The CPU spins, bits flip around, the numbers look different, but nothing changes as far as entropy or collision likelihood is concerned. N bits in, N bits out.

So no, it does not get worse (but also, it doesn't get better).


1 This is, for example, the reasoning behind DJB telling you to hash the key that you get from your curve25519 function a few times (in addition to making an attack on the EC much harder). The curve has a strength of ca. 128 bits, and the function outputs a string of 32 bytes. Which means you have a blob of 256 "random looking" bits with only 128 bits of actual entropy inside, but you have no clue where it is. Which bits do you use? Hashing the 256 bits into 128 bits solves the problem elegantly without risking to throw away useful bits.
Damon
  • 5,211
  • 1
  • 20
  • 26