Testing a password for similarity to previous hashes

Question

This question is a follow-up to a comment by Tobias Kienzler to an answer to "What is the purpose of the “Password minimum age” setting?".

Consider a password policy that new passwords cannot be identical to the current password or any of the user's five passwords that precede the current password. Nor can they be similar to the current password or any of these previous passwords, where "similar" is defined as Levenshtein distance 2 or less. (For the sake of this question, assume no complexity requirement other than six or more characters, one letter and one digit.) This protects against, say, "bush41", "clinton42", "bush-43" (which should be stopped), "obama44" (which should be allowed), and either "clinton 45" or "bush 45" (either of which should be stopped).

The ordinary password change flow requires the current password so that someone can't run a password change with a Firesheeped session cookie. The alternate password change flow sends a reset token through another channel, such as e-mail, voice, or text. In the ordinary flow, I can test whether the new password is similar to the current password by using the Levenshtein distance function from a library. But I don't have the previous password; all I have are salted PBKDF2 hashes. Nor can I test for similarity to the current password if the user has entered a password reset token instead of the current password.

In the aforementioned comment, Tobias Kienzler recommended to "compute all mutations of the new password within the unacceptable edit distance, calculate their hashes and compare them to old password hashes (with the respective salts)." In an answer to "Password security", an answer to "Does Facebook store plain-text passwords?", and an answer to "How can a system enforce a minimum number of changed characters in passwords, without storing or processing old passwords in cleartext?", apsillers, Michał Šrajer, and Ori suggested something similar.

But I don't see how it's tractable on present hardware to compute the set of all strings with Levenshtein distance 2 from a given 20-plus-character passphrase and hash them all, especially with an intentionally slow hash like PBKDF2 and a large character set like Unicode. It would appear to require around h(nc)^s hashes, where h is the length of the history (here 5), n is the number of characters in a password (and thus the number of positions in which a replacement or insertion can occur), c is the number of characters in the character set (95 for printable ASCII or much larger for printable Unicode), and s is the maximum edit distance to call two passwords "similar".

So what is this computationally feasible way to test a password for similarity to previous passwords without storing previous passwords with reversible encryption? Would it require applying any or all of the following simplifications to the definition of similarity, and if so, would those simplifications compromise the password system's security in practice?

Require all characters in passwords to be printable ASCII (space to ~).
Reduce the Levenshtein distance of a "similar password" to 1.
Compare only the current password for similarity and the previous passwords only for identity.
Willfully sacrifice availability to protect password confidentiality. When a user requests a password change, lock out the user for the remainder of the day while the system enumerates all possible similar passwords, and disallow other users from changing their passwords if too many users' similarity searches are queued.

The password string you will be testing is *bytes*, not characters, so the size of the Unicode set isn't really relevant. (Of course, a 20-byte Unicode password is likely to give you 40 or more bytes of input if all the characters are from something other than the first 128.) — Bob Brown, Jan 10 '15 at 23:38
@BobBrown Levenshtein distance operates between "two sequences" according to Wikipedia, not necessarily two sequences of 8-bit bytes in UTF-8 encoding. The password comes in over the wire as bytes but gets decoded to characters. Replacing two 16-bit elements in a sequence of UCS-2 characters (the BMP subset of UTF-16) counts as a Levenshtein distance of 2 characters. — Damian Yerrick, Jan 11 '15 at 00:44
Unicode makes my hair hurt. Replacing one byte of a UCS-2 pair changes the character for a Levenshtein distance of one, but so does replacing both bytes. It's not that neat with UTF-8; I guess that means you must canonicalize the Unicode before computing the distance. It's a good thing you know more about this than I do. — Bob Brown, Jan 11 '15 at 15:40
Related (possibly duplicate): https://security.stackexchange.com/q/71756/117921 — zypA13510, Mar 22 '21 at 08:08

Bob Brown · Answer 1 · 2015-01-11T20:09:12.700

I'd like to suggest that you consider why people change passwords. It could be because the user believes the password is compromised, or "because security;" the system is forcing regular changes in the belief that such changes somehow improve security. I would argue that a strong password needs to be changed only in the case of compromise. Further, there are thousands of weak passwords with a Levenshtein distance > 2. I could alternate through: loveyou1, hateit2, scroomall3, Shannon4, and dammit5 and be completely within the rules you put forth. Yet, all of these will fall to dictionary-based guessing in seconds.

You have correctly computed that you cannot do what is proposed within a reasonable time.

If you limit the key space to the printable ASCII characters, you violate one of Shannon's principles (the key space should not be artificially constrained) and at least potentially make your passwords less secure. "Less secure" is at least mostly theoretical, because people will at least mostly use the characters on the keyboard. Even so, we reject this alternative because it reduces the key space.

Reducing the Levenshtein distance to one doesn't really help you, as my examples show. So we reject it.

We reject locking out users as a violation of the property of availability.

You are left with comparing the current password for similarity and the previous passwords for identity, and that is what I recommend. You can't even do that when using a reset token, so you have to compare all for identity, and that is OK. People should not "change" their passwords to their current values.

Beyond all of that, I'd suggest you revisit why you want to force password changes in the absence of evidence of compromise. Mostly you will get loveyou1, hateit2, scroomall3, Shannon4, and dammit5, or things very similar.

Finally, it might be a better use of programming time to reject the most frequent 1,000 or 2,500 passwords instead of worrying about password history. In a high-security application, you might reject the top 10,000, but for ordinary users that's tantamount to rejecting everything. One such password list is here: https://xato.net/passwords/more-top-worst-passwords/

There is one valid reason to require regular password changes, see [this question](http://security.stackexchange.com/q/4704/3272): If someone obtains your password without detection, they can use it all the time until you change it. Yet another reason why one should simply use two-factor authentication, where this problem cannot occur (assuming the second factor is not clone-able) — Tobias Kienzler, Jan 12 '15 at 08:39
@TobiasKienzler: I disagree with respect to forced password changes. Depending on how the intruder got the password, he might simply get the new one in the same way. What is important in the case of masquerade attacks is that the *attack be detected.* Otherwise, one cannot undo, not even assess, the damage. That is why Unix tells you the last login time, Gmail has a login history with IP addresses, etc. (You are correct with respect to two-factor authentication.) — Bob Brown, Jan 12 '15 at 12:39
I can perfectly agree with your point, personally I dislike all those basically awkward password policies... I just thought it worth mentioning that depending on your scenario a regular password change is no bad idea - do you actually always check your last unix login time? The IP address log may however be more helpful indeed, but still requires user attention... — Tobias Kienzler, Jan 12 '15 at 12:42

Anti-weakpasswords · Accepted Answer · 2015-01-11T22:20:00.217

So what is this computationally feasible way to test a password for similarity to previous passwords without storing previous passwords with reversible encryption? Would it require applying any or all of the following simplifications to the definition of similarity, and if so, would those simplifications compromise the password system's security in practice?

One should be wary of asking how to do something before one asks if that something should be done at all.

Anything you can do to make it computationally feasible for you to quickly test a variety of candidate passwords (in your case, similar passwords), given only your own password hash database, ALSO makes it even more computationally feasible for an attacker to quickly test a variety of candidate passwords (in their case, some cracking dictionaries and some rulesets) against your leaked password database.

Every answer you can find to this issue is a weakness attackers can use against you, and even low end attackers are already gaining an advantage over you by using GPU's to test candidate passwords far faster than your server can handle in the first place.

To sum up: Your problem is 100% identical to the problem any attacker with a leaked password hash database faces.

score -1 · Answer 3 · answered Jan 12 '15 at 08:56

Personally, I truly hate such requirements and would prefer time spent on developing for this spent one actually efficient security measures such as two-factor authentication or even better simply using OpenID where you don't have to worry about the users' passwords at all and they don't have to bother coming up with yet another password according to more or less ludicrous requirements despite the site accessed being of "minor" concern to their privacy. That way you motivate them to have one very secure password, maybe even regularly change it "for real", because now they don't have to either remember which permutation of the default-password that barely exceed said change-requirements they used this time just in order so simply click the "reset password" button after the fifth frustrating "invalid username/password" message...

2FA fails if your user base isn't certain to own a smart card reader or subscribe to mobile phone service with unmetered incoming SMS. Just using OpenID fails for use on a private system, or if you intend to act as an OpenID provider, or if a user's OpenID provider goes out of business (MyOpenID, Hyves) or stops offering OpenID 2 in favor of OpenID Connect without dynamic client registration (e.g. Google). — Damian Yerrick, Jan 12 '15 at 17:04

Testing a password for similarity to previous hashes

3 Answers3

Linked