
It is known that people reuse their passwords across different sites. This reuse happens in two ways:

  1. using exactly the same password on two different sites, or
  2. slightly modifying the password from one site and using it on another; we call these similar passwords.

For case 1, comparing the passwords directly reveals the reuse.

For case 2, there is no obvious way to measure the distance between two strings. The most popular metric is the Levenshtein, or edit, distance. For example, the edit distance between "password" and "password1" is just 1, corresponding to the insertion of the character '1'.
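
For reference, the edit distance can be computed with the standard dynamic-programming algorithm; here is a minimal Python sketch (the function name is my own):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))           # row for the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]                           # distance from a[:i] to ""
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,                 # delete ca
                    curr[j - 1] + 1,             # insert cb
                    prev[j - 1] + (ca != cb),    # substitute ca -> cb
                ))
            prev = curr
        return prev[-1]

    print(levenshtein("password", "password1"))  # 1
    print(levenshtein("monday", "wednesday"))    # 5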

However, the Levenshtein distance doesn't always work well. For example, if the password for some site x.com is "monday" and the password for site y.com is "wednesday", the edit distance is 5. Assuming the attacker knows the password of x.com, i.e. "monday", and the positions where the edits were made for y.com, at least 26^5 variations of "monday" would have to be attempted.

But conceptually, "monday" has simply been replaced by another weekday, and there are 6 weekdays other than "monday". So in this case an intelligent attacker needs to try just 6 other weekdays, as opposed to 26^5 variations of "monday".

What would be a good way to measure the distance between two passwords?


2 Answers


There is no universally good way. What you describe is essentially the distance between passwords as measured in our minds, and there is clearly no direct method to compute that.

Second, what you want to measure depends heavily on the person, and often involves information known only to them. For example, one of your colleagues could use the names of his children on the different company servers. It is not possible to build a software solution to find this, but a hacker or a colleague who has this information could use it to crack his account.

What you can do is take a step down the NSA's path: although you can't spy on people's minds directly, you can use Big Data to emulate something very similar.

What you need is publicly available information on the net. For example:

  1. A thesaurus.
  2. Wikipedia (although there is no simple way to measure the link distance between two keywords, its database is downloadable and you can build a script to analyze its link connectivity).
  3. Or you could simply run automated Google searches through the Google search API and compare hit counts: for example, if the first password is "apple" and the second is "orange", then Hits("apple") * Hits("orange") / Hits("apple", "orange")^2 has to be below an experimental limit set by you (a sketch of this test follows the list).
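
A minimal sketch of that hit-ratio test, assuming a hypothetical hits() wrapper around whatever search API you use (the real API calls, quotas, and authentication are out of scope here):

    def looks_related(pwd1: str, pwd2: str, hits, limit: float = 1000.0) -> bool:
        """Co-occurrence test: if two terms appear together on the web far
        more often than independence would predict, treat them as related.
        hits(*terms) is assumed to return the hit count for a query that
        contains all the given terms; limit is the experimental threshold
        mentioned above, which you have to calibrate yourself."""
        together = hits(pwd1, pwd2)
        if together == 0:
            return False                   # the terms never co-occur: unrelated
        ratio = hits(pwd1) * hits(pwd2) / together ** 2
        return ratio < limit               # small ratio = strong co-occurrence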

But beware: don't send queries containing the passwords to an untrusted public cloud; that would be a very serious security breach! Of course, which public cloud counts as trusted depends only on your own viewpoint, considerations, and responsibility. For me, none would be.

In your place, I would do the following:

  1. Get a Wikipedia mirror (they offer a simple MySQL database dump that is publicly downloadable).
  2. Create a link-distance map (this is conceptually very simple, although it may be big).
  3. Map each of the two passwords to its nearest Wikipedia article title (this probably needs a massive Levenshtein comparison, so you will need a lot of CPU).
  4. Finally, use the following formula (sketched in code below the list): D(pwd1, pwd2) = Levenshtein(pwd1, Lev_nearest(pwd1)) + Wiki_Link_Distance(Lev_nearest(pwd1), Lev_nearest(pwd2)) + Levenshtein(pwd2, Lev_nearest(pwd2))
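
As a sketch, the formula in step 4 translates to something like this; lev_nearest and wiki_link_distance stand for the index from step 3 and the map from step 2, and levenshtein() is the function sketched under the question:

    def password_distance(pwd1: str, pwd2: str,
                          lev_nearest, wiki_link_distance) -> int:
        """Distance = edit distance from each password to its nearest
        Wikipedia article title, plus the link distance between those
        two titles in the Wikipedia link graph."""
        t1 = lev_nearest(pwd1)   # nearest article title by edit distance
        t2 = lev_nearest(pwd2)
        return (levenshtein(pwd1, t1)        # levenshtein() as sketched above
                + wiki_link_distance(t1, t2)
                + levenshtein(pwd2, t2))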

Extension: Wikipedia contains around a million entries, which makes the shortest-path search nearly infeasible. You would surely have to implement this in C++ and use very well optimized algorithms; thus, it will be hard. As an alternative, you can use only the most common words from Wikipedia (which can be found from its usage statistics). Although the English Wikipedia has around a million articles, a native English speaker knows only around some tens of thousands of them.
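
The link-distance part itself is a plain breadth-first search over the (pruned) link graph; a minimal sketch, assuming a links adjacency map extracted from the downloaded Wikipedia link tables:

    from collections import deque

    def wiki_link_distance(start: str, goal: str,
                           links: dict[str, set[str]]) -> int:
        """Smallest number of link hops between two article titles,
        found by BFS over the link graph; returns -1 if no path exists."""
        if start == goal:
            return 0
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            title, dist = queue.popleft()
            for neighbour in links.get(title, ()):
                if neighbour == goal:
                    return dist + 1
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, dist + 1))
        return -1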

Somebody should really write this; it would make a wonderful open-source daemon somewhere on GitHub :-)

– peterh

You're in luck; there is a good way to normalize this for publicly available information: WolframAlpha can be used to reduce strings into logical components that can be compared, resulting in a more accurate Levenshtein comparison.

Example for "Monday"

Once you "factor" the string into all of its possible meanings (day of week, Scrabble value, etc.), you can use those elements as a new comparative value.

For private information, such as a sibling's name, you will need to follow a similar method, rebuilding Wolfram's structure for your proprietary store. Wolfram has an API available that exports results in JSON and may help align your thoughts with what's needed to create such a private repository.
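
As a rough illustration, the sketch below queries Wolfram's Full Results API and compares two strings by the overlap of the result pods each one produces; the endpoint, parameters, and JSON layout are my reading of the API documentation, and YOUR_APP_ID is a placeholder:

    import requests   # third-party HTTP library

    API_URL = "http://api.wolframalpha.com/v2/query"
    APP_ID = "YOUR_APP_ID"   # placeholder: obtain one from the Wolfram developer portal

    def pod_titles(text: str) -> set:
        """Titles of the result pods WolframAlpha returns for a string --
        its "factored" meanings (a weekday yields date-related pods, etc.)."""
        resp = requests.get(API_URL, params={
            "appid": APP_ID, "input": text, "output": "json"})
        resp.raise_for_status()
        pods = resp.json()["queryresult"].get("pods", [])
        return {pod["title"] for pod in pods}

    def semantic_overlap(a: str, b: str) -> float:
        """Jaccard overlap of the two strings' pod titles: 1.0 means the
        strings are interpreted identically, 0.0 means no shared meaning."""
        ta, tb = pod_titles(a), pod_titles(b)
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    print(semantic_overlap("monday", "wednesday"))   # high: both are weekdays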

– makerofthings7
  • But if you try "M0nday", it doesn't work. Passwords are not always exact English text. – Curious Dec 16 '14 at 05:20
  • @Curious - If you want to include an additional, non-standard, non-scientific vector, you need to add that yourself (perhaps in the *private information* parse stage). L33t-speak counts as non-public/standard info that you will need to parse, bifurcate, and test the permutations of. – makerofthings7 Dec 16 '14 at 05:39