0

I am trying to come up with a way to measure the entropy of an easy-to-remember password built from a list of common English words, loosely based on this XKCD comic. I'd like to know if my math is correct or if my assumptions about "easy to remember" are flawed.

I'll consider the number of common English words in the dictionary to be the variable d.

I'll consider the number of words to use in the password to be the variable n.

If the US-English keyboard is considered to contain all the likely characters that would reasonably make up a password that is easy to remember, I count 96 total symbols that can be directly keyed, including uppercase and lowercase letters.

Those are:

TAB SPACE ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 ~`!@#$%^&*()_-+= {[}]|\ :;"'<,>.?/

I'll treat these characters as extra complexity that can be added to the dictionary words, and set the variable e to 96.

So that they are easier to remember, I'll only consider adding symbols to the beginning or end of a word, or as a word by themselves, but not arbitrarily placed inside of a dictionary word. That should mean there are 2*n+n+1 positions available for each symbol which is added. I'll use the variable s for the number of extra complexity symbols added.

The equation for the total possible combinations in use should then be:

combinations = d^n + e^(s*(2*n+n+1))

Therefore, the number of bits of entropy this kind of password provides should be:

bits = log2(combinations)

Is my math correct?

Are my assumptions about rules for an easy to remember password flawed?

Smack Jack
  • This question must have been asked 10 times already. I'll never say it enough, password entropy is an absolutely meaningless measure. The actual entropy of your password in an attack context depends on how much information has been leaked about you and your other password choices, and on probabilistic models of password creation, just as much as the actual 'storage size' of your password alphabet. Now to decide which question yours is a duplicate of... – Steve Dodier-Lazaro Aug 28 '16 at 11:22

2 Answers

1

I didn't check your combinations formula (as I explain below), but **if users choose between all possible combinations with equal probability**, then yes, you just take the base-2 logarithm of the number of possible combinations.
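As a minimal sketch of that calculation, assume each of the n words is drawn independently and uniformly from a list of d words (something the question's scheme does not guarantee by itself). The count of combinations is then d^n, and the entropy is n * log2(d). The function name `passphrase_bits` and the 7,776-word list size (the length of the standard Diceware list) are illustrative choices, not taken from the question.

```python
import math


def passphrase_bits(d: int, n: int) -> float:
    """Entropy in bits of n words drawn uniformly and independently from d words."""
    if d < 1 or n < 0:
        raise ValueError("d must be >= 1 and n must be >= 0")
    # log2(d ** n) == n * log2(d); the second form avoids building a huge integer.
    return n * math.log2(d)


if __name__ == "__main__":
    # Hypothetical list size: 7776 words, as in the standard Diceware list.
    for n in (3, 4, 5, 6):
        print(f"{n} words from 7776: {passphrase_bits(7776, n):.1f} bits")
```

Under the uniform-choice assumption, four words from a 7,776-word list come to roughly 51.7 bits. If the words are not chosen uniformly, the real figure is lower.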

But the most important part of a scheme like this is what I boldfaced: how do you get users to choose between all of the combinations with equal probability? The XKCD comic really fails to address that, and you too often see people fail to grasp that point. There is, however, the Diceware method, an older version of the same concept, which addresses this problem by instructing users to throw dice as part of the process for generating a passphrase.

The other thing that you should contemplate is whether your scheme of putting additional symbols on each word is actually worth it. The additional symbols aren't magical; they just increase the number of combinations, something you can do much more simply by using more words. Any gain you could get from increasing s, you can obtain by increasing n instead. This is why I didn't check your combinations formula: I'm skeptical of the value of the additional symbols.
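To put rough numbers on that comparison, here is a small sketch. It assumes words are drawn uniformly from a list of d words, and that one extra symbol is drawn uniformly from the question's e = 96 characters and placed uniformly in one of the question's 2*n+n+1 positions. The product e * (3n + 1) is a simplified count of my own, not the question's formula, and it is only an upper bound, because different placements can produce the same typed string when words are not separated. The function names and the 7,776-word list size are hypothetical.

```python
import math


def bits_from_extra_word(d: int) -> float:
    """Bits gained by appending one more uniformly chosen word from a list of d."""
    return math.log2(d)


def bits_from_extra_symbol(e: int, n: int) -> float:
    """Upper bound on bits gained by one uniformly chosen symbol and position.

    Assumes 3n + 1 positions (before or after each of n words, or standing
    alone in one of the n + 1 gaps) and e possible symbols.
    """
    positions = 2 * n + n + 1
    return math.log2(e * positions)


if __name__ == "__main__":
    d, e, n = 7776, 96, 4  # illustrative values only
    print(f"one more word:   {bits_from_extra_word(d):.2f} bits")
    print(f"one more symbol: {bits_from_extra_symbol(e, n):.2f} bits (upper bound)")
```

Under these assumptions, an extra word from a 7,776-word list adds about 12.9 bits, while one symbol added to a four-word phrase adds at most about 10.3 bits.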

Luis Casillas
  • Yes, it does assume equal probability of choosing random words. It also assumes adding extra characters adds something worthwhile. This is in fact why I want a formula, so I can find out what actually provides the most value, adding more words, adding symbols, choosing words from a larger dictionary, etc. – Smack Jack Aug 27 '16 at 23:53
  • BTW - I believe your comment about the unlikelihood of a person selecting truly random words really just means the value for **d** is smaller than it may seem at first. It is still a computable variable. – Smack Jack Aug 28 '16 at 01:03
1

Entropy essentially reflects the amount of unknown information. The idea is that the higher the entropy, the harder the password is to guess. To compute the amount of unknown information, your model must account for which information is already known to the attacker. This means you cannot base your model on how users create passwords; it must be based on how passwords are guessed, i.e. how much effort it would take to construct a specific password with common password crackers, or how coworkers might guess the password, depending on which attack vector you consider.

If you consider that the hashed passwords might be compromised by some external attacker and that offline attacks are possible, then you need to study the methods used in modern password crackers. These methods are usually based on huge lists of passwords captured during earlier attacks. The crackers then apply various commonly used modifications to these entries before trying brute force.

If you instead consider coworkers as an attack vector, you should also consider that these coworkers have specific knowledge about the targeted user which might help them find the correct password. You therefore need to add this information (names of parents, spouse, kids, dogs...) to the computation as well.

There are probably other scenarios you need to study as well, depending on your environment.

Steffen Ullrich
  • This isn't a question about any or every possible password, but a question about measuring the entropy of a specifically constructed password. Assuming the inputs are random, this method should describe a precise entropy relative to its available inputs. We know how the password is constructed--it isn't in an existing password list (except by pure chance). There is no assumption that an attacker knows anything about who has constructed it, only that they know how it was constructed, and what it was constructed from. It seems to me like the amount of unknown information is well known. – Smack Jack Aug 28 '16 at 04:36
  • For clarity, I'm not asking about how to attack a random or specific person's password, I am asking how to measure the work required to crack a specific type of password that a random attacker will need to perform to discover said password. – Smack Jack Aug 28 '16 at 04:41
  • @SmackJack: Since you make no specific assumptions about the attacker in your question I can only assume that the attacker is using common tools. Anyway: the important thing is that the model for computing the entropy must be based on how the attacker guesses passwords and not on how the user might create the password. – Steffen Ullrich Aug 28 '16 at 04:44
  • @SmackJack: "I am asking how to measure the work required to crack a specific type of password that a random attacker will need to perform to discover said password. " - exactly. And therefore you need to consider how the attacker will discover the password and not how the user might create it. – Steffen Ullrich Aug 28 '16 at 04:49
  • If I am handing the attacker a cheat sheet of how I created the password, they are already ahead of the game, since they will know this is not in their cultured password lists, and that they will not need to brute force every possible random combination. This is a mathematical question about what they are left with once they know all of that. It should be a known entropy: smaller than that of a randomly generated password, larger than that of previously used passwords, and, I would guess, still enormous. – Smack Jack Aug 28 '16 at 04:49
  • @SmackJack: attackers rarely need to brute force, and you also don't need to hand them a cheat sheet because they already have the commonly used cheat sheets and passwords. Again, if you want to compute how much unknown information is left in a password, you must understand how much information the attacker already has, i.e. the password lists and the methods used for varying the passwords. – Steffen Ullrich Aug 28 '16 at 04:54
  • @SmackJack this *is* a question about measuring password entropy because you ask about a completely trivial case of symbols taken from an alphabet. Just apply Shannon's formulae like everyone else, and remember not to make the (wrong!) assumption that all combinations are equally likely to occur. Your question is a duplicate, and the accepted answer for that other question contains the proper method for calculating entropy from the perspective of a service provider. – Steve Dodier-Lazaro Aug 28 '16 at 11:26
  • @SteveDL "I'll never say it enough, password entropy is an absolutely meaningless measure. The actual entropy of your password in an attack context depends on how much information has been leaked about you and your other password choices" --- I don't see how that can possibly be true. I haven't picked a password yet, and I may pick one that is generated randomly. – Smack Jack Aug 28 '16 at 23:41
  • @SteffenUllrich, if I am understanding you correctly, are you saying that it is not a valid question to ask how much inherent entropy a password has, because there is outside information that weakens what I think isn't known? If a truly random generator is used with the method I've described, how does a password list help in any way? What am I missing? – Smack Jack Aug 28 '16 at 23:46
  • @SmackJack: There is no such thing as "inherent" entropy. Entropy reflects the amount of unknown information, which depends on what is already known, in this case what is known to the attacker. Please study [How can we accurately measure a password entropy range](http://security.stackexchange.com/questions/4630/how-can-we-accurately-measure-a-password-entropy-range). – Steffen Ullrich Aug 29 '16 at 04:22
  • "Easy to remember" passwords are certainly not random. And many random password generators actually produce short passwords with specific patterns, because they need to be "easy to remember". If you use a truly random password -- let's call it a key, you still need to account for reuse. – Steve Dodier-Lazaro Aug 29 '16 at 08:28
  • Specifically, right now the way the Shannon formula works requires us to first discover all sources of information leakage and calculate the probability of each possible combination based on that leakage. This includes probabilistic models of password choices for a given service (the work done by Weir at CMU), and models of password reuse and cross-service leakage for a user identity (there's been a project on that at UCL, but it hasn't been released yet and I don't know if anyone is continuing it). – Steve Dodier-Lazaro Aug 29 '16 at 08:30
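
The Shannon formula referred to in the comments above is H = -sum(p_i * log2(p_i)) over all possible outcomes. The sketch below is only an illustration of why the equal-probability assumption matters. The function name `shannon_entropy_bits` and both probability lists are made up for the example, and real password-choice probabilities would have to come from a model of the kind the comments describe.

```python
import math
from typing import Sequence


def shannon_entropy_bits(probabilities: Sequence[float]) -> float:
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution, in bits."""
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]  # four words, all equally likely
    skewed = [0.7, 0.1, 0.1, 0.1]       # one favourite word (made-up numbers)
    print(f"uniform: {shannon_entropy_bits(uniform):.2f} bits")
    print(f"skewed:  {shannon_entropy_bits(skewed):.2f} bits")
```

With four equally likely words, the entropy is exactly log2(4) = 2 bits. Once one word is favoured, the entropy drops below that, even though the size of the word list has not changed.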