15

Some random password generators support the generation of pronounceable words of any given length n. Assuming that words are derived using 26 lowercase letters, the number of possible words of length n is 26^n. Not all of these words are pronounceable. My question is how many pronounceable words are possible of any given length n?

The count of pronounceable words of length n is obviously less than 26^n, and therefore the search space for the attacker is less than 26^n. If there is no exact answer, is there any upper bound on the count? Is setting a pronounceable word as a password secure, or is this just a fancy feature of password generators? What should be the minimum length?

In the context of pass-phrases, I think of them as a special case of pronounceable words, formed by concatenating more than one common English word. Assume that words of length n are derived using 26 lowercase letters.

Number of n len words (26^n) > Number of n len pronounceable words > Number of n len pass-phrases > Number of n len English words.

Am I correct?

Note: I have read the argument for setting a long pass-phrase such as "horsecorrectbatterstaple" as a password and don't want to start it again :)

Curious
  • It depends on the definition of pronounceable and the used language in my opinion. Aren't we talking about pass phrases instead of passwords? I think there should be a balance in having a long pass phrase that can be remembered (pronounceable) by a user to avoid sticking post it stickers on the monitor. – Jeroen Nov 12 '14 at 05:48
  • 1
    A passphrase is a concatenation of common English dictionary words, e.g. "horsebatterystaple". A pronounceable word is any word that is pronounceable, e.g. "oglucuplimk". A passphrase can be considered a special case of a pronounceable word. Anyway, this doesn't answer my question. – Curious Nov 12 '14 at 05:53
  • Good question. I guess you could just look at algorithm that generates those words (I'd assume it's simple), and get the formula. For simple algorithms it could be something like 20^(n/2)*6^(n/2), which means a 10 letter password would be about as strong as 14 letter pronounceable password. – domen Nov 12 '14 at 08:37

3 Answers

10

Pronounceable words are more-or-less sequences of syllables. What constitutes a syllable depends on the language, including the language variant (British, Scottish, American, Indian... versions of English are not rigorously identical). So we will make some approximations.

Let's suppose that we want two-letter syllables, always a consonant followed by a vowel. We also want to avoid ambiguous syllables: there shall be an injective mapping from pronounced syllables to letters; thus, we will not use "c" or "q", relying only on "k" and "s". We end up with:

  • 18 consonants: b, d, f, g, h, j, k, l, m, n, p, r, s, t, v, w, x, z
  • 6 vowels: a, e, i, o, u, y

Note that this entails pronouncing "hi" as in hit, not high; "ge" as in get, not gel; and so on.

We then end up with 18×6=108 unambiguous two-letter syllables. The number of possible passwords of length n is then 108^(n/2), i.e. about 10.39^n. If you want 10-letter passwords:

  • Sequences of 10 random letters: 141167095653376 choices
  • Sequences of 5 random syllables: 14693280768 choices
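The two counts above can be reproduced with a minimal sketch. The function names are illustrative only, and the consonant and vowel sets are the simplified ones chosen in this answer, not those of any particular generator:

```python
# Sketch: count candidate passwords under the simplified syllable model above.
CONSONANTS = "bdfghjklmnprstvwxz"  # 18 consonants (no "c" or "q")
VOWELS = "aeiouy"                  # 6 vowels


def random_letter_count(n: int) -> int:
    """Number of length-n strings over the 26 lowercase letters."""
    return 26 ** n


def syllable_password_count(n: int) -> int:
    """Number of length-n passwords made of n/2 consonant+vowel syllables."""
    if n % 2:
        raise ValueError("n must be even: each syllable is two letters")
    syllables = len(CONSONANTS) * len(VOWELS)  # 18 * 6 = 108
    return syllables ** (n // 2)


print(random_letter_count(10))      # 141167095653376
print(syllable_password_count(10))  # 14693280768
```

For 10 characters the syllable scheme therefore covers roughly one ten-thousandth of the random-letter space.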

Compare this with the total number of possible English words, which has been estimated at about 470,000 (only). For passphrases, the number of passphrases of a given length or up to a given length is necessarily greater than the number of words of the same length, since a single word is a kind of passphrase. However, your inference is correct: for a given length in characters, independent random letters provide the most room for entropy, followed by random syllables, followed by sequences of random words ("passphrases"), followed by single words.


The story does not stop there, though. The whole notion of passwords revolves around scarce resources. Namely:

  • The password must be remembered.
  • The password must be typed.

Thinking in terms of "number of letters" relates mostly to the second resource: more letters imply more typing effort. Note that the character kind matters: if the user must use his smartphone, then he will much prefer a sequence of lowercase letters to a password which mixes letters and digits or distinct casings.

Using "pronounceable" passwords is a trade-off: we accept using more letters, so that we may use a password that better fits our echoic memory. Note that the number of possible "syllabic passwords" of length n is 10.39^n, very close to the number of digits-only passwords of length n (10^n); but many people will find it much easier to remember a random syllable than a random two-digit integer.

Whether echoic memory is the best compromise depends on the subject. Some people work better with pictures than sound (that's called iconic memory). There are even some people who are most at ease with numbers, and would prefer n random digits over n/2 random syllables. One of the main points of the famous comic is that most people really remember stories, and sequences of meaningful words are the substance stories are made of. We may add the following points:

  • What is the best fit for a given user is not necessarily the optimal trade-off for another. Thus the need for several password generation methods, so that each user may use a method that works best with his brain and fingers.

  • Users themselves are bad at making such decisions, because they tend to think that a password that "looks complex" is more secure. In fact, "password complexity rules" such as requiring letters, digits, mixed cases and punctuation signs siphon out the scarce resources of memory and typing very fast, and don't provide that much entropy. These rules are often a bad bargain, but many security professionals love them, and parrot the received dogmas about their necessity, because the rules make for visible security (they give a lot more feeling of security than actual security).

  • All such analysis relies on independent, random and uniform selection of the password components. Letters, syllables, words in a passphrase... must be generated with a computer (or dice or some other physical device), not with a human mind. Humans cannot do good randomness. The password must be generated, then the story built on the result, and not the other way round. There must be random generation that the users accept. This is the trickiest point.


Last but not least, there is a disadvantage to pronounceable passwords: they may be pronounced. Users must be made well aware that though the password can be uttered, it must not be actually spoken, since it may then fall in malicious ears. The whole idea of pronounceable passwords is that users speak them in their heads. If a password is easy to pronounce, then it is also easy to share orally (e.g. with some co-workers), which is, in all generality, a rather bad thing.

User education, as usual, is what matters most. Regardless of how passwords are generated, security won't be achieved if the users are not informed about how passwords shall be used, and what they should not do with them.

Tom Leek
  • Just want to add one more point to the nicely written answer. I think the number of pronounceable passwords is slightly more than 10.39^n. Pronounceable passwords of length n are not only formed by concatenating 2-letter syllables but can also be formed by concatenating 3, 4, ..., n-letter syllables. Therefore, the count will be a summation over length n. I understand that with this method some passwords will be counted more than once. – Curious Nov 12 '14 at 14:21
  • Yes, I have used a minimizing approximation: I have deliberately restricted myself to a subset of "pronounceable passwords", so that I could more easily compute the number of combinations. The subset is still sufficient to make the main point that the number of pronounceable passwords far exceeds the number of existing English words. – Tom Leek Nov 12 '14 at 14:23
  • What approach do the available pronounceable password generators use? If the concatenation of 2-letter syllables is followed and the user believes that a length of 12 is enough, then the resulting entropy is only about 10^12 ≈ 2^40. Users should be educated that just generating longer pronounceable passwords is not enough. In this case the length should be much longer to achieve the same security as that of a random password of length 12. – Curious Nov 12 '14 at 14:33
  • 1
    40 bits of entropy are not bad, if you use the passwords for, say, user authentication on a server that uses proper password hashing like bcrypt with a high enough iteration count. If the server uses 1 second worth of CPU to check a password, an attacker with 1000 PCs will need 15 years on average to crack a password with 40 bits of entropy, which should be enough to deter him. – Tom Leek Nov 12 '14 at 14:50
  • Fair enough, but I was assuming the passwords are hashed using MD5 or other algorithms. – Curious Nov 12 '14 at 14:54
  • 3
    I'd say that if passwords are hashed with a weak algorithm then you should fix that first. This is only technology -- far easier to do than changing users' minds. – Tom Leek Nov 12 '14 at 14:58
10

Hard question to answer exactly. I'm going to refer to Theodore Ts'o's pwgen (v2.07) implementation exclusively here (pwgen -A0).

These pronounceable passwords use "phonemes" as "symbols", rather than single characters. In pwgen (which is biased towards English), a phoneme can be 1 or 2 characters. There are 40 defined (in pw_phonemes.c): 25 are a single character (a-z, except "q") and 15 are pairs (diphthongs). The average is 1.375 characters per phoneme (closer to 1.425 in use, due to consonant/vowel alternation).

Phonemes aren't combined randomly; that's the trick, of course. There are rules which make the end results pronounceable. For pwgen we have (roughly):

  1. some phonemes cannot start a word (2 phonemes excluded)
  2. some phonemes cannot follow a vowel (for 13 vowel phonemes, 8 phonemes excluded)
  3. having picked a consonant pick a vowel next
  4. having picked a vowel: after a previous vowel, or on a diphthong, or randomly (60%) pick a consonant next
  5. otherwise allow another vowel next
  6. a diphthong (2 characters) cannot be chosen as the last character (the most obvious side effect is a password will never end with "q", since q only appears as the diphthong "qu".)

(If you can formulate the exact number of permutations based on that, well done!)

A "symbol" for an [a-z] password is a single character, for a pronounceable password it's a phoneme of 1 or 2 characters.

For an [a-z] password of length N, there are 4.7 bits (log2(26)) per symbol; the number of possible passwords is 26^N, or about 2^(4.7*N), i.e. 4.7 bits per character.

For phonemes we have 5.3 bits (log2(40)) per symbol; the number of possible passwords of n symbols is 40^n, or about 2^(5.3*n) (3.9 bits per character). A phoneme password of m symbols will (ignoring any deviation caused by the above rules) be an average of 1.375m characters.

The maximum number of possibilities for the two types of password (which have on average the same length n=1.375m) can be approximated by 26^(1.375m) and 40^m. The former grows quicker *, which supports your assertion (the count of pronounceable words of length n is obviously less than 26^n).

At a minimum, a pronounceable password created this way should be about 20% longer than a straight [a-z] random password in order to have a comparable entropy. The presumed advantage is that pronounceable probably means more memorable, so for the human a longer password may actually be easier to memorise.
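The per-symbol and per-character figures above follow from a couple of logarithms. This is only a sketch of the unconstrained upper bound (the pronounceability rules are ignored), and the variable names are illustrative:

```python
import math

# Average characters per phoneme quoted above for pwgen's 40 phonemes.
AVG_CHARS_PER_PHONEME = 1.375

bits_per_letter = math.log2(26)    # one [a-z] symbol is one character
bits_per_phoneme = math.log2(40)   # one phoneme symbol, rules ignored
bits_per_phoneme_char = bits_per_phoneme / AVG_CHARS_PER_PHONEME

# How much longer a phoneme password must be to match an [a-z] password.
length_factor = bits_per_letter / bits_per_phoneme_char

print(f"{bits_per_letter:.2f}")        # 4.70
print(f"{bits_per_phoneme:.2f}")       # 5.32
print(f"{bits_per_phoneme_char:.2f}")  # 3.87
print(f"{length_factor:.2f}")          # 1.21
```

The factor of about 1.21 is where the "about 20% longer" figure comes from.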

The constraints due to pronounce-ability limit this further.

Estimating a numerical difference is trickier... this is hopefully an "order of magnitude" approximation. pwgen's 40 phonemes break down as:

20 CONSONANT
 5 CONSONANT DIPTHONG
 2 CONSONANT DIPTHONG NOT_FIRST
 5 VOWEL
 8 VOWEL DIPTHONG

(Diphthong is mis-spelt in the source, no matter.)

I will (heavily) approximate a calculation for 3-4 phoneme (~5 character) password, based on the above rules (and with a little empirical evidence). ~80% of passwords are of the form of alternating Consonant/Vowel phonemes, i.e. C V C [V ...] or V C V [C ...], the remaining ~20% have a vowel pair, e.g. C V V C (consonant phoneme pairs are forbidden; they may occur in the output characters though, particularly due to the phoneme "ng"). (A problem here is that working out the length in characters from the phonemes makes the problem intractable. This isn't just a permutation problem, you have to work out permutations of permutations I suspect for an accurate answer).

To get a reasonable estimate by calculating the most frequent arrangements:

c v c v = 25*13*27*13   = 114075
v c v   = 13*27*13      =   4563
c v c   = 25*13*27      =   8775
v c v c = 13*27*13*27   = 123201
c v v c = 25*13*5*27    =  43875
v c v v = 13*27*13*5    =  22815
v v c v = 13*5*27*13    =  22815
                         -------
                          340119

The magic numbers here are: 25, the number of consonants (incl. diphthongs) without NOT_FIRST; 27, the number of consonants (incl. diphthongs); 13, the number of vowels (incl. diphthongs); and 5, the number of non-diphthong vowels that can follow a vowel.

Empirical data indicates the true number to be about 15% higher, but if more of the permutations are included they start to exceed the length of 5 characters, giving an inflated answer.

A random 5-letter [a-z] password has approx 11.9M permutations; the estimate above is less than 3% of that.
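The table of arrangements can be recomputed with a short sketch. It uses only the four counts given above; the names are illustrative and this is not how pwgen itself works, just a check of the arithmetic:

```python
from math import prod

FIRST_C = 25    # consonant phonemes allowed at the start (NOT_FIRST excluded)
C = 27          # all consonant phonemes (incl. diphthongs)
V = 13          # all vowel phonemes (incl. diphthongs)
V_AFTER_V = 5   # non-diphthong vowels that may follow a vowel

# The seven most frequent arrangements considered in the estimate.
ARRANGEMENTS = {
    "c v c v": (FIRST_C, V, C, V),
    "v c v":   (V, C, V),
    "c v c":   (FIRST_C, V, C),
    "v c v c": (V, C, V, C),
    "c v v c": (FIRST_C, V, V_AFTER_V, C),
    "v c v v": (V, C, V, V_AFTER_V),
    "v v c v": (V, V_AFTER_V, C, V),
}

counts = {name: prod(factors) for name, factors in ARRANGEMENTS.items()}
total = sum(counts.values())
share = total / 26 ** 5   # fraction of all 5-letter [a-z] strings

print(total)           # 340119
print(f"{share:.2%}")  # 2.86%
```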

A rough approximation then, ignoring edge-cases and by considering pairs of symbols at a time, for a pwgen pronounceable password of length n characters,

P = 767 ^ (n/(2*1.4))

where 767 is ( 27*13 + 13*27 + 13*5 ), the permutations of symbol pairs c v, v c, v v. The exponent is divided by 2 because symbols are counted in pairs, and by 1.4 to convert the character length n into a number of phonemes. (Having the estimated number 1.4 in a power makes the formula somewhat sensitive to minor changes.)

The 767 valid symbol pairs consume approximately 2.8 characters, for 9.6 bits (log2(767)) of effective entropy, or 3.4 bits per character. Compared with 4.7 bits for [a-z], we need an overall factor of approx 1.35 to bring these passwords up to comparable strength, i.e. one third longer.
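A minimal sketch of the rough formula, assuming the estimated 1.4 characters per phoneme used above. The function names are illustrative and not part of pwgen:

```python
import math

PAIR_COUNT = 27 * 13 + 13 * 27 + 13 * 5   # c v + v c + v v pairs = 767
CHARS_PER_PHONEME = 1.4                   # estimated average from the answer
CHARS_PER_PAIR = 2 * CHARS_PER_PHONEME    # about 2.8 characters per pair


def pwgen_estimate(n_chars: int) -> float:
    """Rough count of pwgen -A0 passwords of n_chars characters."""
    return PAIR_COUNT ** (n_chars / CHARS_PER_PAIR)


def pwgen_bits_per_char() -> float:
    """Effective entropy per character under the same approximation."""
    return math.log2(PAIR_COUNT) / CHARS_PER_PAIR


print(PAIR_COUNT)                      # 767
print(f"{math.log2(PAIR_COUNT):.1f}")  # 9.6
print(f"{pwgen_bits_per_char():.1f}")  # 3.4
```

Because 1.4 sits in the exponent, nudging it to 1.375 or 1.425 changes `pwgen_estimate(n)` noticeably for larger n, which is the sensitivity mentioned above.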

For comparison, allowing random mixed case and digits in the pwgen output gets you back up to ~4 bits per character, so a pwgen password of length n (without -A0) is still less than an [a-z] password of the same length (~4.7).

(For empirical proofs, note that pwgen only uses pronounceable phonemes when the length is >=5.)

For use as a password, you may want at least 50 bits of entropy (e.g. equivalent to 8 characters from [a-zA-Z0-9] + ASCII punctuation, 6.5 bits per character.) This can be achieved with a pwgen -A0 password length of 15-16 characters (~3.4 bits per character). This is a doubling of the length to have a comparable strength password.
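The length comparison can be checked with a small helper. This is a sketch only: `chars_needed` is a hypothetical name, the 3.4 bits per character is the estimate derived above, and 94 is the count of printable non-space ASCII characters:

```python
import math


def chars_needed(target_bits: float, bits_per_char: float) -> int:
    """Smallest whole number of characters reaching target_bits of entropy."""
    return math.ceil(target_bits / bits_per_char)


TARGET_BITS = 50

print(chars_needed(TARGET_BITS, 3.4))            # pwgen -A0 estimate: 15
print(chars_needed(TARGET_BITS, math.log2(26)))  # random [a-z]: 11
print(chars_needed(TARGET_BITS, math.log2(94)))  # letters, digits, punctuation: 8
```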

Number of n len words (26^n) > Number of n len pronounceable words > Number of n len pass-phrases > Number of n len English words.

All true (I assume that "n len English words" means using a single word as a password). Pass-phrases need to be quite long to be effective, perhaps 2 bits per character (e.g. 40k words, avg length 8, with unrelated words -- related words give less). Fixed short-word dictionary style schemes like that used in RFC2289 achieve ~3 bits per character.
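The "perhaps 2 bits per character" figure for pass-phrases can be sanity-checked as follows. This sketch assumes words are picked uniformly and independently from the list, and the function name is illustrative:

```python
import math


def passphrase_bits_per_char(wordlist_size: int, avg_word_len: float) -> float:
    """Entropy per character when words are chosen uniformly at random."""
    return math.log2(wordlist_size) / avg_word_len


# Example from the text: 40k unrelated words, average length 8.
print(round(passphrase_bits_per_char(40_000, 8), 2))  # 1.91
```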


* Wolfram graph 26^(1.375n) versus 40^n or also try log plot 26^(1.375n) versus 40^n for n [0,16] which is cached here

mr.spuratic
4

Restricting yourself to only passwords that are pronounceable does decrease the entropy, which reduces strength. So in theory, the password will be weaker.

But password strength is a complicated beast. In particular, if choosing pronounceable passwords means you can remember a longer one than usual, then your entropy goes up and your password becomes stronger.

My take on passwords is:

  1. Use two factor where you can
  2. Use a good password manager with long random and impossible to remember passwords where you can
  3. Use Diceware pass phrases for everything else
Graham Hill
  • In practice lots of (poorly designed) sites limit your password to like 20 chars in which case Diceware is useless. – Casey Feb 18 '16 at 20:50
  • @Casey Hence point 2; use a password manager to allow you to have 20 characters of unique random line noise for those sites, then use Diceware to make a long but memorable pass phrase for the password manager, since you have to keep that one in your head. The trick is to put everything you possibly can into the password manager: I've got it down to only four things I type so often I need to memorise them. – Graham Hill Feb 19 '16 at 11:27