59

I know for a fact that some sites/apps with low security restrict passwords to alphanumeric characters only, and some allow a slightly broader ASCII range. Some sites/apps also support Unicode.

Passwords are usually meant to be typable on any generic keyboard, so they are typically generated using the commonly available characters. But for passwords which will only be kept digitally, would it be a good idea to maximize the guessing time by using the entire Unicode range of characters? Or are there reasons to believe some or most Unicode-supporting sites/apps could still limit their allowed character range?

Anders
person of entropy
  • 32
    Whenever you keep a password "only digitally" (which sounds like "password manager" to me), wouldn't you always be able to create passwords with high entropy without Unicode, using only alphanumeric characters? Or to put it in other words (correct me if I'm wrong): I'd assume that services that let you use Unicode for passwords don't have strict length limitations, therefore entropy isn't a problem when using alphanumeric characters. On the other hand, services with unnecessary length limitations won't allow Unicode characters. Entropy will be a problem either way. – Tom K. Jan 16 '18 at 14:07
  • 22
    You could just use a somewhat longer password and save yourself the headache. – CodesInChaos Jan 16 '18 at 14:26
  • 11
    Be careful about how different systems may interpret the unicode strings, not all apps, hosts, dbs, etc will necessarily handle unicode the same way: https://security.stackexchange.com/a/85664/36538 – Eric G Jan 16 '18 at 15:06
  • Also related: [Do non-keyboard characters make my password less susceptible to brute forcing?](https://security.stackexchange.com/questions/4632/do-non-keyboard-characters-make-my-password-less-susceptible-to-brute-forcing), [Will using unicode chars in my password increase security?](https://security.stackexchange.com/questions/4943/will-using-unicode-chars-in-my-password-increase-security), [Why limit passwords to ascii printable characters?](https://security.stackexchange.com/questions/5694/why-limit-passwords-to-ascii-printable-characters?rq=1) – Arminius Jan 16 '18 at 15:11
  • 7
    You'd have to be *very* careful which services you use this on. If you suddenly find yourself having to type the password on an unfamiliar machine you may be in considerable difficulty. My goto example of this is a situation I've faced more than once: having to log in to print e-tickets at a hotel business centre (they weren't made available until after I'd left home; printing was required because they can't tear half a PDF off your phone to satisfy their antiquated system). The consequential loss can be rather expensive – Chris H Jan 16 '18 at 16:15
  • 1
    Agreed. I recently pasted such a generated password into the Admin account of a VM, assuming that I would always use copy/paste from KeePass for logins. However, the login prompt accepts neither pasting nor auto-typing of high-ASCII characters from KeePass. I had to spend several hours learning and writing down 64 Alt codes to be able to log back in. – Aganju Jan 16 '18 at 17:55
  • 2
    Maybe just use a longer password. If it's processed programmatically length is not an issue. – usr Jan 16 '18 at 18:42
  • 2
    If it's not used for human input, why not just a random byte sequence? Otherwise, if you make me type `ᚷᛖώдლಸஇ을身` as my default password, I will probably want to have you flayed alive. – J... Jan 16 '18 at 19:07
  • Also related: [Should users be allowed to use any special character they want when creating a password?](https://ux.stackexchange.com/q/72568/46361). (Yeah, it's a popular question.) – Mark Jan 16 '18 at 19:10
  • 4
    Not that it really matters, but I think most uses of "Unicode" in the question and answers should technically be "UTF-8". (Unicode maps characters to numbers; UTF-8 maps those numbers to byte sequences.) – David Z Jan 16 '18 at 21:08
  • Are you planning to have people copy and paste the password? Otherwise I don't see the difference between calling it "a random Unicode password" and just generating a random sequence of bytes of the same length and who cares if they correspond to a valid encoding for Unicode. – detly Jan 16 '18 at 22:03
  • 3
    What do you expect the benefit to be over generating a long sequence of random bytes? They could be encoded using base64 if you want them to be typable, but the entropy would be the same. – jpmc26 Jan 16 '18 at 22:34
  • 2
    It doesn't seem that anyone else has brought this up: ***Most people don't use English.*** Support unicode, not as some odd security tactic, but because most people on earth, and on the internet, don't use English. – Alexander Jan 17 '18 at 17:17
  • I understood OP to be asking if bruteforce guessing systems will cycle through 208 keyboard characters per position rather than a much larger UTF-8 or other Unicode set per position and would it make it harder for a bruteforce guessing system to get to your password. – KalleMP Jan 17 '18 at 21:19
  • 1
    Unicode is a can of worms, with non-printable, combining, zero-width, and direction-changing characters, locale-dependent string comparison, and a crapload of stuff I don't want to know about. Unless you really know what you are doing, stay away; otherwise it will make your life miserable really easily. – n0rd Jan 19 '18 at 01:19
  • @TomK. “Whenever you keep a password "only digitally" (which sounds like "password manager" to me)” – You might need a password to access your password manager. – Guildenstern Jan 11 '20 at 14:17

14 Answers

122

This sounds a lot like Fencepost Security. Imagine you're running a facility that has chain-link fencing around it that is 500 feet high. How much would the security improve by making that fencing 3,000 feet high? None - because anyone trying to get in isn't going to climb the 500 feet; they're going to dig underneath, cut a hole, etc.

Likewise, you've got a password that's, say, 20 random alphanumeric characters. That's 62^20 possibilities (roughly 2^119). You're considering changing it to 20 random Unicode characters. That makes the space of possibilities far larger, but brute-forcing a 20-character random password isn't how things are going to get compromised.

Laurent B
Kevin
  • 3
    I like the idea, but more possibilities per character with the same string length (in characters) would probably account for a thicker fence, in the same way that more ascii characters would. – Baldrickk Jan 16 '18 at 16:44
  • 15
    @Baldrickk Some commonly used password hashing algorithms don't go over 72 bytes anyway (bcrypt, for example, ignores input past 72 bytes). There, a 12k-byte password is only as secure as a 72-byte password. – Ricky Notaro-Garcia Jan 16 '18 at 20:19
  • 11
    @Baldrickk I think what is meant is that something like social engineering or bypassing the authentication scheme is more probable than the passwords being exposed to a brute force attack. – SBoss Jan 16 '18 at 23:13
  • 2
    I should note that by idea - I meant the metaphor. I should have been clearer. – Baldrickk Jan 16 '18 at 23:31
  • +1 for @RickyNotaro-Garcia This is the most believable and likely (and scary) fact I have read about password security in a long time. It also means that for anything other than a dictionary search the bruteforce method can be achieved with any kind of input strings and formats, simple string of hex digits will likely work if time is of no consequence. – KalleMP Jan 17 '18 at 21:25
82

This is a good idea from a security perspective. A password containing unicode characters would be harder to brute-force than a password containing ASCII characters of the same length. This holds up even if you compare byte-length instead of character length, because Unicode uses the most significant bit whereas ASCII does not.

However, I think it wouldn't be practical since Unicode bugs are so common. I think if you use Unicode passwords everywhere you will encounter more than a couple of sites where you would have problems logging in, because the developers didn't correctly implement Unicode support for passwords.

Sjoerd
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/71947/discussion-on-answer-by-sjoerd-is-it-a-good-idea-to-use-the-entire-unicode-range). – Rory Alsop Jan 20 '18 at 00:24
  • Not to mention, different keyboards in different locales will give you different unicode. What works on one computer may not work on another computer. – forest Jan 20 '18 at 12:00
24

Ignoring the Security by Obscurity argument, it is a basic question of entropy. An 8 character Unicode password is more secure than an 8 character ASCII password but less secure than a 64 character ASCII password.

In general I agree with Sjoerd: these are likely to cause more inconvenience than benefit. On top of this, if you ever need to enter a password manually, random Unicode is likely going to make your life miserable.

However, for the edge case where you need to use a service which actively supports Unicode whilst enforcing a maximum password length (again ignoring that this usually indicates other security failings), there is an argument for it.

Hector
  • 1
    +1 for the active support remark. Having "escaped" characters or less than full support can be very problematic. If it cannot be reliably used then it is not "secure". – mckenzm Jan 18 '18 at 02:06
  • 1
    "An 8 character unicode password is ... less secure than a 64 character ASCII password." Some systems truncate passwords to 8 characters, in which case an 8 character unicode password is more secure than a 64 character ASCII password. – Acccumulation Jan 18 '18 at 21:03
15

The only valid reason I can think of for using Unicode characters in passwords is if the number of characters (not bytes) in a password for a particular site is limited (like this dumb bank that previously had a max of 10 characters), so that it would be easily guessed in a day or two. In this case, you can use Unicode (if the site owners let you) to get more entropy into your password in the meantime, while you ask the site owners to comply with NIST 800-63-3 and remove length restrictions (and hash properly so password storage is not a concern).

I also wanted to correct a misconception here:

Unicode uses the most significant bit whereas ASCII does not.

While true for normal ASCII, the extended-ASCII that some password managers (e.g. KeePass) can use in password generation uses every single bit of each byte, thus having a higher entropic density than even Unicode, which still has some structure to indicate how many of the following bytes are part of the same character (Note that there is such a thing as an invalid byte sequence in Unicode).

A site that limits you to short passwords probably doesn't even hash its passwords properly, causing Unicode passwords to fail or be stored with the wrong encoding. So you should almost never waste time on Unicode passwords: while you are distracted with your funny characters (and with having to reset your password because it was stored strangely), an attacker could be guessing your (in)-security questions or using chocolate cryptography to gain access to your account.

NH.
  • While some recommend an additional use case (["Showing off to your friends"](https://makemeapassword.ligos.net/Generate)), this is invalid because passwords should not be shown to your friends :). – NH. Jan 16 '18 at 18:49
  • 1
    KeePass could only use every bit of every byte if the server used an 8-bit encoding, KeePass knew which encoding that was, and every stage of transmission was 8-bit transparent. From the source code (PwCharSet.cs) it appears that KeePass's misnamed "high ANSI" option only includes U+00A1 through U+00AC and U+00AE through U+00FF, i.e., the printable non-ASCII Latin-1 characters. If the password is encoded as UTF-8, the entropy per byte will actually be lower when "high ANSI" is checked than when it's unchecked. – benrg Jan 17 '18 at 15:49
  • 1
    (Not that entropy per byte matters anyway, unless the passwords are truncated before being hashed.) – benrg Jan 17 '18 at 15:54
  • @Sjoerd is correct: ASCII is 7-bit. An 8-bit encoding might include ASCII as a subset, but it is not ASCII, it's a superset. And there are many 8-bit encodings -- even just in the ISO/IEC 8859 series. The problem of wrong encodings exists even with 8-bit encodings. – Rosie F Jan 18 '18 at 18:59
11

This approach could reduce your overall security in certain cases, not improve it.

Information Security consists of three attributes: Confidentiality, Integrity, and Availability (the CIA triad). By focusing exclusively on one, you can easily overlook the importance of the others.

Confidentiality of passwords is achieved through the principles of entropy: how 'unguessable' is your password? This is commonly measured by the size of the brute force guessing space, expressed in terms of powers of 2 or bits. A brute force attacker has only so much capacity to guess; by selecting a longer password to increase this entropy you can exceed any known or predicted capability to guess. Getting the entropy over 80 bits (or pick your value) will put the password out of reach of even nation state actors. Regardless of the overly simplified description above, the point is that going above and beyond whatever "out of reach" is doesn't significantly add to your security. And it isn't relevant to security if you achieve the desired entropy by using 10 Unicode characters or 17 ASCII characters.

Availability means "can I get to my data when I need it?" If you use full Unicode character sets, you risk running afoul of various sites that don't support Unicode, or browsers or OSes that implement Unicode incorrectly, or sites that invisibly translate Unicode to ASCII under the covers. The resulting confusion increases the risk of restricting your future access to the data. This represents a potential decrease in future Availability.

In general, the likelihood of an attacker brute forcing your 80 bit password is not nearly as high as the likelihood of encountering a poorly coded site that doesn't handle Unicode properly. Therefore your overall security could be decreased instead of increased.

Of course, many sites have password length and other restrictions that dramatically limit the entropy of your passwords, too. In those cases, using the full Unicode set may increase the entropy of your passwords, assuming they don't have other hidden flaws. So on those sites, you may be improving your security; but it's virtually impossible to tell from outside if a site is properly handling your password data.

John Deters
  • 2
    With poor support for UTF-8 a lot of times, Integrity is also at risk (at least of the password, maybe not of the data the password protects). – NH. Jan 16 '18 at 20:01
  • 2
    If a site limits you to say 10 characters, I very much doubt they will allow 10 Chinese characters. Imagine the site where you enter the password first silently converts it to UTF-8 and removes the highest bit. Now you're stuck. – gnasher729 Jan 16 '18 at 21:31
3

This might not be a direct answer, but if the password is only kept digitally, then you should ask yourself why you are generating a password at all, instead of a byte-array. Once you look at the whole thing as simply bytes, the question doesn't apply anymore.

Also, length > complexity in all things related to passwords.

Tom
  • 2
    Most applications simply don't accept an arbitrary array of bytes as password. E.g. try to put a null byte in the password on your favorite website. – Dmitry Grigoryev Jan 17 '18 at 15:25
  • @DmitryGrigoryev perhaps Tom was suggesting a HEX representation of his byte array. The source of the entropy cannot be determined by guessing if there is more than a dictionary attack can find. – KalleMP Jan 17 '18 at 21:31
  • It isn't clear from the question if the OP is looking for a password he can enter anywhere, or for a login implementation to use for his own site. – Tom Jan 18 '18 at 06:19
3

There is an RFC on your problem, RFC 8265: Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords, whose abstract is:

This document describes updated methods for handling Unicode strings representing usernames and passwords. The previous approach was known as SASLprep (RFC 4013) and was based on Stringprep (RFC 3454). The methods specified in this document provide a more sustainable approach to the handling of internationalized usernames and passwords.

If you read French you can also find a good explanation of it here: http://www.bortzmeyer.org/8265.html.

Section 8 of the RFC deals specifically with security of passwords using "any" unicode character, with the following sections:

  • 8.1. Password/Passphrase Strength
  • 8.2. Password/Passphrase Comparison
  • 8.3. Identifier Comparison
  • 8.4. Reuse of PRECIS
  • 8.5. Reuse of Unicode
Patrick Mevzek
  • 1
    This. Implementing Unicode is [incredibly](https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/) [hard](https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129%236163129) and things like using a defined normalization are essential for even having a chance to make it work. – Dubu Jan 18 '18 at 10:14
2

Adding to the other good answers I just want to point out what can go wrong along the pipe with using the full Unicode set for passwords.

Assuming you randomly use a more-or-less valid UTF-8 string,

  • there are characters interpreted as line breaks in the middle of the Basic Multilingual Plane, like U+2029 Paragraph Separator. You might not be able to enter them in a plain <input> field.
  • there are whitespace characters that might or might not be stripped by a language's trim() method
  • if you happen to use surrogate code points (U+D800-U+DFFF, which well-formed UTF-8 cannot encode at all), non-characters like U+FDD0, Private-Use characters, or unassigned code points (which can still be valid UTF-8 sequences), the behavior of any given set of tools is basically undefined (stripping, replacing, rejecting, doing nothing, or any combination of those)
  • if you happen to add diacritics, any tool along the way might change that to its NFC or NFD representation, changing the bytes in the password.
  • and that's just off the top of my head. I'm sure I forgot a lot of other possible problems that arise if passwords are chosen from the whole Unicode range.
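Two of the points above can be reproduced with Python's standard library. This is only a small sketch: the sample strings are made up for illustration, and it shows how one Python runtime behaves, not how any particular site processes passwords.

```python
import unicodedata

# Hypothetical password containing "é" as one precomposed code point (U+00E9).
password = "caf\u00e9"

nfc = unicodedata.normalize("NFC", password)  # composed form
nfd = unicodedata.normalize("NFD", password)  # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                  # False: same text on screen, different code points
print(nfc.encode("utf-8").hex())   # 636166c3a9
print(nfd.encode("utf-8").hex())   # 63616665cc81

# Python's str.strip() removes U+2003 EM SPACE, which a user might have intended to keep.
stripped = "pw\u2003".strip()
print(len(stripped))               # 2

# U+2029 PARAGRAPH SEPARATOR is treated as a line boundary by str.splitlines().
lines = "a\u2029b".splitlines()
print(lines)                       # ['a', 'b']
```

If the client sends one form and some intermediate layer rewrites it to the other, the bytes that reach the hash function differ, so the hash no longer matches.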

So, similar to what @JohnDeters suggested, it might be a bad idea because the advantage of a larger source space is outweighed by the moving parts of text processing along the way.

Boldewyn
1

What counts for the security is the entropy of the password - how many bits of actual information are there in the password.

The other side of the problem is how difficult it is to remember the password, and to type the password. Imagine you try to type your password on an iPhone and you realise you can't (I haven't checked how hard it is to type arbitrary Unicode characters). Or you realise that it is very, very difficult to type it 100% correctly. Or it just takes you ages - I might have a password with the same entropy, and more characters, but twice as fast to type. And you need four attempts to get it right, while mine is right the first time.

gnasher729
1

Others have amply remarked on the risk that the services you use those passwords for will not implement Unicode correctly. I'll add that services that today do might tomorrow cease to do so, but otherwise I'll skip that topic.

One element that I think must be considered here is to ask: how long does a password need to be to achieve a certain security level? Let's suppose that we want our passwords to be as strong as a 128-bit cryptographic key (which is probably overkill for most website passwords; I recommend 80 bits). If you stick to random ASCII passwords, then a 19-character password drawn from the ~95 printable ASCII characters reaches that level. (The math: a set of about 100 elements is about 6.6 bits/element (since log2(10) ≈ 3.3, and log2(100) is twice that), and 128 ÷ 6.6 ≈ 19.4. So 19 ASCII characters is actually about 126 bits, not 128, but meh.)

Unicode currently has about 130,000 codepoints defined, a number we'll just approximate as 2^17. This means that to reach the 128-bit level you need 7 Unicode codepoints (128 ÷ 17 ≈ 7.5, so 7 codepoints is only about 119 bits, but, again, meh.)

For 80-bit security level, which I think is more sensible for most websites, it's 12 ASCII characters vs. 5 Unicode codepoints.
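The arithmetic above can be checked with a few lines of Python. This is a rough sketch under the answer's own assumptions (95 printable ASCII characters, Unicode approximated as 2^17 code points, every symbol chosen uniformly at random); the function names are made up for illustration. It rounds up to the next whole symbol, so some lengths come out one higher than the answer's rounded figures.

```python
import math

def bits_per_symbol(alphabet_size: int) -> float:
    """Entropy in bits of one symbol drawn uniformly from the alphabet."""
    return math.log2(alphabet_size)

def symbols_needed(alphabet_size: int, target_bits: float) -> int:
    """Smallest password length whose total entropy is at least target_bits."""
    return math.ceil(target_bits / bits_per_symbol(alphabet_size))

ASCII_PRINTABLE = 95      # printable ASCII, including the space character
UNICODE_APPROX = 2 ** 17  # the answer's approximation, not an exact count

for target in (80, 128):
    print(target,
          symbols_needed(ASCII_PRINTABLE, target),
          symbols_needed(UNICODE_APPROX, target))
```

Either way, the gap is only a handful of characters.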

Are you willing to take a massive usability hit and risk website bugs just so that you can have 5 or 7 character passwords instead of 12 or 19 characters? I just don't think it's worth it.

Luis Casillas
0

Well, Unicode is 'just' a list of a little over 130,000 characters. UTF-8 is the most common encoding that takes that one big number and 'converts' it to a base-256 number (or more precisely, the rules make more sense in binary octets) according to a set of rules. Thus if you want to use a UTF-8 encoding, you'd be bound to a lot of rules, effectively decreasing the randomness you might desire. And I don't know how you would make use of the whole Unicode range.

If you are not concerned about printable characters, you might consider the whole ASCII (or more preferably some 8 bit extension), but at that point, why even bother with character interpretation standard at all? Couldn't you simply use some simple formless random binary structure then?
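As a sketch of that last idea, the snippet below uses Python's standard `secrets` module to draw formless random bytes and render them as hex or URL-safe base64 so they can be pasted into an ordinary password field. The helper `is_valid_utf8` is a made-up name used only to illustrate the structural rules UTF-8 imposes; the 16-byte size is an arbitrary choice for a 128-bit secret.

```python
import secrets

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc3\xa9"))  # True: the two-byte encoding of U+00E9
print(is_valid_utf8(b"\xc3\x28"))  # False: lead byte without a continuation byte
print(is_valid_utf8(b"\xff"))      # False: 0xFF never appears in UTF-8

raw = secrets.token_bytes(16)        # 128 bits of randomness, no character semantics
as_hex = raw.hex()                   # 32 hex digits
as_text = secrets.token_urlsafe(16)  # a separate 128-bit secret as URL-safe base64 text
print(len(as_hex), len(as_text))
```

The encoded forms carry the same entropy as the raw bytes; they are just longer.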

JonnyRobbie
0

Even if you're using digital storage only, I'd hate to be the user who needed to type something in and didn't recognise the difference between （ and (. (Hint: the first is the full-width parenthesis, U+FF08, and the second is the ordinary ASCII one.)

Taking this width-sensitive example: as noted in other answers, you're relying on every layer supporting these characters correctly. Depending on the SQL collation in use, a width-insensitive comparison could accept an incorrect password as correct!

David M
0

No. You want to increase the base of an exponential function while risking a lot of things breaking (i.e. devices which cannot type your special characters, etc.). Calculate the entropy of your unicode password and make your ascii password longer until it has more entropy.

a-z alone gives 26^(length) possibilities. Let's say Unicode gives you 256^(length), with possibly 2 bytes per character. Then you can find the break-even point, 26^(ascii_length) > 256^(2*unicode_length): taking logarithms, ascii_length > 2 * unicode_length * log(256)/log(26), or roughly 3.4 * unicode_length. Choose that as your ascii_length and you can still write down your password and have the same security.

If the site does not support long passwords (shame on them), I would suspect they cannot guarantee good unicode support either. Maybe you will be locked out the next time they upgrade some internal library. So why risk a problem there? And a problem, which is hard to explain to the user support, which hardly knows what unicode means.

allo
0

It is not a good idea to generate random Unicode passwords, because the generated password may be unreadable to the user. But the text of the question talks about using Unicode passwords, and this is a good idea and recommended by NIST 800-63-3 section 5.1.1.2 Memorized Secret Verifiers which says:

Verifiers SHALL require subscriber-chosen memorized secrets to be at least 8 characters in length. Verifiers SHOULD permit subscriber-chosen memorized secrets at least 64 characters in length. All printing ASCII [RFC 20] characters as well as the space character SHOULD be acceptable in memorized secrets. Unicode [ISO/ISC 10646] characters SHOULD be accepted as well.

For purposes of the above length requirements, each Unicode code point SHALL be counted as a single character.

It continues:

If Unicode characters are accepted in memorized secrets, the verifier SHOULD apply the Normalization Process for Stabilized Strings using either the NFKC or NFKD normalization defined in Section 12.1 of Unicode Standard Annex 15.
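A verifier following that recommendation might look roughly like the sketch below. It is only an illustration: the function names are hypothetical, and the choice of PBKDF2-HMAC-SHA256 and the iteration count are assumptions made here, not something the quoted text prescribes.

```python
import hashlib
import os
import unicodedata

def prepare(password: str) -> bytes:
    """Normalize to NFKC, then encode as UTF-8 so hashing sees stable bytes."""
    return unicodedata.normalize("NFKC", password).encode("utf-8")

def hash_password(password: str, salt: bytes) -> bytes:
    # Iteration count is illustrative only; choose it according to current guidance.
    return hashlib.pbkdf2_hmac("sha256", prepare(password), salt, 100_000)

salt = os.urandom(16)

fullwidth = "pass\uff08word"  # contains U+FF08 FULLWIDTH LEFT PARENTHESIS
ascii_paren = "pass(word"     # contains the ordinary ASCII "("

same = hash_password(fullwidth, salt) == hash_password(ascii_paren, salt)
print(same)  # True: NFKC folds the full-width form to the ASCII one
```

Note the trade-off: compatibility normalization makes logins robust against input-method differences, but it also means visually similar variants no longer count as distinct passwords.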

To summarize: The use of Unicode increases entropy and makes life easier for users who are not proficient with English.

  • Could you elaborate what this is adding to the other answers? – Tom K. Jan 18 '18 at 19:40
  • I pointed out the difference between generating a Unicode password and choosing one, and that entropy is not reduced as claimed by other answers but increased since one should count the characters rather than the bits. Also, NIST not only allows Unicode but actually recommends its use, which is a direct reply to the question. – Jonathan Rosenne Jan 18 '18 at 19:48