14

Possible Duplicate:
Do non-keyboard characters make my password less susceptible to brute forcing?

Every article on password security that I read tells people to make their passwords more complicated by using a wider range of characters. They say don't only use a-z, but also mix in some A-Z, the digits 0-9 and some punctuation; basically, use all of the characters on your keyboard. However, I am building websites designed for a non-English-speaking audience, specifically Chinese users, and I have noticed that many Chinese websites ask for passwords made from that same character set. This leaves me puzzled as to why passwords are limited to only the core ASCII set. Why not use Chinese characters, or another script's characters?

For example, instead of using "!0*%y6#!7N@6" the user could use "胜0屿%y6#!7N景6", which is the same length but significantly more complex.

My application is built in UTF-8 and is compatible with Chinese and other complex scripts. So there is no programming problem for me to allow complex characters in passwords.

By extending the possible character set of passwords to include Chinese, Japanese, Korean and Arabic characters, I can increase the entropy of the passwords to incredibly high levels without making them longer or more difficult to remember. In fact, it may be easier for my Chinese users to remember a Chinese password than an English-language one. It would be very unlikely that someone could brute-force or use a rainbow table to crack the password.
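To put rough numbers on that claim, here is a small Python sketch. The pool sizes are approximations of my own (about 94 printable ASCII characters, plus roughly 20,000 common CJK ideographs), and it assumes every character is chosen uniformly at random:

    import math

    # Approximate pool sizes (assumptions for illustration only):
    # ~94 printable ASCII characters, plus roughly 20,000 common CJK
    # ideographs (the U+4E00..U+9FFF block).
    ascii_pool = 94
    cjk_pool = 94 + 20000

    length = 12  # same length as the example passwords above

    # Entropy of a uniformly random password: length * log2(pool size)
    print(length * math.log2(ascii_pool))  # ~78.7 bits
    print(length * math.log2(cjk_pool))    # ~171.5 bits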

I can understand the character-set limits for Western users, where the characters used are all the ones on a keyboard and it is quite awkward to enter a character that is not on your keyboard. However, Chinese users have the tools on their systems to enter the full Chinese character set, so there is no problem for them there.

So, to put the question in short: is there any security issue in allowing users to make passwords from characters beyond the normal keyboard set?

To expand and answer AviD's point below:

When a password is entered, it doesn't remain as characters but is instead converted into a sequence of bits. These bits are the real password. The process of converting characters to bits is character encoding. ASCII is one such encoding, though now rather old and limited in size. Another common one is Unicode, most often applied through the UTF-8 encoding that most websites are recommended to use today.

Unicode and UTF-8 are backwards compatible with ASCII, so any ASCII-based password would be the same in bits no matter which of these encodings was used when the password was entered. However, there are some encodings still in popular use that are not compatible with ASCII, and many that are not compatible with Unicode or UTF-8. These include encoding systems such as Big5 (used in Taiwan and Hong Kong) and GB (used in mainland China).

If someone entered their password into a computer one day under one encoding and another day under another encoding, then the sequence of bits sent as the password would be different.
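A small Python sketch of the problem, using Python's built-in codecs ('gbk' stands in here for the GB family):

    # An ASCII-only password has the same bytes under ASCII and UTF-8...
    pw = "!0*%y6#!7N@6"
    print(pw.encode("ascii") == pw.encode("utf-8"))  # True

    # ...but a single Chinese character produces a different byte sequence,
    # and therefore different "password bits", under each encoding.
    ch = "中"
    print(ch.encode("utf-8"))  # b'\xe4\xb8\xad'
    print(ch.encode("gbk"))    # two bytes, different from the UTF-8 sequence
    print(ch.encode("big5"))   # two bytes, different again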

It is possible to detect the encoding system and convert at the server side. My applications already do that, converting everything that is entered into UTF-8. However, I wonder how perfect that conversion is. Would Big5 converted to UTF-8 give the same result as GB converted to UTF-8?
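For characters present in both encodings it generally should, because both decode to the same Unicode code points before being re-encoded. A sketch, assuming the source encoding has been detected correctly (the detection is where things can go wrong):

    # The same character arriving as Big5 bytes or as GBK bytes converts to
    # identical UTF-8 bytes, provided each is decoded with the right codec.
    big5_bytes = "景".encode("big5")  # simulating input from a Big5 client
    gbk_bytes = "景".encode("gbk")    # the same character from a GBK client

    utf8_from_big5 = big5_bytes.decode("big5").encode("utf-8")
    utf8_from_gbk = gbk_bytes.decode("gbk").encode("utf-8")

    print(utf8_from_big5 == utf8_from_gbk)  # True

    # If the wrong codec is chosen, the decode either raises an error or
    # silently yields different characters, and the password bits change.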

Additionally, there are some character-encoding-based XSS attacks that use sloppy character encoding and handling as their vector. Could a similar technique be used to compromise user passwords or my application if no, or few, limits are placed on what characters can be entered?

Rincewind42
  • 455
  • 1
  • 5
  • 12
  • Hi @Rincewind42, welcome to [security.se]! Please note that this topic was already discussed recently; there are already some good answers there. – AviD Jul 27 '11 at 08:45
  • 1
    With respect, it's not the same question. Previous threads focused on the user's point of view: whether to use non-keyboard characters in passwords. My question is from the developer's point of view: should I allow the full range, or should I enforce a restricted ASCII-only character set on my users. – Rincewind42 Jul 27 '11 at 10:09
  • Rincewind, for security, how is that different? As a developer, you should do whatever gives you the right tradeoff between security and usability - and that question provides the answers. What is missing for you? Perhaps you can focus your question on the specific aspects you feel are missing. – AviD Jul 27 '11 at 10:42
  • As you suggested, I have expanded the question to give some examples of issues that aren't covered in the other thread. – Rincewind42 Aug 04 '11 at 08:34
  • I'd love to see those password requirements: "8 characters min, at least one lowercase letter, at least one digit, at least one Emoji" ;) – el.pescado - нет войне Aug 20 '15 at 11:00

2 Answers

12

From a security point of view, it is desirable to have a huge character set to choose characters from, just as you said. The reasons why some sites encourage users to stick to ASCII are technical and organisational:

If the software was written without Unicode in mind, or using an environment which does not make Unicode totally transparent (such as PHP, old MySQL versions, etc.), it requires a significant amount of work to add Unicode support.

Furthermore, if the software is not targeted at the Asian market, the developing company might just not care. It may have no experience or knowledge with which to test those charsets. Think of resolving combined characters, characters that look exactly the same but have different codes, etc.

This might result in support requests that need to be handled, and therefore in costs.

tl;dr: If your software can handle Unicode correctly and you can deal with support requests, go for the full range.
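For the "characters that look exactly the same but have different codes" problem mentioned above, one common mitigation is to normalise the password to a single Unicode form before hashing. A minimal sketch in Python, with NFKC chosen purely for illustration (and SHA-256 standing in for a proper slow, salted password hash):

    import hashlib
    import unicodedata

    def hash_password(password: str) -> str:
        # Normalise so that canonically equivalent inputs (a precomposed
        # character vs. base character + combining mark) give the same bytes.
        normalised = unicodedata.normalize("NFKC", password)
        # SHA-256 keeps the sketch short; a real system should use a salted,
        # slow scheme such as bcrypt, scrypt or PBKDF2.
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    # "é" entered as one code point, or as "e" plus a combining acute accent:
    print(hash_password("caf\u00e9") == hash_password("cafe\u0301"))  # True

As supercat's comment below points out, normalisation is not a complete answer, since the set of defined Unicode characters keeps growing.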

Hendrik Brummermann
  • 27,158
  • 6
  • 80
  • 121
  • 3
    Even if software was written with Unicode in mind, some Unicode characters have more than one possible representation, and there's no nice easy rule for identifying Unicode strings whose representation might not be constant. Code can only convert a Unicode string to canonical form if it knows the canonical form of all characters therein; since the set of defined Unicode characters is growing, code may receive characters which are legitimate, but where it's impossible to determine whether they are in canonical form or not. – supercat Jan 19 '14 at 17:41
3

The usual reason cited for sticking to a standardised set of characters is more around usability, both from the user's perspective and from the helpdesk's perspective.

From a security perspective, adding more character sets increases the character space an attacker would have to search through, which makes a brute-force attack harder, and it reduces the likelihood that a rainbow table suitable for your system exists.

Rory Alsop
  • 61,474
  • 12
  • 117
  • 321
  • 1
    Help desk aside, "escaped" characters used in passwords are frowned upon, and can break some "Single Sign On" (SSO) implementations. (I am not advocating SSO). The password may need to survive SQL injection countermeasures, there may not be enough space in the field, etc. – mckenzm Jun 13 '15 at 17:42