After reviewing this I realised I still have a few questions left regarding the topic.
Are there any characters that should be ‘left out’ for legitimate security purposes? This includes all characters, such as brackets, commas, apostrophes, and parentheses.
While on this subject, I admittedly don’t understand why admins seem to enjoy enforcing the “you can only use the alphabet, numbers, and spaces” rule. Does anything else have the potential to be a security flaw or break something I’m not aware of (even in ASCII)? As far as I’ve seen during my coding days there is absolutely no reason that any character should be barred from being in a username.
It is often exactly those characters that can be used to inject malicious code into your program: for example SQL injection (quotes, dashes, etc.), XSS/CSRF (quotes, angle brackets, etc.), or even programming-language injection when eval() is used elsewhere in your code.
Those characters usually do no harm when you, as the developer, properly sanitize the user-controlled input/output, i.e. everything that comes in with the HTTP request: the headers, parameters, and body. E.g. use parameterized queries or mysql_real_escape_string() when inlining values in a SQL query to prevent SQL injection, and htmlspecialchars() when inlining them in HTML to prevent XSS. But I can imagine that admins don't trust all developers, so they add those restrictions.
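A minimal sketch of both defenses. The answer names PHP helpers; this uses Python's stdlib equivalents (sqlite3 parameter binding and html.escape) purely as an illustration of the same idea:

```python
import html
import sqlite3

# In-memory database, just for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# A hostile username containing quote characters.
username = "bob'; DROP TABLE users; --"

# Parameterized query: the driver keeps the value out of the SQL syntax,
# so the quotes are stored literally instead of being parsed as SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (username,))
stored = conn.execute("SELECT name FROM users").fetchone()[0]
print(stored)  # the username round-trips verbatim; no injection happened

# Escaping for HTML output: <, >, &, and quotes become entities,
# so the value cannot break out of the surrounding markup.
print(html.escape('<script>alert("x")</script>'))
# &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

The point is that the dangerous characters are neutralized at the boundary where they matter (the SQL parser, the HTML parser), so there is no need to ban them from the username itself.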
There’s no security reason to not use certain characters. If you’re properly handling all input, it doesn’t make any difference whether you’re only handling alphanumeric characters or Chinese.
It is easier to handle only alphanumeric usernames. You don't need to think about ambiguity with collations in your database, encoding usernames in URLs, and things like that. But again, if you're handling it properly, there's no technical reason against it.
For practical reasons, passwords are often alphanumeric only. Most password inputs don't accept IME input, for example, so it's almost impossible to type a Japanese password. There's no security reason for disallowing non-alphanumeric characters, though. On the contrary: the larger the usable alphabet, the better.
If your application handles Unicode input properly throughout, I’d certainly allow non-ASCII characters in usernames and passwords, with a few caveats:
If you use HTTP Basic Authentication, you can’t properly support non-ASCII characters in usernames and passwords, because the process of passing those details involves an encode-to-bytes-in-base64 step that, currently, browsers don’t agree on:
- Safari uses ISO-8859-1, and breaks if there are any non-8859-1 characters present;
- Mozilla uses the low byte of each UTF-16 code unit (the same as ISO-8859-1 for those characters);
- Opera and Chrome use UTF-8;
- IE uses the ANSI code page of the system it's installed on, which could be anything, but never ISO-8859-1 or UTF-8. Characters that don't fit the encoding are arbitrarily mangled.
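The ambiguity is easy to see if you build the Basic Auth token by hand. This Python sketch (with made-up credentials) encodes the same user:password string the two most common ways browsers do it:

```python
import base64

user, password = "müller", "pässword"  # hypothetical non-ASCII credentials
credentials = f"{user}:{password}"

# The Authorization header carries base64(user ":" password) over *bytes*,
# so the token depends on which character encoding the browser picked.
tokens = {}
for enc in ("iso-8859-1", "utf-8"):
    tokens[enc] = base64.b64encode(credentials.encode(enc)).decode("ascii")
    print(enc, tokens[enc])

# The two tokens differ, so the server has to guess which encoding it
# received — and different browsers will send different ones.
print(tokens["iso-8859-1"] != tokens["utf-8"])  # True
```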
“you can only use the alphabet, numbers, and spaces”
You get spaces? Luxury!
I don't think there is a reason not to allow Unicode in usernames. Passwords are a different story: since you don't usually see the password as you type it into a form, allowing only ASCII makes sense to prevent possible confusion.
I think it makes sense to use the email address as the login credential rather than requiring users to create a new username. Then the user can select any nickname, using any Unicode characters, and have that nick displayed next to their posts and comments.
Isn’t this how it’s done on Facebook?
I think that most of the time when things (usernames or passwords) are being forced down to ASCII, it’s because someone is afraid that more complex character sets will cause breakage in some unknown component. Whether this fear is justified or not is case dependent, but trying to verify that your entire stack really does Unicode correctly in all cases might be difficult. It’s getting better every day, but you can still find problems with Unicode in some places.
I personally keep my usernames and passwords all ASCII, and I even try not to use too much punctuation. One reason is that some input devices (like some mobile phones) make it kind of difficult to get to some of the more esoteric characters. Another reason is that I’ve more than once encountered a system that had no restrictions on the password contents, but then screwed up if you actually used something other than a letter or number.
There is a risk involved if some parts of your program assume that strings with different bytes are different, but other parts compare strings according to Unicode semantics and consider them the same.
For example, filesystems on Mac OS X enforce a uniform representation of Unicode characters, so the two filenames Ą ('A with ogonek', a single codepoint) and Ą ('A' followed by a combining ogonek) will refer to the same file.
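The composed/decomposed ambiguity can be sketched with Python's unicodedata module (just an illustration of the comparison semantics, not the filesystem's actual code):

```python
import unicodedata

composed = "\u0104"     # 'Ą' — LATIN CAPITAL LETTER A WITH OGONEK, one codepoint
decomposed = "A\u0328"  # 'A' followed by COMBINING OGONEK, two codepoints

# As raw strings they are different codepoint sequences...
print(composed == decomposed)  # False

# ...but after normalizing both sides to the same form they compare equal,
# which is why a normalizing filesystem treats them as one name.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```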
Similarly, one can produce invalid UTF-8 byte sequences in which codepoints that fit in one byte are encoded using multiple bytes (so-called overlong sequences). If you normalize or reject invalid UTF-8 input before processing it, you'll be safe, but if you use, e.g., a Unicode-ignorant programming language together with a Unicode-aware database, the two will see different inputs.
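As a quick demonstration of the overlong-sequence problem, here is a strict decoder (Python's built-in UTF-8 codec) rejecting one:

```python
# b'\xc0\xaf' is an overlong encoding of '/' (U+002F): the codepoint fits in
# one byte but is spelled with two. A strict UTF-8 decoder must reject it —
# a lenient one that accepted it would let '/' sneak past a byte-level filter.
overlong = b"\xc0\xaf"
try:
    overlong.decode("utf-8")
    print("accepted (decoder is not strict!)")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```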
So to avoid that:
You should filter UTF-8 input as early as possible. Reject invalid/overlong sequences.
When comparing Unicode strings, always convert both sides of the comparison to the same Unicode normal form. For usernames you might want NFKD, to reduce the number of possible homograph attacks.
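A sketch of that last point, assuming a hypothetical canonical_username() helper that applies NFKD before any comparison or storage:

```python
import unicodedata

def canonical_username(name: str) -> str:
    # NFKD folds compatibility characters (ligatures, fullwidth forms, etc.)
    # into their plain equivalents, so visually identical spellings collapse
    # to one canonical string before comparison.
    return unicodedata.normalize("NFKD", name)

# 'ﬁle' spelled with the LATIN SMALL LIGATURE FI codepoint, and 'file'
# spelled with fullwidth 'ｆｉ', both reduce to the plain ASCII form:
print(canonical_username("\ufb01le"))        # file
print(canonical_username("\uff46\uff49le"))  # file
```

Comparing the normalized forms instead of the raw input means these lookalike registrations all map to the same account name.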