
There seem to be so many ways to create nefarious input that white-listing what input is good usually feels like the safer, simpler option.

For instance, one can fairly easily craft a white list regex that includes good things [a-zA-Z0-9], but this seems to fall apart quickly when considering international content. To clarify, the simple sample regex above would keep valid English alphabet words, but would strip out, for example, valid Spanish letters with diacritics or Chinese characters.
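For illustration, here is a minimal Java sketch of that mismatch (the sample strings are arbitrary; any regex flavor behaves similarly):

```java
import java.util.regex.Pattern;

public class AsciiWhitelist {
    public static void main(String[] args) {
        // The simple ASCII-only whitelist from above
        Pattern ascii = Pattern.compile("^[a-zA-Z0-9]+$");
        System.out.println(ascii.matcher("hello").matches()); // true
        System.out.println(ascii.matcher("año").matches());   // false: the ñ is rejected
        System.out.println(ascii.matcher("你好").matches());   // false: Chinese is rejected
    }
}
```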

Is there a best practice for this type of international input validation?

jaketrent
  • What are you protecting against? Injected code, naughty words? The above regex won't keep the valid English word "won't." – Mike Samuel Aug 10 '11 at 18:04
  • Best practices usually involve normalizing the text to [Normal Form C](http://unicode.org/reports/tr15/#Norm_Forms) before trying to validate or sanitize anything. – Mike Samuel Aug 10 '11 at 18:06
  • Human-written text is notoriously difficult to filter accurately. – this.josh Aug 10 '11 at 21:09
  • @MikeSamuel interesting link there, but I must admit I glazed over almost immediately. Do you have a more practical link for developers? And does this normalizing actually help with validation in itself? – O'Rooney Jan 29 '16 at 03:58
  • @O'Rooney, try the [Normalization FAQ](http://unicode.org/faq/normalization.html), the [Normalization section](http://websec.github.io/unicode-security-guide/character-transformations/#normalization) in the "Unicode security guide". The [Unicode Security Considerations](http://unicode.org/reports/tr36/#Text_Comparison) is also good, but a bit eye-glazy. – Mike Samuel Jan 29 '16 at 20:51
  • This is an excellent question, and it gets at the heart of the difficult decisions security experts have to make, and why we still need humans to make them. Well done, Jake! – Nathan Basanese Apr 12 '16 at 19:03

3 Answers


That's why the character class [[:alnum:]] exists; it matches the characters considered valid alphanumerics in the currently active locale. Of course, that doesn't work well on a web server in the US when someone in Egypt is attempting to provide input through a form, and it doesn't cover punctuation. It also doesn't include spaces, which may or may not matter for your application.

---Edit--- Building on Mark's answer below and using http://www.regular-expressions.info/unicode.html as a reference, one could also use [\p{L}\p{N}] instead of the alnum character class in most common regexp implementations to recognize "all" unicode letters/numbers in all locales known to the regex engine in use. The choice basically comes down to whether the application doing the comparison knows what locale the input comes from or not. And, of course, whether the input is expected to be letters and numbers, or something else (proper names sometimes contain punctuation, for example). :) ---Edit---
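As an illustration in Java (note that Java's `\p{Alnum}` POSIX class is ASCII-only by default rather than locale-aware, which itself shows how flavor-dependent this is):

```java
public class UnicodeAlnum {
    public static void main(String[] args) {
        String input = "Crème100";

        // Java's \p{Alnum} is a POSIX class and matches ASCII only by default
        System.out.println(input.matches("\\p{Alnum}+"));     // false

        // Unicode letter/number properties match letters in any script
        System.out.println(input.matches("[\\p{L}\\p{N}]+")); // true
    }
}
```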

To more directly answer the question - yes, a whitelist is always preferable. It's not always practical, though. Only someone familiar with the specific application can make the call as to what's actually practical.

dannysauer
  • Why would a whitelist or blacklist be preferable? Can you please explain what you mean by this? I assume you're creating a list of allowed/disallowed inputs, which is a completely incorrect approach to security when it comes to databases. – Mark Buffalo Jan 29 '16 at 02:10
  • I feel you're trying to help us here, but you haven't actually managed to say much. What considerations would make it practical or not? – O'Rooney Jan 29 '16 at 03:56
  • In general, managing a whitelist is practical if the contents are easily predictable and manageable. So, using a character class in a regex could be practical, as that would typically use a locale definition that is managed system-wide. Managing a list of all US states would be practical. Managing a list of valid characters in a human name internationally would likely be impractical, as it has huge variance. Managing a blacklist of malicious web sites would be impractical, as it changes and grows very rapidly. – dannysauer Jan 29 '16 at 04:08
  • My answer was admittedly a short response to "should I use a white list or black list", and a suggestion for the specific technical question asked. Depending on the specific situation, there are certainly lots more things to keep in mind, but three years ago I apparently chose not to speculate on those. :) – dannysauer Jan 29 '16 at 04:16
  • @dannysauer I actually had a discussion this morning with a friend about whitelisting. We kept arguing for the same thing (kind of like what we did), but using different terminology. I called it format checking, and he called it whitelisting. I think that's part of what happened here. Yes, whitelisting may be okay in some situations, but what whitelisting is may need to be clarified. Back in 2011, nearly no one used prepared statements. I had my own terminology since I had not been taught this one previously... but already understood the same concept. My apologies. – Mark Buffalo Jan 29 '16 at 14:20
  • Maybe the whitelist is a regex or similar pattern match, maybe it's a literal list of words. Conceptually, though, the end result is a comprehensive list of what is ok. :) The general idea is similar to the principle of least privilege; you only allow the minimum which is necessary, and deny everything else. But that's an idealistic view. If input comes from a discrete set (time of day, for example), it might be more logical to deny time periods from noon to 1 pm than to allow both midnight to noon and 1 to midnight. – dannysauer Jan 29 '16 at 23:44

Assuming you're asking this in the context of Web Development...

You can detect appropriate character sets with simple regex validation. However, you may also be falling victim to security theater: input sanitation is not the answer.

If you are trying to validate for specific locales, and you don't want to accept any other locales, you can choose specific ones using Regex. Here's an example:

  1. `\p{IsHan}` for Chinese (Han) characters
  2. `\p{IsArabic}` for Arabic
  3. `\p{IsThai}` for Thai
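A rough sketch of what such a script whitelist might look like in Java, which spells the script properties `\p{IsHan}`, `\p{IsArabic}`, and `\p{IsThai}` (other regex flavors differ; this also allows basic ASCII, anticipating the warning below):

```java
import java.util.regex.Pattern;

public class ScriptWhitelist {
    // Han script plus ASCII letters/digits and spaces
    private static final Pattern CHINESE_INPUT =
            Pattern.compile("^[\\p{IsHan}a-zA-Z0-9 ]+$");

    public static void main(String[] args) {
        System.out.println(CHINESE_INPUT.matcher("为什么 abc 123").matches()); // true
        System.out.println(CHINESE_INPUT.matcher("مرحبا").matches());         // false (Arabic)
    }
}
```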

However, I'm with O'Rooney here: you should accept everything (as long as it's validated: length, null, format, whitelist), and use Prepared Statements with output sanitation.


Warnings About Language-based Whitelisting

If you insist on going with a Unicode-range-based whitelist, then please keep in mind that you should still allow [a-zA-Z0-9], even if you mean to accept only other locales. On the Chinese internet, people frequently type with English letters. For example, they may attempt to evade censorship by abbreviating characters (just text on Wikipedia, but still NSFW). Many people also use pinyin and Roman numerals.

You can also use Unicode ranges, but when you are using combined ideograph/language sets such as CJK (Chinese, Japanese, and Korean; I believe \p{IsHan} covers the CJK Han characters), you will run into many validation issues.

If you want to exclude by language, you will have trouble when you're expecting Japanese input but instead get Chinese input, or vice versa. The same applies to Korean versus Chinese or Japanese. You will need to find the appropriate Unicode ranges, but note that some languages overlap: Chinese (Hanzi) and Japanese (Kanji) share some characters.

Because you're worried about accepted input, it sounds like you're looking for input sanitation. This is the wrong approach. You should not be "sanitizing" input that goes into a database. Whitelisting is fine (acceptable values, for example).

Sanitizing and Validating Input are two different things. What's the difference?

  1. Sanitizing input could look like this: `stripApostrophesFromString(input);`
  2. Input validation could look like this: `if (input != null && input.Length == acceptableNumber && regexFormatIsValid(input) && isWithinAcceptableRanges(input)) { } else { }`
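To make those two fragments concrete, here is a minimal Java sketch; the method and parameter names are illustrative only:

```java
import java.util.regex.Pattern;

public class InputHandling {
    // Sanitizing: silently alters the input (the approach argued against here)
    static String stripApostrophesFromString(String input) {
        return input.replace("'", "");
    }

    // Validating: accepts or rejects the input, never changes it
    static boolean isValid(String input, int maxLength, Pattern format) {
        return input != null
                && input.length() <= maxLength
                && format.matcher(input).matches();
    }
}
```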

For character-set validation, a variation of the listed regexes could suffice, but will not validate length, format, etc. If you're worried about SQL injection (and you should be), you should be using prepared statements with output sanitation.
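As a hedged illustration of that combination in Java/JDBC (the `users` table and its columns are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserLookup {
    static ResultSet findByName(Connection conn, String name) throws SQLException {
        // The ? placeholder binds the name as data, never as SQL text,
        // so "Rory O'Cune" (or an injection attempt) cannot alter the query
        PreparedStatement stmt =
                conn.prepareStatement("SELECT id, name FROM users WHERE name = ?");
        stmt.setString(1, name);
        return stmt.executeQuery();
    }
}
```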

Output sanitation is essentially converting bad characters, such as script tags, to their equivalent HTML entity. For example, < becomes &lt;, and > becomes &gt;.
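A minimal sketch of that conversion, hand-rolled for illustration (in practice a vetted library encoder is preferable):

```java
public class HtmlEncode {
    static String encodeForHtml(String s) {
        // Escape & first so the entities we emit aren't double-escaped
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }
}
```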

Mark Buffalo
  • It's irresponsible to say that input sanitization should never be done. You've made assumptions about the application which are not stated in the question, including both the regex format and the desired intent. The question simply asks how to verify that characters are legitimate "word" characters, which is presumably useful in the poster's situation. – dannysauer Jan 29 '16 at 02:04
  • Input sanitation should almost never be done. It's very difficult to get right, especially in the case of unicode-based smuggling. In fact, I find it shocking that anyone would advocate for it in nearly all conditions. Keep in mind, `input validation` and `input sanitation` are two entirely different concepts. These assumptions are here as a warning. – Mark Buffalo Jan 29 '16 at 02:09
  • Sanitization should definitely not be used as a stand-alone security measure; on that I think we can agree. However, it's a perfectly valid thing to do as a first-line component of a defense in depth approach. Outside of security concerns, there are definitely benefits to sanitization from a user experience perspective in some applications. Just take trimming trailing spaces from a username as one common example; it's often better to silently do that than to reject a form and make the user do it. – dannysauer Jan 29 '16 at 02:16
  • In database/web dev, sanitation is not a valid thing to do in most cases. It's a support nightmare (removing apostrophes from last names, words, etc.) that can lead to many different issues. It may not protect against unicode-based smuggling in many cases (not all use regex), and because of the countless bugs, it can reduce security. And in your example, input validation would fix that. It would not be better to do that silently due to potential support nightmares. For example: `Rory O'Cune` becomes `Rory OCune`, and now support has to account for his truncated name before searching a database. – Mark Buffalo Jan 29 '16 at 02:21
  • While yes, you can strip spaces from the username, I feel you should display a warning to the user: `Spaces are not allowed`. Users should know why they aren't allowed to do something. By the way, I'm only arguing within the scope of web development and databases. – Mark Buffalo Jan 29 '16 at 02:23
  • Please investigate these subjects: `Prepared Statements`, `Output Sanitation`, `XSS`. – Mark Buffalo Jan 29 '16 at 02:25
  • You're arguing things that aren't part of the question. Yes, if a developer isn't sufficiently familiar with the data set, a list might be wrong. Yes, it's easy to get it wrong if used as a security control. In the limited context of web apps that get people's names and insert into databases, it's hard to validate names correctly to prevent SQL injection. But that's not what was asked, and has nothing to do with cross site scripting. – dannysauer Jan 29 '16 at 02:38
  • @dannysauer "`it's hard to validate names correctly to prevent SQL injection`"? I'm sorry but this is incredibly incorrect. No, it isn't hard. Look up `Prepared Statements`. You simply validate that your input matches the correct parameters (is not null, is within the size constraints of the database column, doesn't have characters that can't be entered into the column, etc), and then you prepare the statement and pass the input as a parameter. No SQL injection occurs *unless* you're using string concatenation. – Mark Buffalo Jan 29 '16 at 02:44
  • Sometimes it makes sense to alert the user to an error (validate), like removing the apostrophe from a name. Sometimes it makes sense to silently correct (sanitize) the input, like the trailing space on a login page. It depends on the audience and application. – dannysauer Jan 29 '16 at 02:45
  • It's hard to validate names to prevent injection /as a developer incorrectly using regexps to sanitize inputs./ I thought that context was implicit in a discussion on regexps and input sanitization, but making that assumption in context was perhaps my error. – dannysauer Jan 29 '16 at 03:08
  • **Input validation:** `if (input != null && input.Length == acceptableNumber && regexFormatIsValid(input))` (followed by PreparedStatements without string concatenation). **Input sanitation:** `stripCharactersFromInput(input);` – Mark Buffalo Jan 29 '16 at 03:11
  • @dannysauer I hope I didn't come across as too harsh. That is not my intent. Rather, I would like to stop seeing input sanitation everywhere, so I get passionate about this. Many of the concepts you're referring to were the prevailing wisdom at the time, but they are no longer "valid" (not that they ever were). Many of us, including myself, just didn't know the right way to do things. I'm sure you're pretty smart and know your trade, but on this security feature, I need to respectfully disagree with you. – Mark Buffalo Jan 29 '16 at 03:22
  • I think we agree more than you realize. You're completely correct in the context you've focused upon (though I still prefer the more portable alnum class). My point is solely that the question was too generic to make the assumptions necessary to say that sanitization is definitely, universally wrong. Because I don't know exactly what context the question poster is asking in, I can't comfortably take that position. – dannysauer Jan 29 '16 at 03:28
  • @dannysauer I've gone over the posts and have decided to delete my other comment. It's good we can agree on this. I agree with you that he's being very generic. However, input validation is almost universally better than input sanitation. Properly implemented input validation prevents the wrong data from getting through, and will prevent buffer overflows. – Mark Buffalo Jan 29 '16 at 03:34
  • For the record, I do see where you're coming from. A decade ago, I was more active in the PHP community, which is a great place to see some truly awful web code. Earlier this month, I was cleaning up a peer's perl database app that used string concatenation with unsanitized user input! And we're primarily security analysts. :/ So your passion in this topic is definitely not lost on me. And I'll capitulate that your note about potential pitfalls is useful given that we are, after all, on the info sec site. ;) – dannysauer Jan 29 '16 at 03:40
  • Also, have you tried your suggested regex against Chinese characters? I'm getting precisely zero matches. 为什么啊 – Mark Buffalo Jan 29 '16 at 03:44
  • I just tested it. If you set the locale to one which matches (which I guessed to be zh_CN.utf8), gnu grep hits all 4 of those chars. `sauer@krieger:~/dev$ LC_ALL=zh_CN.utf-8 grep --color '[[:alnum:]]' <( echo '为什么啊' )` returns `为什么啊`. I had to regenerate the locale on my Ubuntu machine (since I had removed all non-English locales to save space) by running `sudo locale-gen zh_CN.UTF-8`. – dannysauer Jan 29 '16 at 04:01
  • Yeah, that's the problem: what if you have a locale set to English? While mine is set to Chinese, this won't work unless you set the locale first. I don't believe this is the correct approach. – Mark Buffalo Jan 29 '16 at 04:20
  • It depends on the application. If you're running on the client (like maybe Javascript or a local app), then you should generally be able to assume the client has set the locale they want to work in correctly. With web apps, you can often use the Accept-Language header to identify the locale (or locales) likely to have been used by the user. In general, I prefer the approach of letting the user guide things like this /when possible/. :) If it's not possible, then you have to look for a workaround which probably brings its own challenges. – dannysauer Jan 29 '16 at 04:28

Our answer is that for a truly international application, on general input such as people's names, you should accept everything and encode it at display time. Admittedly that (to some extent) passes the problem down to whoever writes the encoding routine.

However, if you have an input that is a specific thing, such as a vehicle number plate or a business identification code, then you should validate it against those rules, regardless of being an international application. Again, a further caveat is that those rules might still be difficult to define; for example, number plate formats vary by country.
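For example, a sketch of one such field-specific rule in Java (the plate format here is invented for illustration; real formats differ per country):

```java
import java.util.regex.Pattern;

public class PlateValidator {
    // Hypothetical format: three uppercase letters, a dash, three digits ("ABC-123")
    private static final Pattern PLATE = Pattern.compile("^[A-Z]{3}-[0-9]{3}$");

    static boolean isValidPlate(String input) {
        return input != null && PLATE.matcher(input).matches();
    }
}
```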

(Edit) Why I prefer encoding over validation:

At the time of validation, the data could potentially go anywhere: a CSV text file, an SQL query, a web page, a config setting. You don't know, and cannot know, what the risky characters are.

At the time of encoding, by definition you know where the data is going, so you can then definitively encode the risky characters.
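A sketch of that idea: the same stored value, encoded differently for each destination (the CSV rule shown is the usual quote-doubling convention):

```java
public class OutputEncoding {
    // HTML sink: escape markup-significant characters
    static String forHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // CSV sink: quote the field and double any embedded quotes
    static String forCsvField(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    // SQL sink: no encoding at all; bind the raw value as a parameter instead
}
```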

O'Rooney
  • I'm with you on most of what you said. Encoding at display time is a very good idea, yes. However, input validation is still necessary: null-checking, length-checking, format checking, etc. – Mark Buffalo Jan 29 '16 at 02:49
  • Mark, thanks for all the excellent input. Then the question for me becomes, how should one do the "format checking"? - it seems like an extremely difficult problem for an international application. – O'Rooney Jan 29 '16 at 03:45
  • http://www.regular-expressions.info/unicode.html is a good start for creating a regex - once you've figured out what "kind" of characters you expect. :) – dannysauer Jan 29 '16 at 04:24
  • Format checking is very broad/vague. For example, you may want your input to be: `你好! 你在哪里吗?` (Hello! Where are you?). Let's say your format is "Greeting! Question?" You would regex for that sort of like this (rough example): `^(\p{InHan}*)\uFF01 (\p{InHan}*)\uFF1F$`. It really depends on your specific format. The parentheses split it into capture groups so you can use a proper regex split. Kind of beyond the scope of commenting, and this regex might not work. :b – Mark Buffalo Jan 29 '16 at 04:35
  • Another example (in English): Let's say your input is the following format: `Word 1-28-2016 !`. For whatever reason, you need that format exactly. Who knows why you need that kind of input? Well, a regex format for that would be something like this: `^(\w*)\s(\d*)-(\d*)-(\d*)\s!$`. You'd test whether the input matches the regex before continuing. Again, the parentheses are capture groups for regex-splitting certain regions into an array. – Mark Buffalo Jan 29 '16 at 04:44
  • Yes, thanks guys, but I think I wasn't clear. I *don't* know what would go in these fields - they are just free-form input fields for people to put names that mean things to them - personal or business names for example. In any language and character set. Is that a solvable problem? I think not. – O'Rooney Jan 31 '16 at 21:33