Assuming you're asking this in the context of Web Development...
You can detect appropriate character sets with simple regex validation. However, you may also be falling victim to security theater: input sanitization is not the answer.
If you are trying to validate for specific locales, and you don't want to accept any others, you can restrict input to specific scripts with a regex. For example:
\p{IsHan} for Chinese characters (Han is a script, not a block, so Java-style engines use the Is prefix here)
\p{InArabic} for Arabic
\p{InThai} for Thai
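As a rough illustration of the same idea in Python: the built-in re module does not support the \p{...} classes above, so this sketch spells out the code point ranges of three Unicode blocks instead. The names BLOCKS and is_only are made up for the example, and the Han range covers only the basic CJK Unified Ideographs block, not the extensions.

```python
import re

# Explicit code point ranges standing in for the \p{...} classes, which
# Python's built-in re module does not support (assumption: block-level
# matching is close enough for this illustration).
BLOCKS = {
    "han": r"\u4e00-\u9fff",     # CJK Unified Ideographs (basic block only)
    "arabic": r"\u0600-\u06ff",  # Arabic block
    "thai": r"\u0e00-\u0e7f",    # Thai block
}


def is_only(block: str, text: str) -> bool:
    """Return True if text is non-empty and every character is in the block."""
    return re.fullmatch(f"[{BLOCKS[block]}]+", text) is not None
```

With this, is_only("thai", "สวัสดี") is accepted while is_only("han", "hello") is rejected. Note that this only checks the character set; it says nothing about length or format.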
However, I'm with O'Rooney here: you should accept everything (as long as it's validated: length, null, format, whitelist), and use prepared statements with output sanitization.
Warnings About Language-based Whitelisting
If you insist on going with a Unicode-range-based whitelist, then please keep in mind that you should still allow [a-zA-Z0-9], even though you're accepting only other locales. On the Chinese internet, people frequently type with English letters. For example, they may attempt to evade censorship by abbreviating characters (just text on Wikipedia, but still NSFW). Many people also use pinyin and Roman numerals.
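A whitelist along those lines might look like the following Python sketch. The pattern name HAN_OR_ASCII and the helper is_allowed are hypothetical, and the Han range again covers only the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical whitelist: basic CJK Unified Ideographs plus ASCII letters
# and digits, so mixed input such as pinyin or numbers is still accepted.
HAN_OR_ASCII = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")


def is_allowed(text: str) -> bool:
    """Return True if text is non-empty and uses only whitelisted characters."""
    return HAN_OR_ASCII.fullmatch(text) is not None
```

Whitespace and punctuation are deliberately left out here, so "hello world" would be rejected; a real whitelist would probably need to add them.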
You can also use Unicode ranges, but when you are using combined ideograph/language sets such as CJK (Chinese, Japanese, and Korean; I do believe \p{IsHan} is CJK), you will run into many validation issues.
If you want to exclude by language, you will have trouble when you're expecting Japanese input but get Chinese input instead, or vice versa. The same applies to Korean versus Chinese or Japanese. You will need to find the appropriate Unicode ranges, but note that some languages overlap: Chinese (Hanzi) and Japanese (Kanji) share some characters.
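To make the overlap concrete, here is a small Python sketch. The word 日本 ("Japan") is written with the same two ideographs in both Chinese and Japanese, so a range check cannot tell which language it came from. The helper scripts_seen is made up for the example; the kana range covers the Hiragana and Katakana blocks.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")   # basic CJK Unified Ideographs
KANA = re.compile(r"[\u3040-\u30ff]")  # Hiragana and Katakana blocks


def scripts_seen(text: str) -> set:
    """Report which of the two ranges occur anywhere in text."""
    found = set()
    if HAN.search(text):
        found.add("han")
    if KANA.search(text):
        found.add("kana")
    return found
```

Here scripts_seen("日本") reports only Han, which is equally consistent with Chinese and Japanese. Kana in the input hints at Japanese, but its absence proves nothing.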
Because you're worried about accepted input, it sounds like you're looking for input sanitization. This is the wrong approach. You should not be "sanitizing" input that goes into a database. Whitelisting is fine (acceptable values, for example).
Sanitizing and validating input are two different things. What's the difference?
- Sanitizing input could look like this:
stripApostrophesFromString(input);
- Input validation could look like this:
if (input != null && input.Length == acceptableNumber && regexFormatIsValid(input) && isWithinAcceptableRanges(input)) { /* accept the input */ } else { /* reject the input */ }
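The two bullets above can be sketched in Python as follows. Everything here is hypothetical: the field (a four-digit year of birth), the accepted range, and the function names are only there to show the contrast.

```python
import re


def strip_apostrophes(text: str) -> str:
    """Sanitizing: silently alters whatever the user typed."""
    return text.replace("'", "")


YEAR_FORMAT = re.compile(r"[0-9]{4}")  # hypothetical format: four ASCII digits


def is_valid_year(text) -> bool:
    """Validating: accepts or rejects the input without changing it."""
    return (
        text is not None                              # null check
        and len(text) == 4                            # length check
        and YEAR_FORMAT.fullmatch(text) is not None   # format check
        and 1900 <= int(text) <= 2100                 # range check (assumed bounds)
    )
```

Note what the sanitizer does to a legitimate name: strip_apostrophes("O'Brien") returns "OBrien", which is now simply wrong data. The validator never modifies anything; it only says yes or no.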
For character-set validation, a variation of the regexes listed above could suffice, but it will not validate length, format, etc. If you're worried about SQL injection (and you should be), you should be using prepared statements with output sanitization.
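For illustration, here is roughly what that looks like with Python's built-in sqlite3 module, which exposes the idea as parameterized queries. The users table and the add_user helper are invented for the example; other databases and drivers use different placeholder syntax.

```python
import sqlite3

# In-memory database with a hypothetical one-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")


def add_user(conn: sqlite3.Connection, name: str) -> None:
    # The ? placeholder binds name as data; it is never parsed as SQL text.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


hostile = "Robert'); DROP TABLE users;--"
add_user(conn, hostile)      # stored verbatim, the table survives
add_user(conn, "王小明")      # non-ASCII input needs no special handling

rows = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY rowid")]
```

Both strings come back from the database exactly as they went in, which is the point: nothing needed to be stripped or rejected on character-set grounds to keep the query safe.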
Output sanitization is essentially converting bad characters, such as those in script tags, to their equivalent HTML entities. For example, < becomes &lt;, and > becomes &gt;.
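In Python, the standard library's html.escape does this conversion. A minimal sketch, where render_comment is a hypothetical helper that wraps a user comment in a paragraph tag:

```python
import html


def render_comment(comment: str) -> str:
    """Escape user-supplied text at output time, then wrap it in markup."""
    return "<p>" + html.escape(comment) + "</p>"
```

The escaping happens when the value is written into the page, not when it is stored, so the database keeps the original text. Characters outside ASCII, such as 中文, pass through unchanged.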