Problem description
I recently hit a bug caused by a data-quality issue combined with browser behavior, and I am looking for a safe rule for escaping strings that does not double the size of the string unless it has to.
The UTF-8 byte sequence E2-80-A8 (U+2028, LINE SEPARATOR) is a perfectly valid character in the Unicode database. However, that sequence represents a line separator (yes, one other than 0A).
Unfortunately, many browsers (including Chrome, Firefox, and Safari; I didn't test others) failed to process a JSONP callback containing a string with that Unicode character. The JSONP was included by a non-Unicode HTML page over which I had no control.
The browsers simply reported an INVALID CODE/syntax error for JavaScript that looked valid in debug tools and in every text editor. My guess is that the browser tried to convert E2-80-A8 to BIG-5 and broke the JS syntax.
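Whatever the encoding did here, one well-documented cause of this kind of error is that, before ECMAScript 2019, JavaScript treated U+2028 and U+2029 as line terminators that could not appear unescaped inside a string literal, while JSON allows them. A minimal Python sketch of a targeted escape, which leaves all other non-ASCII text alone so the payload does not grow unnecessarily, might look like the following. The function names `jsonp_safe_dumps` and `jsonp` are hypothetical, chosen for illustration only.

```python
import json


def jsonp_safe_dumps(value):
    """Serialize to JSON, escaping only the two characters that older
    JavaScript engines reject inside string literals.

    Hypothetical helper for illustration, not part of any library.
    """
    # ensure_ascii=False keeps other non-ASCII characters as-is,
    # so the output is not inflated with \uXXXX escapes.
    text = json.dumps(value, ensure_ascii=False)
    # Replace the raw characters with their six-character JSON escapes.
    return text.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")


def jsonp(callback, value):
    """Wrap the escaped JSON in a JSONP callback invocation."""
    return f"{callback}({jsonp_safe_dumps(value)});"
```

The escaped output is still valid JSON and decodes back to the original value, so consumers that parse it as JSON are unaffected.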
The above is only one example of how Unicode can break your system unexpectedly. As far as I know, attackers can exploit RTL and other control characters for their own benefit. And there are many "quotes", "spaces", "symbols", and "controls" in the Unicode specification.
QUESTION:
Is there a list of Unicode characters that every programmer should know about, with hidden features (and bugs) that we might not want to take effect in our applications? (For example, Windows disables RTL in filenames.)
Edit:
I am not asking about JSON or JavaScript. I am asking for general best practices for Unicode handling in all programs.
Recommended answer
There's a database of character properties and a report describing it, the Unicode Character Database, that gives a good idea of how browsers "should" treat a code point. I love that word, "should". The safest approach is a whitelist; you could probably go with L|M|N|S: Letter, Mark, Number, or Symbol.
Take a look at the ICU project.
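As a rough illustration of the L|M|N|S whitelist, here is a minimal Python sketch built on the standard `unicodedata` module, which exposes the General_Category property from the Unicode Character Database. The names `is_allowed` and `filter_text`, and the `extra_allowed` parameter, are assumptions made for this example. Note that a strict L|M|N|S whitelist also drops punctuation (P) and every separator (Z), including the plain space, so the sketch lets the caller re-admit specific characters.

```python
import unicodedata

# Major General_Category classes kept by the whitelist:
# Letter, Mark, Number, Symbol.
ALLOWED_MAJOR_CATEGORIES = {"L", "M", "N", "S"}


def is_allowed(ch):
    """True if the character's major category is L, M, N, or S."""
    # unicodedata.category returns a two-letter code such as "Lu" or "Zl";
    # the first letter is the major class.
    return unicodedata.category(ch)[0] in ALLOWED_MAJOR_CATEGORIES


def filter_text(text, extra_allowed=" "):
    """Drop every character outside the whitelist.

    extra_allowed is a hypothetical escape hatch for characters such as
    the ASCII space, which the strict whitelist would otherwise remove.
    """
    return "".join(
        ch for ch in text if ch in extra_allowed or is_allowed(ch)
    )
```

With this sketch, U+2028 (category Zl) and the right-to-left override U+202E (category Cf) are both removed, while letters, digits, and symbols such as `+` pass through. Whether punctuation should be re-admitted depends on the application.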