Problem description
In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
- The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
- The range 0x7F-0x9F (more control characters)
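As a minimal sketch of the two ranges above, assuming Python 3 strings and taking the C0 range to be U+0000-U+001F as in the Wikipedia article, a single regular expression can drop these characters while keeping tab, LF, and CR. The name `strip_control_chars` is hypothetical, chosen only for illustration.

```python
import re

# C0 controls (U+0000-U+001F) minus tab, LF and CR,
# plus DEL and the C1 controls (U+007F-U+009F).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_control_chars(text: str) -> str:
    """Remove C0/C1 control characters, keeping tab, LF and CR."""
    return _CONTROL_CHARS.sub("", text)


if __name__ == "__main__":
    print(repr(strip_control_chars("a\x00b\tc\r\n\x7fd")))
```

Whether to drop these characters silently or to reject the input is a policy choice; this sketch drops them.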
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.
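For reference, those further levels might look roughly like the sketch below. It assumes Python 3, NFC as the canonical form, and a small hand-picked set of nonbreaking and zero-width characters; the function name `canonicalize` and the exact character set are illustrative assumptions, not a complete list.

```python
import unicodedata

# Hypothetical mapping: nonbreaking characters become plain equivalents,
# zero-width characters are removed (mapped to None).
_EXTRA = str.maketrans({
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u202f": " ",   # NARROW NO-BREAK SPACE
    "\u2011": "-",   # NON-BREAKING HYPHEN
    "\u200b": None,  # ZERO WIDTH SPACE
    "\u200c": None,  # ZERO WIDTH NON-JOINER
    "\u200d": None,  # ZERO WIDTH JOINER
    "\u2060": None,  # WORD JOINER
})


def canonicalize(text: str) -> str:
    """Apply NFC normalization, then replace or drop the characters above."""
    return unicodedata.normalize("NFC", text).translate(_EXTRA)


if __name__ == "__main__":
    print(repr(canonicalize("e\u0301\u00a0x\u200by")))
```

Note that the zero-width joiner and non-joiner are meaningful in some scripts and in emoji sequences, so removing them is not always harmless.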
Recommended answer
See the W3 Unicode in XML and other markup languages note. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:
- U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
- U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
- language override control codes that could also have scope outside of an element;
- BOM.
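A rough sketch of filtering the characters listed above, assuming Python 3. It takes "language override control codes" to mean the language tag block U+E0000-U+E007F and the BOM to mean U+FEFF; both readings are assumptions here, and the name `strip_discouraged` is hypothetical.

```python
import re

_DISCOURAGED = re.compile(
    "["
    "\u2028\u2029"            # LINE SEPARATOR, PARAGRAPH SEPARATOR
    "\u202a-\u202e"           # bidi embedding and override controls
    "\U000e0000-\U000e007f"   # language tag characters (assumed reading)
    "\ufeff"                  # BOM / ZERO WIDTH NO-BREAK SPACE
    "]"
)


def strip_discouraged(text: str) -> str:
    """Remove the characters from the list above."""
    return _DISCOURAGED.sub("", text)


if __name__ == "__main__":
    print(repr(strip_discouraged("a\u2028b\u202ec\ufeffd")))
```

One caveat: the tag characters U+E0020-U+E007F are also used in some emoji flag sequences, so a site that wants to display those would need to narrow this range.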
Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (U+FFFF et al), and, if you are using a language that works in UTF-16 natively (eg. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
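The following is one possible sketch of that step, assuming Python 3, where a `str` is a sequence of code points, so any surrogate found in it is treated as unpaired. It replaces noncharacters (U+FDD0-U+FDEF and every code point ending in FFFE or FFFF) and surrogates with U+FFFD; the names `is_noncharacter` and `replace_invalid` are hypothetical, and replacing rather than dropping is a choice made for this example.

```python
REPLACEMENT = "\ufffd"  # REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_invalid(text: str) -> str:
    """Replace noncharacters and surrogate code points with U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
            out.append(REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(repr(replace_invalid("a\uffffb\ud800c")))
```

In a language whose strings are UTF-16 code units, the surrogate check would instead need to look at neighbouring units to tell a valid pair from a lone surrogate.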
And arguably (esp for a web application), lose CR as well, and turn tabs into spaces.
Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
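A small sketch of the CR and tab handling discussed here, assuming Python 3. Turning CRLF and lone CR into LF (rather than deleting CR outright) and the `tab_width` parameter are assumptions made for this example, as is the name `normalize_whitespace`.

```python
def normalize_whitespace(text: str, tab_width: int = 4) -> str:
    """Turn CRLF and lone CR into LF, and each tab into tab_width spaces."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\t", " " * tab_width)


if __name__ == "__main__":
    print(repr(normalize_whitespace("a\r\nb\rc\td")))
```

Replacing CRLF first keeps a Windows line ending from turning into two newlines.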