Problem description
I am trying to scrape a Unicode string using JavaScript. The string may contain mixed characters. Example: 我的中文不好。我是意大利人。你知道吗?
Ultimately, the string may contain:
- Chinese characters
- Chinese punctuation
- ANSI characters and punctuation
I need to keep only the Chinese characters. Any hints?
Recommended answer
You can find the relevant blocks at http://www.unicode.org/reports/tr38/#BlockListing or http://www.unicode.org/charts/.
If you are excluding compatibility characters (ones which should no longer be used), as well as strokes, radicals, and Enclosed CJK Letters and Months, the following ought to cover it (each block is followed by its equivalent JavaScript expression):
- CJK Unified Ideographs (4E00-9FCC)
[\u4E00-\u9FCC]
- CJK Unified Ideographs Extension A (3400-4DB5)
[\u3400-\u4DB5]
- CJK Unified Ideographs Extension B (20000-2A6D6)
[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6]
- CJK Unified Ideographs Extension C (2A700-2B734)
\ud869[\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34]
- CJK Unified Ideographs Extension D (2B740-2B81D)
\ud86d[\udf40-\udfff]|\ud86e[\udc00-\udc1d]
- 12 characters within the CJK Compatibility Ideographs (F900-FA6D/FA70-FAD9) but which are actually CJK unified ideographs
[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]
...so a regex to grab the Chinese characters would be:
/[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/
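As a minimal sketch of how the expression above could be applied to the string from the question: the "g" flag is added so that every match is returned, and the matches are joined back together. The helper name keepChinese is hypothetical, chosen only for this illustration.

```javascript
// The combined expression from above, with the "g" flag added so that
// String.prototype.match returns every match rather than only the first.
const chineseChars = /[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/g;

// keepChinese is a hypothetical helper name, not part of any library.
function keepChinese(text) {
  const matches = text.match(chineseChars); // null when nothing matches
  return matches ? matches.join('') : '';
}

console.log(keepChinese('我的中文不好。我是意大利人。你知道吗?'));
// 我的中文不好我是意大利人你知道吗
```

The ideographic full stop (U+3002) and the question mark fall outside every listed range, so they are dropped.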
Owing to the many CJK (Chinese-Japanese-Korean) characters, Unicode was expanded to handle characters beyond the "Basic Multilingual Plane" (called "astral" characters). The CJK Unified Ideographs extensions B-D are examples of such astral characters, so their ranges are more complicated: in UTF-16 systems like JavaScript they have to be encoded using surrogate pairs. A surrogate pair consists of a high surrogate and a low surrogate, neither of which is valid by itself, but which together form a single character despite having a string length of 2.
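To make the surrogate arithmetic concrete, here is a small sketch that derives the pair for a code point above U+FFFF using the standard UTF-16 formula. The function name toSurrogatePair is hypothetical; the two code points are the first and last of Extension B as listed earlier.

```javascript
// toSurrogatePair is a hypothetical helper; it applies the standard UTF-16
// formula to a code point above U+FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits of the offset
  return [high, low];
}

const hex = (n) => n.toString(16);

console.log(toSurrogatePair(0x20000).map(hex)); // [ 'd840', 'dc00' ]
console.log(toSurrogatePair(0x2A6D6).map(hex)); // [ 'd869', 'ded6' ]

const ch = String.fromCodePoint(0x20000);
console.log(ch.length);      // 2 (two UTF-16 code units)
console.log([...ch].length); // 1 (one code point)
```

These values line up with the Extension B expression, which starts at \ud840\udc00 and ends at \ud869\uded6.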
While it would probably be easier for replacement purposes to express this as the non-Chinese characters (and replace them with the empty string), I provided the expression for the Chinese characters instead, so that it is easier to keep track of the blocks in case you need to add or remove any.
Update, September 2017
As of ES6, one may express the regular expressions without resorting to surrogates by using the "u" flag along with code points inside the new brace-delimited escape sequence, e.g., /^[\u{20000}-\u{2A6D6}]*$/u for "CJK Unified Ideographs Extension B".
Note that Unicode has also progressed to include "CJK Unified Ideographs Extension E" ([\u{2B820}-\u{2CEAF}]) and "CJK Unified Ideographs Extension F" ([\u{2CEB0}-\u{2EBEF}]).
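With the "u" flag, all of the ranges mentioned in this answer fit into one character class, which also makes the negated form (strip everything that is not Chinese) straightforward. The following is a sketch under that assumption; the names ranges, nonChinese, and stripNonChinese are hypothetical, and the class is built from strings only to keep one block per line.

```javascript
// One entry per block named in this answer. Backslashes are doubled because
// these are string literals handed to the RegExp constructor.
const ranges = [
  '\\u4E00-\\u9FCC',       // CJK Unified Ideographs
  '\\u3400-\\u4DB5',       // Extension A
  '\\u{20000}-\\u{2A6D6}', // Extension B
  '\\u{2A700}-\\u{2B734}', // Extension C
  '\\u{2B740}-\\u{2B81D}', // Extension D
  '\\u{2B820}-\\u{2CEAF}', // Extension E
  '\\u{2CEB0}-\\u{2EBEF}', // Extension F
  // The 12 unified ideographs inside the compatibility block
  '\\uFA0E\\uFA0F\\uFA11\\uFA13\\uFA14\\uFA1F\\uFA21\\uFA23\\uFA24\\uFA27-\\uFA29',
].join('');

// Negated class: matches any code point outside the ranges above.
const nonChinese = new RegExp('[^' + ranges + ']', 'gu');

// stripNonChinese is a hypothetical helper name.
function stripNonChinese(text) {
  return text.replace(nonChinese, '');
}

console.log(stripNonChinese('我是意大利人。OK? \u{20000}'));
// "我是意大利人" followed by U+20000
```

Because the "u" flag makes the engine work on code points rather than UTF-16 code units, an astral character is kept or removed as a whole and is never split into its two surrogates.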
For ES2018, it appeared that Unicode property escapes would simplify things even further. Per http://2ality.com/2017/07/regexp-unicode-property-escapes.html , it looked as though one would be able to write the expression below. Note, however, that ES2018 as standardized supports only General_Category, Script, Script_Extensions, and the binary properties, not Block, so conforming engines reject Block-based (or Blk-based) expressions with a SyntaxError:
/^(\p{Block=CJK Unified Ideographs}|\p{Block=CJK Unified Ideographs Extension A}|\p{Block=CJK Unified Ideographs Extension B}|\p{Block=CJK Unified Ideographs Extension C}|\p{Block=CJK Unified Ideographs Extension D}|\p{Block=CJK Unified Ideographs Extension E}|\p{Block=CJK Unified Ideographs Extension F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u
And since the shorter aliases from http://unicode.org/Public/UNIDATA/PropertyAliases.txt and http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt can also be used for these blocks, you could shorten this to the following (apparently also changing underscores to spaces, or changing the casing, if desired): /^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u
And if we wanted to improve readability, we could document the falsely labeled compatibility characters using a named capture group (see http://2ality.com/2017/05/regexp-named-capture-groups.html ):
/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u
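To show the named group in isolation, here is a sketch that pairs it with a plain code point range for the main block. That pairing is an assumption made only to keep the example short and runnable, and the variable names are hypothetical. It requires an engine with ES2018 named capture groups.

```javascript
// Main block as a plain range; the 12 unified ideographs from the
// compatibility block are documented by a named capture group.
const withNamedGroup = /^(?:[\u4E00-\u9FCC]|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u;

// The input ends with one of the 12 characters, so the named group is set.
const hit = withNamedGroup.exec('中\uFA0E');
console.log(hit.groups.CJKFalseCompatibilityUnifieds === '\uFA0E'); // true

// No character from the named group takes part in this match.
const plain = withNamedGroup.exec('中文');
console.log(plain.groups.CJKFalseCompatibilityUnifieds); // undefined
```

Since the group sits inside a repeated alternative, it only reflects the final repetition, so here the name serves mainly as documentation rather than as a way to extract data.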
And since, per http://unicode.org/reports/tr44/#Unified_Ideograph , the "Unified_Ideograph" property (alias "UIdeo") appears to cover all of our unified ideographs while excluding symbols/punctuation and compatibility characters, the following may be all you need if you don't have to pick and choose among the blocks above (ECMAScript writes binary properties without a value, so there is no "=yes"):
/^\p{Unified_Ideograph}*$/u
Or in shorthand:
/^\p{UIdeo}*$/u
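Applied to the original question, the property escape can be negated with \P to strip everything that is not a unified ideograph. This is a sketch for engines that implement ES2018 property escapes; there, binary properties take no value, so it uses the bare property name rather than a name=value form. The helper name keepUnifiedIdeographs is hypothetical.

```javascript
// \P{...} (capital P) is the negation: any code point that is NOT a
// unified ideograph. Needs an engine with ES2018 Unicode property escapes.
const nonIdeograph = /\P{Unified_Ideograph}/gu;

// keepUnifiedIdeographs is a hypothetical helper name.
function keepUnifiedIdeographs(text) {
  return text.replace(nonIdeograph, '');
}

console.log(keepUnifiedIdeographs('我的中文不好。我是意大利人。你知道吗?'));
// 我的中文不好我是意大利人你知道吗

// The short alias behaves the same way in a whole-string test.
console.log(/^\p{UIdeo}*$/u.test('中文'));   // true
console.log(/^\p{UIdeo}*$/u.test('中文。')); // false
```

Which code points count as unified ideographs depends on the Unicode version the engine ships with, so newer extensions are picked up without editing the expression.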