Question
I have some documents that went through OCR conversion from PDF to HTML. As a result, they ended up with lots of random Unicode punctuation where the converter messed up (e.g. ellipses). They also correctly contain a bunch of non-English but still alphabetic characters, such as é and Russian characters.
Is there any way to write a regex that will match any Unicode alphabetic character (a letter from the alphabet of any language)? Or one that will match only non-alphabetic characters? Either one would be really helpful and awesome. I'm using Perl, if that changes anything. Thanks!
Recommended answer
Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably
\p{L}
which matches any letter or ideograph. An accented letter may be encoded as a base letter followed by one or more combining marks, so to include those you could use
\p{L}\p{M}*
In any case, all the different types of character properties are detailed in the first link.
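As a rough illustration of both directions asked about (matching letters, and matching everything else), here is a minimal Perl sketch. The helper names `letter_runs` and `strip_non_letters` and the sample string are made up for this example, and it assumes the text has already been decoded into Perl characters (for instance by opening the file with an `:encoding(UTF-8)` layer), since `\p{L}` works on characters rather than raw bytes.

```perl
use strict;
use warnings;
use utf8;                             # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');  # encode characters as UTF-8 on output

# Hypothetical helper: return each run of letters, keeping combining marks
# attached to the letter they follow.
sub letter_runs {
    my ($text) = @_;
    return $text =~ /((?:\p{L}\p{M}*)+)/g;
}

# Hypothetical helper: drop everything that is not a letter, a mark, or whitespace.
sub strip_non_letters {
    my ($text) = @_;
    $text =~ s/[^\p{L}\p{M}\s]+//g;
    return $text;
}

my $sample = "Café… Привет, naïve!?";   # made-up stand-in for OCR output
print join('|', letter_runs($sample)), "\n";
print strip_non_letters($sample), "\n";
```

The uppercase form `\P{L}` is the negation of `\p{L}`, so it can be used on its own when only non-letters need to be matched. The negated bracket class above is used instead so that combining marks and whitespace survive the cleanup.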
You may also want to look at the Stack Overflow question "Does \w match all alphanumeric characters defined in the Unicode standard?", which discusses whether \w matches Unicode characters. The answers there suggest that you could also use \p{Word} or \p{Alnum}.
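To see how these classes differ from `\p{L}`, the following small sketch tests a few single characters against each one. The helper `classify` and the sample characters are hypothetical, and the `/u` modifier (which asks for Unicode rules explicitly) assumes Perl 5.14 or later.

```perl
use strict;
use warnings;
use utf8;                             # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');  # encode characters as UTF-8 on output

# Hypothetical helper: report which classes a single character belongs to.
sub classify {
    my ($ch) = @_;
    return sprintf 'L=%d w=%d Alnum=%d',
        ($ch =~ /\A\p{L}\z/u     ? 1 : 0),
        ($ch =~ /\A\w\z/u        ? 1 : 0),
        ($ch =~ /\A\p{Alnum}\z/u ? 1 : 0);
}

for my $ch ('é', 'Я', '7', '_', '…') {
    print "$ch  ", classify($ch), "\n";
}
```

The practical difference is that `\w` and `\p{Alnum}` also accept digits, and `\w` accepts the underscore as well, so `\p{L}` is the narrower choice when only letters should match.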