Question
I am developing a program in which I need to filter words and sentences that contain non-Latin characters. The problem is that I can only find words and sentences made up entirely of Latin characters; I cannot find words and sentences that mix Latin and non-Latin characters. For example, "Hello" is a Latin-letter word, and I can match it using this code:
Match match = Regex.Match(line.Line, @"[^\u0000-\u007F]+", RegexOptions.IgnoreCase);
if (match.Success)
{
    line.Line = match.Groups[1].Value;
}
But it does not find words or sentences that mix in non-Latin letters, for example: "Hellø I am sømthing".
Also, could somebody explain what RegexOptions.None and RegexOptions.IgnoreCase are and what they stand for?
Recommended answer
The four "Latin" blocks are (from http://www.fileformat.info/info/unicode/block/index.htm):
Basic Latin U+0000 - U+007F
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
So a Regex to "include" all of them would be:
Regex.Match(line.Line, @"[\u0000-\u024F]+", RegexOptions.None);
while a Regex to catch anything outside those blocks would be:
Regex.Match(line.Line, @"[^\u0000-\u024F]+", RegexOptions.None);
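As a minimal sketch of how the two patterns above could be combined to classify a string, consider the following. The class name LatinFilter and the helpers IsLatinOnly and IsMixed are hypothetical names chosen for illustration; they are not part of the original answer. Note that under this range "ø" (U+00F8) counts as Latin, so "Hellø I am sømthing" is reported as Latin-only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LatinFilter
{
    // Runs of characters inside Basic Latin .. Latin Extended-B.
    private static readonly Regex Inside = new Regex(@"[\u0000-\u024F]+", RegexOptions.None);

    // Runs of characters outside those four blocks.
    private static readonly Regex Outside = new Regex(@"[^\u0000-\u024F]+", RegexOptions.None);

    // True when no character falls outside U+0000..U+024F.
    public static bool IsLatinOnly(string text) => !Outside.IsMatch(text);

    // True when the text contains characters from both sides of the range.
    public static bool IsMixed(string text) => Inside.IsMatch(text) && Outside.IsMatch(text);

    public static void Main()
    {
        foreach (var s in new[] { "Hello", "Hellø I am sømthing", "Hello мир", "мир" })
        {
            Console.WriteLine($"{s}: latinOnly={IsLatinOnly(s)}, mixed={IsMixed(s)}");
        }
    }
}
```

To read the matched text itself, match.Value returns the whole match; match.Groups[1] is only populated when the pattern contains a capturing group.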
Note that I do feel that doing a regex "by block" is a little wrong, especially with the Latin blocks: the Basic Latin block alone contains control characters (such as newline), letters (A-Z, a-z), digits (0-9), punctuation (.,;:...), other characters ($@/&...) and so on.
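One possible way to narrow the block range to letters only is .NET's character class subtraction, which removes a nested class from a base class. The sketch below is an assumption about what "more correct" might look like, not something the original answer proposes; the class name LatinLetters and the method Words are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LatinLetters
{
    // Base class: U+0000..U+024F. Subtracted class: every character that is not a letter (\P{L}).
    // What remains is "letters inside the four Latin blocks".
    private static readonly Regex LatinLetterRun = new Regex(@"[\u0000-\u024F-[\P{L}]]+");

    // Returns each run of Latin letters found in the text.
    public static string[] Words(string text) =>
        LatinLetterRun.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Digits, punctuation and spaces are skipped; only letter runs are printed.
        Console.WriteLine(string.Join("|", Words("Hellø, I am sømthing 42!")));
    }
}
```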
As for RegexOptions.None and RegexOptions.IgnoreCase: their names are quite clear, and you could try searching for them on MSDN.
From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx:
RegexOptions.None: Specifies that no options are set.
RegexOptions.IgnoreCase: Specifies case-insensitive matching.
The last one means that if you do Regex.Match(line.Line, @"ABC", RegexOptions.IgnoreCase), it will match ABC, Abc, abc, and so on. This option works even on character ranges: [A-Z] will match both A-Z and a-z. Note that it is probably useless in this case, because the blocks I suggested already contain both the uppercase and the lowercase variants of every letter that has both.
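The difference between the two options can be seen in a short sketch like the one below. The class name IgnoreCaseDemo and the wrapper method Matches are hypothetical and exist only to keep the example compact; Regex.IsMatch and the RegexOptions values are the standard .NET ones.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class IgnoreCaseDemo
{
    // Thin wrapper around Regex.IsMatch so the options are easy to compare.
    public static bool Matches(string input, string pattern, RegexOptions options) =>
        Regex.IsMatch(input, pattern, options);

    public static void Main()
    {
        // Literal pattern: case matters unless IgnoreCase is set.
        Console.WriteLine(Matches("abc", "ABC", RegexOptions.None));              // False
        Console.WriteLine(Matches("abc", "ABC", RegexOptions.IgnoreCase));        // True

        // Character range: [A-Z] also matches a-z when IgnoreCase is set.
        Console.WriteLine(Matches("hello", "^[A-Z]+$", RegexOptions.None));       // False
        Console.WriteLine(Matches("hello", "^[A-Z]+$", RegexOptions.IgnoreCase)); // True
    }
}
```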