Problem description
Consider this paragraph of text:
In the above, it's easy to split sentences on the period (.), but this leads to incorrect results when it reaches the periods in U.S.A. Assume I have a list of abbreviations such as
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// sx holds the paragraph text to split.
String abbrev[] = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
String regex = "\\.";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sx);
int beginIndex = 0;
// Check all occurrences
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    String group = matcher.group();
    System.out.println("group: " + group);
    int dotIndex = group.indexOf(".");
    // Text from the previous boundary up to (not including) this period
    String sub = sx.substring(beginIndex, matcher.start() + dotIndex);
    // Skip past the period so the next sentence does not start with it
    beginIndex = matcher.start() + dotIndex + 1;
    System.out.println(sub);
}
I could do a brute-force match against all the abbreviations around dotIndex. Is there a better approach?
Recommended answer
My best guess would be something like: (?<!\.[a-zA-Z])\.(?![a-zA-Z]\.)
which translates to:
(?<!\.[a-zA-Z]) # can't be preceded by a period followed by a single letter
\.
(?![a-zA-Z]\.) # nor can it be followed by a letter and another period
Then you can perform the replace from there.
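As a rough illustration, the pattern above could be used to split text in Java along these lines. This is only a sketch: the class name SentenceSplitter and its split helper are hypothetical names chosen for the example, and it splits on the matched periods rather than replacing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class SentenceSplitter {
    // A period counts as a boundary unless it is preceded by ".<letter>"
    // or followed by "<letter>.", i.e. unless it sits inside a dotted
    // single-letter abbreviation such as "U.S.A."
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : SENTENCE_END.split(text)) {
            String trimmed = part.trim();
            // Drop empty pieces, e.g. when the text starts with a period
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "He lives in the U.S.A. and works in the U.K. office. "
                + "She moved to the U.A.E. last year.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Note that the matched period is consumed by the split, so the returned sentences do not end with it. The pattern also never treats the last period of an abbreviation as a boundary, so a sentence that ends in "U.S.A." is not separated from the one that follows.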
This would require a lot more effort if you needed to catch periods within quotes, though, which the above pattern does not account for.