How do I split a paragraph into sentences separated by periods (.), unless the period is part of an abbreviation?

Question

Consider this paragraph:

In the above, it's easy to split sentences on the period (.), but doing so gives incorrect results when it reaches the periods in U.S.A. Assume I have a list of abbreviations such as:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// sx holds the paragraph to split
String[] abbrev = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
String regex = "\\.";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sx);
int beginIndex = 0;

// Check all occurrences of a period
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");

    String group = matcher.group();
    System.out.println("group: " + group);
    int dotIndex = group.indexOf(".");
    // Text from the previous split point up to, but not including, this period
    String sub = sx.substring(beginIndex, matcher.start() + dotIndex);
    // Resume after the period so the next piece does not begin with "."
    beginIndex = matcher.end();

    System.out.println(sub);
}

I could do a brute-force match against all the abbreviations around dotIndex. Is there a better approach?
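For reference, the brute-force idea mentioned above might look roughly like the following. This is only a sketch of one way to read it: the class name AbbreviationAwareSplitter and the methods inAbbreviation and split are hypothetical, the abbreviation list is the one from the question, and a period directly after a listed abbreviation (as in "U.S.A.") is assumed to belong to it.

```java
import java.util.ArrayList;
import java.util.List;

class AbbreviationAwareSplitter {
    // Abbreviation list taken from the question (lower case, no trailing period).
    static final String[] ABBREV = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};

    // True if the period at dotIndex lies inside, or directly after, a listed abbreviation.
    static boolean inAbbreviation(String text, int dotIndex) {
        for (String a : ABBREV) {
            // k is the offset of the period within the abbreviation;
            // k == a.length() stands for a period directly after it.
            for (int k = 1; k <= a.length(); k++) {
                boolean dotSlot = k == a.length() || a.charAt(k) == '.';
                // regionMatches returns false for out-of-range offsets, so no bounds check is needed.
                if (dotSlot && text.regionMatches(true, dotIndex - k, a, 0, a.length())) {
                    return true;
                }
            }
        }
        return false;
    }

    // Splits on every period that is not attributed to an abbreviation.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int begin = 0;
        for (int i = text.indexOf('.'); i >= 0; i = text.indexOf('.', i + 1)) {
            if (inAbbreviation(text, i)) {
                continue;
            }
            String sentence = text.substring(begin, i).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            begin = i + 1;
        }
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "I visited the U.S.A. last year. Then I flew to the U.K. for a week.";
        split(text).forEach(System.out::println);
    }
}
```

The cost is one pass over the abbreviation list per period, which is fine for a short list. The sketch does no word-boundary check, so an abbreviation that happens to appear inside a longer token would also be treated as a match.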

Recommended answer

My best guess would be something like (?<!\.[a-zA-Z])\.(?![a-zA-Z]\.), which translates to:

(?<!\.[a-zA-Z])    # can't be preceded by a period followed by a single letter
\.
(?![a-zA-Z]\.)     # nor can it be followed by a letter and another period

Then you can perform the replace (or split) from there.
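As an illustration, the pattern can be passed to Pattern.split to cut the paragraph directly. This is a minimal sketch, not a complete sentence splitter: the class name SentenceSplitter, its split method, and the sample text are made up for the example, and trimming the pieces is an added convenience.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SentenceSplitter {
    // A period that is neither preceded by ".<letter>" nor followed by "<letter>."
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)");

    // Splits the paragraph on every matching period and trims the pieces.
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String piece : SENTENCE_END.split(paragraph)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The U.S.A. is large. The U.K. is smaller.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Note that the period right after the last letter of an abbreviation (the final one in "U.S.A.") is never treated as a sentence end, so a sentence that ends with an abbreviation is not split at that point. Single-letter-free abbreviations such as "Mr." do not fit the pattern and are still split.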


Catching periods inside quotes would take a lot more effort, though; the pattern above does not account for that.
