Question
Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.
Here's my experience with BreakIterator:
Using the example here: I have the following Japanese:
今日はパソコンを買った。高性能のマックは早い!とても快適です。
Written as ASCII-only Unicode escapes, it looks like this:
\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002
Here's the part of that sample that I changed:
static void sentenceExamples() {
    Locale currentLocale = new Locale("ja", "JP");
    BreakIterator sentenceIterator =
        BreakIterator.getSentenceInstance(currentLocale);
    String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
    // ... rest of the sample unchanged
}
When I look at the boundary indices, I see this:
0|13|24|32
But those indices don't correspond to any sentence terminators.
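For reference, here is a minimal self-contained sketch of the boundary loop, so the offsets can be checked against the characters around them. The class and method names (SentenceBoundaryDemo, boundaries) are made up for illustration and are not part of the original sample; only the java.text.BreakIterator calls are standard. One thing worth checking with it: the escaped form of the text above begins with \ufeff, and if that character is counted as index 0, then offsets 13, 24 and 32 fall immediately after \u3002, \uff01 and \u3002. A BreakIterator boundary is the offset where the next sentence starts, not the index of the terminator itself.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    // Collects every boundary offset the sentence iterator reports,
    // including the start (0) and the end (text.length()).
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    // Prints each boundary together with the code point of the character just before it.
    static void report(String label, String text) {
        List<Integer> bounds = boundaries(text, Locale.JAPAN);
        System.out.println(label + ": " + bounds);
        for (int i = 1; i < bounds.size(); i++) {
            int end = bounds.get(i);
            System.out.println("  boundary " + end + " follows U+"
                    + Integer.toHexString(text.charAt(end - 1)).toUpperCase());
        }
    }

    public static void main(String[] args) {
        // Same text as in the question, written with escapes to avoid source-encoding issues.
        String plain = "\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";
        report("without leading U+FEFF", plain);
        report("with leading U+FEFF", "\ufeff" + plain);
    }
}
```

Running both variants side by side shows whether a byte-order mark at the start of the string is shifting every offset by one.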
Recommended answer
You wrote:
A basic problem here is that sentence terminators depend on context. Consider:
This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
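To make the failure mode concrete, here is a small sketch of the naive approach. The input sentence is a made-up stand-in, not the example from this answer, and the names NaiveSplitDemo and naiveSplit are hypothetical. It cuts after every '.', '!' or '?' without looking at context, so an abbreviation and a decimal point each trigger a false split.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSplitDemo {

    // Splits after every '.', '!' or '?' regardless of context.
    // The lookbehind keeps the terminator attached to the piece it ends.
    static List<String> naiveSplit(String text) {
        List<String> pieces = new ArrayList<>();
        for (String piece : text.split("(?<=[.!?])")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // One sentence for a human reader, but it contains three '.' characters.
        String text = "Dr. Smith paid $3.50 for it.";
        for (String piece : naiveSplit(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The single sentence comes back as three fragments, because only one of its three periods actually ends a sentence.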
So this is a more complex problem than it first appears. It can be approached using machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class.