java - Stanford Core NLP - understanding coreference resolution

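I am having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools.
For example, below is a sentence and the corresponding CorefChainAnnotation: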

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand what these numbers mean. Looking at the source code doesn't really help either.

Thanks

Best answer

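The first number is a cluster ID (it identifies the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():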

public String toString(){
    return position.toString();
}

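where position is a set of position pairs for the entity's mentions (use CorefChain.getCorefMentions() to get them). Here is a complete code example (in Groovy) that shows how to get from positions to tokens: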

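// Package names as in the dcoref-era CoreNLP releases; adjust for your version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        // Maps each cluster ID to its chain of coreferent mentions.
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
          CorefChain c = entry.getValue();
          println "ClusterId: " + entry.getKey();
          CorefMention cm = c.getRepresentativeMention();
          // NOTE: startIndex/endIndex are 1-based token indices within the sentence,
          // not character offsets, so subSequence on the raw text prints odd fragments.
          println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

          List<CorefMention> cms = c.getCorefMentions();
          println "Mentions:  ";
          cms.each { it ->
              print aText.subSequence(it.startIndex, it.endIndex) + "|";
          }
        }
    }
}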

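Output (the stray "s" and the other odd fragments appear because startIndex and endIndex are 1-based token indices within the sentence, not character offsets, so subSequence cuts the raw string in the wrong places):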

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention:  basic unit
Mentions:  basic unit |
ClusterId: 8
Representative Mention:  unit
Mentions:  unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
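
The fragments printed for cluster 1 ("he", "atom ", "s") are what you get when token indices are used as character offsets. Here is a minimal Java sketch of the intended lookup, which needs no CoreNLP on the classpath. It assumes the indices are 1-based token positions with an exclusive end, as CorefMention documents them. The TokenSpans class, the span helper and the hand-written token list are hypothetical, and the index pairs (1,3), (4,9) and (10,11) are inferred from the printed fragments rather than taken from a real run.

```java
import java.util.Arrays;
import java.util.List;

class TokenSpans {
    // Joins tokens [startIndex, endIndex), where both indices are 1-based
    // token positions within the sentence and endIndex is exclusive.
    static String span(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        // Hand-written tokenization of the start of the example sentence;
        // a real pipeline would supply its own tokens.
        List<String> tokens = Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists");

        System.out.println(span(tokens, 1, 3));   // The atom
        System.out.println(span(tokens, 4, 9));   // a basic unit of matter
        System.out.println(span(tokens, 10, 11)); // it
    }
}
```

Read this way, cluster 1 links "The atom", "a basic unit of matter" and "it". With the real API the same lookup would go through the sentence's token list instead of aText.subSequence.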