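I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools.
As an example, below is a sentence and the corresponding CorefChainAnnotation: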
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
I am not sure I understand the meaning of these numbers. Looking at the source code doesn't really help either.
Thank you
Best answer
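The first number is a cluster ID (a cluster groups the mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():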
    public String toString() {
        return position.toString();
    }
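where position is the set of position pairs of the entity's mentions (use CorefChain.getCorefMentions() to get them). As far as I can tell from the CorefMention source, each pair is (sentence number, mention number within that sentence). Here is a complete code example (in Groovy) that shows how to get from the positions to the tokens:

    // Package names as in the dcoref-era CoreNLP releases (assumption: adjust for your version).
    import edu.stanford.nlp.dcoref.CorefChain
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    class Example {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.put("dcoref.score", true);
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
            Annotation document = new Annotation(aText);
            pipeline.annotate(document);
            // Cluster ID -> coreference chain for that entity
            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
            println aText
            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
                CorefChain c = entry.getValue();
                println "ClusterId: " + entry.getKey();
                CorefMention cm = c.getRepresentativeMention();
                // startIndex/endIndex are 1-based token indices (endIndex exclusive);
                // they are used here as character offsets into aText, as in the original answer.
                println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
                List<CorefMention> cms = c.getCorefMentions();
                println "Mentions: ";
                cms.each { it ->
                    print aText.subSequence(it.startIndex, it.endIndex) + "|";
                }
            }
        }
    }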
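Output (the stray "s" and the other fragments most likely appear because startIndex and endIndex are token indices, while the code uses them as character offsets):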
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention: basic unit
Mentions: basic unit |
ClusterId: 8
Representative Mention: unit
Mentions: unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
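
The fragments in the output above ("he", "atom ", "s") are consistent with token indices being used as character offsets. Below is a minimal, self-contained Java sketch of that effect. It does not call CoreNLP at all: the tokenizer is a crude stand-in, the class and method names (TokenSpanDemo, tokenSpan, charSpan) are hypothetical, and the three spans (1,3), (4,9), (10,11) are my assumption about what the mentions of cluster 1 look like, inferred from the printed fragments.

```java
import java.util.Arrays;
import java.util.List;

class TokenSpanDemo {
    static final String TEXT = "The atom is a basic unit of matter, it consists of a dense "
            + "central nucleus surrounded by a cloud of negatively charged electrons.";

    // Crude stand-in for a real tokenizer (assumption): make commas and the
    // final period separate tokens, then split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.replace(",", " ,").replaceAll("\\.$", " .").split("\\s+"));
    }

    // Interprets start/end as 1-based token indices with an exclusive end.
    static String tokenSpan(List<String> tokens, int start, int end) {
        return String.join(" ", tokens.subList(start - 1, end - 1));
    }

    // Interprets the same numbers as 0-based character offsets, as the Groovy example does.
    static String charSpan(String text, int start, int end) {
        return text.subSequence(start, end).toString();
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(TEXT);
        // Assumed (start, end) values for the three mentions of cluster 1.
        int[][] spans = {{1, 3}, {4, 9}, {10, 11}};
        for (int[] s : spans) {
            System.out.println("[" + charSpan(TEXT, s[0], s[1]) + "] vs ["
                    + tokenSpan(tokens, s[0], s[1]) + "]");
        }
    }
}
```

Running it prints `[he] vs [The atom]`, `[atom ] vs [a basic unit of matter]` and `[s] vs [it]`. With the real pipeline you would look the tokens up in the sentence's token list rather than re-tokenizing the text yourself.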