Problem description
I built a Java parser using Stanford CoreNLP, and I am not getting consistent results from the CoreNLP pipeline object: the same input text yields different entity types on different calls. This looks like a bug in CoreNLP to me. Has any Stanford NLP user encountered this issue and found a workaround? Below is the service class that I instantiate and reuse.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Properties;

class StanfordNLPService {
    //private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
    private StanfordCoreNLP nerPipeline;

    /*
       Initialize the NLP pipeline used for NER. It is created once and reused.
     */
    public void init() {
        Properties nerAnnotators = new Properties();
        nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
        nerPipeline = new StanfordCoreNLP(nerAnnotators);
    }

    /**
     * @param text Text from which entities are to be extracted.
     */
    public void printEntities(String text) {
        // boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
        try {
            // Properties nerAnnotators = new Properties();
            // nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            // nerPipeline = new StanfordCoreNLP(nerAnnotators);
            Annotation document = nerPipeline.process(text);
            // A CoreMap is essentially a Map that uses class objects as keys and has values with custom types.
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // Get the entity type and offset information needed.
                    String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // NER type
                    int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class); // token start offset
                    int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class); // token end offset
                    String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class); // POS tag (not printed)
                    System.out.println("(Type:value:offset)\t" + currEntityType + ":\t" + text.substring(currStart, currEnd) + "\t" + currStart);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Discrepancy: for the same text, the entity type changes from MISC on the first call to O on the second.
Iteration 1:
(Type:value:offset) MISC: Appropriate 100
(Type:value:offset) MISC: Time 112
Iteration 2:
(Type:value:offset) O: Appropriate 100
(Type:value:offset) O: Time 112
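A discrepancy like the one above can also be detected programmatically. The following is only a minimal sketch: `TagStabilityCheck` and `unstableTokens` are hypothetical names, and the `tagger` argument is an assumed stand-in for any function that returns one entity tag per token (for example, a thin wrapper around the reused `nerPipeline`). It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical helper, not part of CoreNLP.
final class TagStabilityCheck {
    private TagStabilityCheck() {}

    /**
     * Runs the tagger twice on the same text and returns the indices of
     * tokens whose tag differs between the first and the second run.
     */
    static List<Integer> unstableTokens(Function<String, List<String>> tagger, String text) {
        List<String> first = tagger.apply(text);
        List<String> second = tagger.apply(text);
        if (first.size() != second.size()) {
            throw new IllegalStateException("token count changed between runs");
        }
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < first.size(); i++) {
            // Objects.equals tolerates null tags.
            if (!Objects.equals(first.get(i), second.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }
}
```

An empty result means both runs agreed on every token; any returned index marks a token whose type changed between calls.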
Recommended answer
I've looked over the code a bit, and here is a possible way to resolve this:
One way to solve this is to load each of the 3 serialized CRFs with useKnownLCWords set to false and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP pipeline.
Here is a command that loads a serialized CRF with useKnownLCWords set to false and then dumps it again:
java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz
Use whatever file names you like. This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory named classifiers containing the serialized CRFs. Adjust as appropriate for your setup.
This command should load the serialized CRF, override useKnownLCWords with false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz.
Then, in your original code, add:
nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
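As a rough sketch of how that property could be assembled, the helper below joins a list of model paths into the comma-separated value that "ner.model" expects. `NerPipelineConfig` and `build` are hypothetical names introduced here for illustration; the annotator list is the one from the question, and the paths passed in are assumed to be your own re-serialized CRF files.

```java
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP.
final class NerPipelineConfig {
    private NerPipelineConfig() {}

    /** Builds pipeline properties that point the ner annotator at the given model files. */
    static Properties build(List<String> modelPaths) {
        if (modelPaths.isEmpty()) {
            throw new IllegalArgumentException("at least one model path is required");
        }
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // "ner.model" takes a comma-separated list of serialized CRF paths.
        props.setProperty("ner.model", String.join(",", modelPaths));
        return props;
    }
}
```

The resulting Properties object would then be passed to the StanfordCoreNLP constructor in init(), in place of the properties built there, for example with classifiers/new.english.all.3class.distsim.crf.ser.gz as one of the paths.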
Please let me know whether this works. If it doesn't, I can look into it more deeply.