This article covers Stanford Core NLP and how to deal with inconsistent entity types; the question and accepted answer below may be a useful reference.

Problem description


I have built a Java parser using Stanford Core NLP. I am finding an issue in getting consistent results with the CoreNLP object: I get different entity types for the same input text. It seems like a bug in CoreNLP to me. I am wondering if any Stanford NLP users have encountered this issue and found a workaround for it. This is my service class, which I am instantiating and reusing.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    class StanfordNLPService {
        private StanfordCoreNLP nerPipeline;

        /*
         * Initialize the NLP pipeline for NER.
         */
        public void init() {
            Properties nerAnnotators = new Properties();
            nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            nerPipeline = new StanfordCoreNLP(nerAnnotators);
        }

        /**
         * @param text Text from which entities are to be extracted.
         */
        public void printEntities(String text) {
            try {
                Annotation document = nerPipeline.process(text);
                // A CoreMap is essentially a Map that uses class objects as keys
                // and has values with custom types.
                List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

                for (CoreMap sentence : sentences) {
                    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                        // Get the entity type and offset information needed.
                        String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // NER type
                        int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);   // token offset start
                        int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);       // token offset end
                        String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);          // POS tag
                        System.out.println("(Type:value:offset)\t" + currEntityType + ":\t"
                                + text.substring(currStart, currEnd) + "\t" + currStart);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
Discrepancy in the results: the type changed from MISC to O compared with the initial run.
Iteration 1:
(Type:value:offset) MISC:   Appropriate 100
(Type:value:offset) MISC:   Time    112
Iteration 2:
(Type:value:offset) O:  Appropriate 100
(Type:value:offset) O:  Time    112

Recommended answer


I've looked over the code some, and here is a possible way to resolve this:


What you could do to solve this is load each of the 3 serialized CRFs with useKnownLCWords set to false, and serialize them again. Then supply the newly serialized CRFs to your StanfordCoreNLP.


Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:


    java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz


Use whatever names you want, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a classifiers directory with the serialized CRFs. Change as appropriate for your setup.
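Since the answer calls for doing this to each of the 3 serialized CRFs, a small loop can generate the command for each one. The model names below (english.all.3class, english.muc.7class, english.conll.4class) are my assumption for the three default English NER models; this sketch only echoes the commands so you can check the paths before running anything:

```shell
# Assumed defaults; run from stanford-corenlp-full-2015-04-20/
for m in english.all.3class english.muc.7class english.conll.4class; do
  echo java -mx600m -cp '"*:."' edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier "classifiers/${m}.distsim.crf.ser.gz" \
    -useKnownLCWords false \
    -serializeTo "classifiers/new.${m}.distsim.crf.ser.gz"
done
```

Remove the leading echo once the paths look right, to actually run the re-serialization.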


This command should load the serialized CRF, override useKnownLCWords with false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz

Then in your original code:

    nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
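Putting it together, the init() method could wire the re-serialized models in roughly like this. The classifiers/new.*.ser.gz paths and the three default model names are my assumptions; substitute whatever paths you serialized to. The sketch only builds the Properties (constructing the pipeline is left commented out so the snippet stays self-contained):

```java
import java.util.Properties;

public class NerModelConfig {

    // Hypothetical paths: wherever you wrote the re-serialized CRFs.
    // Model names assume the three default English NER models.
    static String buildModelList() {
        return String.join(",",
                "classifiers/new.english.all.3class.distsim.crf.ser.gz",
                "classifiers/new.english.muc.7class.distsim.crf.ser.gz",
                "classifiers/new.english.conll.4class.distsim.crf.ser.gz");
    }

    public static void main(String[] args) {
        Properties nerAnnotators = new Properties();
        nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Point ner.model at the re-serialized CRFs instead of the defaults.
        nerAnnotators.put("ner.model", buildModelList());

        // nerPipeline = new StanfordCoreNLP(nerAnnotators);  // as in init() above
        System.out.println(nerAnnotators.getProperty("ner.model"));
    }
}
```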


Please let me know if this works or if it's not working, and I can look more deeply into this!

