我的代码的目的是提交文档(是pdf还是doc文件)并获取其中的所有文本。给出要由斯坦福大学nlp分析的文本。该代码工作正常。但是假设文件中有名称,例如:“ Pardeep Kumar”。收到的输出如下:


  Pardeep NNP人员
  
  Kumar NNP PERSON


但我希望它是这样的:


  帕迪普·库玛NNP PERSON


我该怎么做?如何检查两个相邻的单词,它们实际上是一个名字还是类似的名字?我怎么不让他们分成不同的词?

这是我的代码:

public class readstuff {

      public static void analyse(String data) {

            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


            // create an empty Annotation just with the given text
            Annotation document = new Annotation(data);

            // run all Annotators on this text
            pipeline.annotate(document);

            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

            // System.out.println("word"+"\t"+"POS"+"\t"+"NER");
            for (CoreMap sentence : sentences) {

                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods

                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                    if(ne.equals("PERSON") || ne.equals("LOCATION") || ne.equals("DATE") )
                    {

                        System.out.format("%32s%10s%16s",word,pos,ne);
                        System.out.println();
                    //System.out.println(word +"       \t"+pos +"\t"+ne);
                    }

                }
            }
        }

    public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{

        JFileChooser window=new JFileChooser();
        int a=window.showOpenDialog(null);

        if(a==JFileChooser.APPROVE_OPTION){
            String name=window.getSelectedFile().getName();
            String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
            String data = null;

            if(extension.equals("docx")){
                XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
                XWPFWordExtractor extract= new XWPFWordExtractor(doc);
                //System.out.println("docx file reading...");
                data=extract.getText();
                //extract.getMetadataTextExtractor();
            }
            else if(extension.equals("doc")){
                HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
                WordExtractor extract= new WordExtractor(doc);
                //System.out.println("doc file reading...");
                data=extract.getText();
            }
            else if(extension.equals("pdf")){
                //System.out.println(window.getSelectedFile());
                PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
                int n=reader.getNumberOfPages();
                for(int i=1;i<n;i++)
                {
                    //System.out.println(data);
                data=data+PdfTextExtractor.getTextFromPage(reader,i );
                }
            }
            else{
                System.out.println("format not supported");
            }

        analyse(data);
        }
    }



}

最佳答案

您要使用entitymentions注释器。

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        System.out.println(entityMention);
        System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
      }
    }
  }
}

关于java - 斯坦福大学nlp API for Java:如何获得完整的名称而不是部分,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/46787542/

10-09 15:55