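The purpose of my code is to take a submitted document (either a PDF or a DOC file) and extract all the text in it. That text is then passed to Stanford NLP for analysis. The code works fine. But suppose the file contains a name, for example "Pardeep Kumar". The output I get is:
Pardeep NNP PERSON
Kumar NNP PERSON
But I want it to look like this:
Pardeep Kumar NNP PERSON
How can I do that? How can I check whether two adjacent words actually form a single name or a similar entity? How do I keep them from being split into separate words?
Here is my code: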
// Imports the snippet needs. The iText 5 package names (com.itextpdf) are an
// assumption: the question does not say which PDF library version is used.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import javax.swing.JFileChooser;
import javax.xml.transform.TransformerConfigurationException;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

public class readstuff {
    public static void analyse(String data) {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(data);
        // run all Annotators on this text
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        // System.out.println("word"+"\t"+"POS"+"\t"+"NER");
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                // constant-first equals() stays safe even if the label is null
                if ("PERSON".equals(ne) || "LOCATION".equals(ne) || "DATE".equals(ne)) {
                    System.out.format("%32s%10s%16s", word, pos, ne);
                    System.out.println();
                    //System.out.println(word +" \t"+pos +"\t"+ne);
                }
            }
        }
    }

    public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException {
        JFileChooser window = new JFileChooser();
        int a = window.showOpenDialog(null);
        if (a == JFileChooser.APPROVE_OPTION) {
            String name = window.getSelectedFile().getName();
            String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
            String data = null;
            if (extension.equals("docx")) {
                XWPFDocument doc = new XWPFDocument(new FileInputStream(window.getSelectedFile()));
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                //System.out.println("docx file reading...");
                data = extract.getText();
                //extract.getMetadataTextExtractor();
            } else if (extension.equals("doc")) {
                HWPFDocument doc = new HWPFDocument(new FileInputStream(window.getSelectedFile()));
                WordExtractor extract = new WordExtractor(doc);
                //System.out.println("doc file reading...");
                data = extract.getText();
            } else if (extension.equals("pdf")) {
                //System.out.println(window.getSelectedFile());
                PdfReader reader = new PdfReader(new FileInputStream(window.getSelectedFile()));
                int n = reader.getNumberOfPages();
                // accumulate into a StringBuilder so the text does not start with the literal "null"
                StringBuilder text = new StringBuilder();
                // pages are numbered from 1, so the loop must include page n
                for (int i = 1; i <= n; i++) {
                    text.append(PdfTextExtractor.getTextFromPage(reader, i));
                }
                data = text.toString();
            } else {
                System.out.println("format not supported");
                // nothing was extracted, so skip the analysis
                return;
            }
            analyse(data);
        }
    }
}
Best answer
You want to use the entitymentions annotator.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import java.util.*;

public class EntityMentionsExample {
    public static void main(String[] args) {
        Annotation document =
            new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // each entity mention covers the whole multi-token entity, e.g. "John Smith"
            for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
                System.out.println(entityMention);
                System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
            }
        }
    }
}
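To make the idea behind entity mentions concrete, here is a minimal sketch in plain Java that joins adjacent tokens sharing the same NER tag. It is only an illustration of the grouping idea, not how CoreNLP implements the entitymentions annotator. The class `MentionGrouping`, the nested `Mention` type, and the method `groupMentions` are hypothetical names made up for this example, and the input is assumed to be two parallel lists (words and their NER tags) such as the ones printed by the question's token loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of CoreNLP.
public class MentionGrouping {

    /** A multi-token entity: the joined token text plus the shared NER tag. */
    public static final class Mention {
        public final String text;
        public final String type;

        Mention(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "\t" + type;
        }
    }

    /**
     * Joins runs of adjacent tokens that carry the same NER tag.
     * Tokens tagged "O" (the label for "not an entity") are skipped.
     */
    public static List<Mention> groupMentions(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<Mention> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // the tag changed, so the previous run (if it was an entity) is complete
                if (!currentTag.equals("O")) {
                    mentions.add(new Mention(current.toString(), currentTag));
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // flush the last run if the input ended inside an entity
        if (!currentTag.equals("O")) {
            mentions.add(new Mention(current.toString(), currentTag));
        }
        return mentions;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Pardeep", "Kumar", "visited", "Los", "Angeles", "on", "Tuesday");
        List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "LOCATION", "LOCATION", "O", "DATE");
        for (Mention mention : groupMentions(words, tags)) {
            System.out.println(mention);
        }
    }
}
```

One limitation of this sketch: two different entities of the same type that sit directly next to each other are merged into one mention. For real use, the entitymentions annotator shown above is the better choice.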
Regarding "java - Stanford NLP API for Java: how to get the full name instead of its parts", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46787542/