I downloaded NER 3.4.1 (released 08-27-14) to train it on domain-specific articles (highly technical content).
I would like to know the following:
(1) Is it possible to output offsets for each extracted entity?
(2) Is it possible to output a confidence score for each extracted entity?
(3) I have trained multiple CRF models with NER 3.4.1, but the Stanford GUI
appears to display only one CRF model at a time. Is there a way to
display multiple CRF models without writing a wrapper?
Accepted answer
(1) Yes, absolutely. Tokens (class CoreLabel) store begin and end character offsets for each token. The easiest way to get the offsets of whole entities is the classifyToCharacterOffsets() method. See the extended demo below.
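For just this call, here is a minimal standalone sketch (the classifier path and input sentence are placeholders; the same pattern appears in the full demo further down):

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

public class EntityOffsetsSketch {
  public static void main(String[] args) throws Exception {
    AbstractSequenceClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");
    String text = "I go to school at Stanford University, which is located in California.";
    // classifyToCharacterOffsets() returns one Triple per extracted entity:
    // first() is the label, second()/third() are the begin/end character offsets into the input String.
    for (Triple<String, Integer, Integer> entity : classifier.classifyToCharacterOffsets(text)) {
      System.out.println(entity.first() + " [" + entity.second() + "," + entity.third() + "): "
          + text.substring(entity.second(), entity.third()));
    }
  }
}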
(2) Yes, but there is some subtlety in interpreting these scores. In particular, much of the uncertainty is often not about whether three words should be a PERSON or an ORGANIZATION, but about whether the organization name should be two words long or three, and so on. What is really happening is that the NER classifier assigns probabilities to labels and label sequences at each position. There are several methods you can use to query those scores; I illustrate a couple of simple ones below, expressed as probabilities. If you want to go further and know how to interpret a CRF, you can get the CliqueTree for a sentence and use it to do whatever you want. In practice, something that is often easier to work with than any of that is simply a k-best list of labelings, each with a whole-sentence probability attached. I show that below as well.
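The extended demo below queries these scores through printProbs(), printFirstOrderProbs(), and classifyAndWriteAnswersKBest(), all of which write to standard output. If you want the k-best list programmatically rather than printed, a sketch along these lines should work, assuming your version exposes classifyKBest() (the method behind classifyAndWriteAnswersKBest()); the classifier path and input sentence are placeholders:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.stats.Counter;
import java.util.List;

public class KBestSketch {
  public static void main(String[] args) throws Exception {
    AbstractSequenceClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");
    String text = "I go to school at Stanford University, which is located in California.";
    for (List<CoreLabel> sentence : classifier.classify(text)) {
      // Each key is one complete labeling of the sentence; the counter value is the score
      // the model assigns to that labeling (interpreted here as a whole-sentence probability).
      // Note: the entries are not necessarily iterated in score order.
      Counter<List<CoreLabel>> kBest =
          classifier.classifyKBest(sentence, CoreAnnotations.AnswerAnnotation.class, 5);
      for (List<CoreLabel> labeling : kBest.keySet()) {
        StringBuilder sb = new StringBuilder();
        for (CoreLabel token : labeling) {
          sb.append(token.word()).append('/')
            .append(token.get(CoreAnnotations.AnswerAnnotation.class)).append(' ');
        }
        System.out.println(kBest.getCount(labeling) + "\t" + sb);
      }
    }
  }
}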
(3) Sorry, not at present. The GUI is only meant as a simple demo. You are welcome to extend its functionality, and code contributions are happily accepted!
Below is an extended version of the NERDemo.java in the distribution, illustrating some of these options.
package edu.stanford.nlp.ie.demo;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.*;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.sequences.DocumentReaderAndWriter;
import edu.stanford.nlp.sequences.PlainTextDocumentReaderAndWriter;
import edu.stanford.nlp.util.Triple;
import java.util.List;
/** This is a demo of calling CRFClassifier programmatically.
 * <p>
 * Usage: {@code java -mx400m -cp "stanford-ner.jar:." NERDemo [serializedClassifier [fileName]] }
 * <p>
 * If arguments aren't specified, they default to
 * classifiers/english.all.3class.distsim.crf.ser.gz and some hardcoded sample text.
 * <p>
 * To use CRFClassifier from the command line:
 * </p><blockquote>
 * {@code java -mx400m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier [classifier] -textFile [file] }
 * </blockquote><p>
 * Or if the file is already tokenized and one word per line, perhaps in
 * a tab-separated value format with extra columns for part-of-speech tag,
 * etc., use the version below (note the 's' instead of the 'x'):
 * </p><blockquote>
 * {@code java -mx400m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier [classifier] -testFile [file] }
 * </blockquote>
 *
 * @author Jenny Finkel
 * @author Christopher Manning
 */
public class NERDemo {

  public static void main(String[] args) throws Exception {

    String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
    if (args.length > 0) {
      serializedClassifier = args[0];
    }
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(serializedClassifier);

    /* For either a file to annotate or for the hardcoded text example, this
       demo file shows several ways to process the input, for teaching purposes.
    */

    if (args.length > 1) {

      /* For the file, it shows (1) how to run NER on a String, (2) how
         to get the entities in the String with character offsets, and
         (3) how to run NER on a whole file (without loading it into a String).
      */

      String fileContents = IOUtils.slurpFile(args[1]);
      List<List<CoreLabel>> out = classifier.classify(fileContents);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      out = classifier.classifyFile(args[1]);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      List<Triple<String, Integer, Integer>> list = classifier.classifyToCharacterOffsets(fileContents);
      for (Triple<String, Integer, Integer> item : list) {
        System.out.println(item.first() + ": " + fileContents.substring(item.second(), item.third()));
      }

      System.out.println("---");
      System.out.println("Ten best");
      DocumentReaderAndWriter<CoreLabel> readerAndWriter = classifier.makePlainTextReaderAndWriter();
      classifier.classifyAndWriteAnswersKBest(args[1], 10, readerAndWriter);

      System.out.println("---");
      System.out.println("Probabilities");
      classifier.printProbs(args[1], readerAndWriter);

      System.out.println("---");
      System.out.println("First Order Clique Probabilities");
      ((CRFClassifier) classifier).printFirstOrderProbs(args[1], readerAndWriter);

    } else {

      /* For the hard-coded String, it shows how to run it on a single
         sentence, and how to do this and produce several formats, including
         slash tags and an inline XML output format. It also shows the full
         contents of the {@code CoreLabel}s that are constructed by the
         classifier. And it shows getting out the probabilities of different
         assignments and an n-best list of classifications with probabilities.
      */

      String[] example = {"Good afternoon Rajat Raina, how are you today?",
                          "I go to school at Stanford University, which is located in California." };
      for (String str : example) {
        System.out.println(classifier.classifyToString(str));
      }
      System.out.println("---");

      for (String str : example) {
        // This one puts in spaces and newlines between tokens, so just print not println.
        System.out.print(classifier.classifyToString(str, "slashTags", false));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyWithInlineXML(str));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyToString(str, "xml", true));
      }
      System.out.println("---");

      int i = 0;
      for (String str : example) {
        for (List<CoreLabel> lcl : classifier.classify(str)) {
          for (CoreLabel cl : lcl) {
            System.out.print(i++ + ": ");
            System.out.println(cl.toShorterString());
          }
        }
      }
      System.out.println("---");
    }
  }

}
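To exercise the extra code paths, run the class as shown in the Javadoc header (adjusting the class name to the package declared at the top, edu.stanford.nlp.ie.demo.NERDemo), passing a serialized classifier and a file name. With a second argument you get the slash-tag output, the per-entity character offsets, the 10-best labelings, and the token- and clique-level probabilities for that file; without it, the demo runs the various output formats on the two hard-coded sentences.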
Related question on Stack Overflow (java - Stanford NER 3.4.1 questions): https://stackoverflow.com/questions/27136472/