How to modify TokenRegex rules in StanfordNLP?

Question

I have a rule file for TokensRegex as follows:

$EDU_FIRST_KEYWORD = (/Education/|/Course[s]?/|/Educational/|/Academic/|/Education/ /and/?|/Professional/|/Certification[s]?/ /and/?)
$EDU_LAST_KEYWORD = (/Background/|/Qualification[s]?/|/Training[s]?/|/Detail[s]?/|/Record[s]?/)

tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ ruleType: "tokens", pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?), result: "EDUCATION"}

I want to match EDU_FIRST_KEYWORD followed by EDU_LAST_KEYWORD. If the two parts do not match together, the rule should then check whether EDU_FIRST_KEYWORD alone matches the given string.

E.g. 1. Training & Courses

Matched Output: EDUCATION (as it matched Courses, which should not happen)

Expected Output: No output

This is because the rule matches neither the first part of the string nor the complete string.

2. Education Background

Matched Output: EDUCATION

Expected Output: EDUCATION

I tried changing pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?) to pattern: ( $EDU_FIRST_KEYWORD + $EDU_LAST_KEYWORD ?), but it does not help.

I went through the StanfordNLP TokensRegex documentation, but could not work out how to achieve this. Can somebody help me change the rule file? Thanks in advance.

Recommended answer

You want to use the matches() method of TokenSequenceMatcher so that your rule is run against the entire String.

If you use find(), it searches the string for any stretch of tokens that matches the pattern, which is why "Courses" alone is picked up in "Training & Courses". If you use matches(), it checks whether the entire string matches the pattern.
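The same distinction exists in java.util.regex.Matcher, so it can be tried out without CoreNLP on the classpath. The sketch below is only an analogy: it works on characters rather than tokens, and the class name FindVersusMatches, the helper methods and the character-level pattern are made up for illustration and are not part of the TokensRegex API.

```java
import java.util.regex.Pattern;

public class FindVersusMatches {

  // Character-level stand-in for the token pattern "$EDU_WORD_ONE $EDU_WORD_TWO?".
  static final Pattern EDU = Pattern.compile(
      "(Education|Educational|Courses)( (Background|Qualification))?");

  // True if the pattern occurs anywhere inside the text.
  static boolean foundAnywhere(String text) {
    return EDU.matcher(text).find();
  }

  // True only if the pattern covers the text from start to end.
  static boolean matchesWhole(String text) {
    return EDU.matcher(text).matches();
  }

  public static void main(String[] args) {
    for (String text : new String[] {"Training & Courses", "Education Background"}) {
      System.out.println(text + " -> find: " + foundAnywhere(text)
          + ", matches: " + matchesWhole(text));
    }
  }
}
```

For "Training & Courses", find() succeeds because "Courses" appears inside the text, while matches() fails because "Training & " is not covered by the pattern. For "Education Background" both succeed.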

At this time I am not sure whether the TokensRegexAnnotator can perform full string matches on sentences, so you probably need to use some code like this:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokensRegexExactMatch {

  public static void main(String[] args) {
    // Only tokenization is needed to run a TokensRegex pattern over the text.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Training & Courses");
    pipeline.annotate(annotation);

    // Bind shortened versions of the two keyword groups as macros.
    Env env = TokenSequencePattern.getNewEnv();
    env.bind("$EDU_WORD_ONE", "/Education|Educational|Courses/");
    env.bind("$EDU_WORD_TWO", "/Background|Qualification/");
    TokenSequencePattern pattern = TokenSequencePattern.compile(env, "$EDU_WORD_ONE $EDU_WORD_TWO?");

    List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);

    // matches() succeeds only if the whole token sequence fits the pattern.
    // Use while (matcher.find()) instead to report partial matches such as "Courses".
    if (matcher.matches()) {
      System.err.println("---");
      String matchedString = matcher.group();
      // The wildcard type is assignable from whatever list type groupNodes() returns.
      List<? extends CoreMap> matchedTokens = matcher.groupNodes();
      System.err.println(matchedString);
      System.err.println(matchedTokens);
    } else {
      System.err.println("no match");
    }
  }
}

