Problem description
I have the following rule file for TokensRegex:
$EDU_FIRST_KEYWORD = (/Education/|/Course[s]?/|/Educational/|/Academic/|/Education/ /and/?|/Professional/|/Certification[s]?/ /and/?)
$EDU_LAST_KEYWORD = (/Background/|/Qualification[s]?/|/Training[s]?/|/Detail[s]?/|/Record[s]?/)
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ ruleType: "tokens", pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?), result: "EDUCATION"}
I want to match EDU_FIRST_KEYWORD followed by EDU_LAST_KEYWORD. If the two parts do not match together, then check whether EDU_FIRST_KEYWORD alone matches the given string.
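E.g.
1. Training & Courses
Matched Output: EDUCATION (as it matched Courses, which should not happen)
Expected Output: no output
This is because the pattern matches neither the first part of the string nor the complete string.
2. Education Background
Matched Output: EDUCATION
Expected Output: EDUCATION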
I tried changing pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?)
to pattern: ( $EDU_FIRST_KEYWORD + $EDU_LAST_KEYWORD ?)
but it does not help.
I went through the Stanford NLP TokensRegex documentation, but could not work out how to achieve this. Can somebody help me change the rule file? Thanks in advance.
Recommended answer
You want to use the matches() method of TokenSequenceMatcher so that your rule is checked against the entire string.
If you use find(), it searches the string for any stretch of tokens that fits the pattern, which is why "Courses" alone is reported for "Training & Courses". If you use matches(), it succeeds only when the entire string fits the pattern.
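The same find()/matches() distinction exists in java.util.regex, so it can be tried out without CoreNLP. The sketch below is only an analogy: it works on plain strings rather than tokens, and the class name FindVsMatches and the simplified keyword list are made up for illustration.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // Plain-string stand-in (an assumption, not the TokensRegex rule itself)
    // for a first keyword followed by an optional last keyword.
    private static final Pattern EDU = Pattern.compile(
        "(Education|Educational|Courses)( (Background|Qualification))?");

    /** True if the pattern occurs anywhere inside the text. */
    public static boolean findsAnywhere(String text) {
        return EDU.matcher(text).find();
    }

    /** True only if the whole text, from start to end, fits the pattern. */
    public static boolean matchesWhole(String text) {
        return EDU.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String text : new String[] {"Training & Courses", "Education Background"}) {
            System.out.println(text + ": find=" + findsAnywhere(text)
                + ", matches=" + matchesWhole(text));
        }
    }
}
```

For "Training & Courses", find() succeeds because "Courses" occurs inside the string, while matches() fails because the string does not start with a first keyword. For "Education Background", both succeed.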
I am not sure whether the TokensRegexAnnotator can currently perform full-string matches on sentences, so you probably need to use some code like this:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.pipeline.*;
import java.util.*;

public class TokensRegexExactMatch {

  public static void main(String[] args) {
    // Only tokenization is needed to run a token pattern.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Training & Courses");
    pipeline.annotate(annotation);

    // Bind the keyword groups (shortened here) and compile the pattern.
    Env env = TokenSequencePattern.getNewEnv();
    env.bind("$EDU_WORD_ONE", "/Education|Educational|Courses/");
    env.bind("$EDU_WORD_TWO", "/Background|Qualification/");
    TokenSequencePattern pattern = TokenSequencePattern.compile(env, "$EDU_WORD_ONE $EDU_WORD_TWO?");

    List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);

    // matches() requires the whole token sequence to fit the pattern.
    // Using while (matcher.find()) here instead would also report partial
    // matches such as "Courses".
    if (matcher.matches()) {
      System.err.println("---");
      String matchedString = matcher.group();
      List<? extends CoreMap> matchedTokens = matcher.groupNodes();
      System.err.println(matchedString);
      System.err.println(matchedTokens);
    }
  }
}