Problem description
I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.
As an example, let's say that my custom entities are animals (labeled with token.ent_type_ = "animal"):
["cat", "dog", "artic fox"]
(note that the last entity has two words).
Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:
[{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]
For example, given the following text:
There is no cat in the house and no artic fox in the basement
I can successfully capture "no cat" and "no artic", but the last match is incorrect, as the full match should be "no artic fox". This is due to the OP: '+' in the pattern, which matches a single custom entity instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?
Recommended answer
One solution is to use the doc.retokenize() context manager to merge the individual tokens of each multi-token entity into a single token:
import spacy
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler

# Written against the spaCy v2 API (EntityRuler(nlp), nlp.add_pipe(ruler),
# matcher.add(name, None, pattern)).
nlp = spacy.load('en_core_web_sm')

# Label each animal name as an "animal" entity.
animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "animal", "pattern": a} for a in animal])
nlp.add_pipe(ruler)

doc = nlp("There is no cat in the house and no artic fox in the basement")

# Merge every entity span into one token, so "artic fox" becomes a single token.
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])

matcher = Matcher(nlp.vocab)
# 'OP' is a token-level key, so it sits beside 'ENT_TYPE' rather than inside it.
pattern = [{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
matcher.add('negated animal', None, pattern)

# Print the text of every matched span.
for match_id, start, end in matcher(doc):
    print(doc[start:end])
With each entity merged into a single token, the output should now be:
no cat
no artic fox
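If you would rather leave the tokenization untouched, another option is to let the matcher return every candidate and then drop any match that overlaps a longer one. The sketch below shows only that filtering step, in plain Python, on (start, end) token offsets. The helper name keep_longest is hypothetical, and the candidate offsets are written by hand for the example sentence (assuming whitespace tokenization and a matcher that returns both the shorter and the longer candidate); they are not the result of running spaCy.

```python
def keep_longest(matches):
    """Return non-overlapping (start, end) matches, preferring longer ones."""
    # Visit the longest matches first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


tokens = "There is no cat in the house and no artic fox in the basement".split()
# Hand-written candidates: "no cat", "no artic", "no artic fox".
candidates = [(2, 4), (8, 10), (8, 11)]
for start, end in keep_longest(candidates):
    print(" ".join(tokens[start:end]))
```

spaCy ships spacy.util.filter_spans, which applies the same longest-first rule to Span objects, so in practice you could build spans from the matcher results and pass them to it instead of writing a helper.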