spaCy: Matcher with entities spanning more than a single token

Question

I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.

As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal")

["cat", "dog", "artic fox"] (note that the last entity has two words).

Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:

[{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]

For example, given the following text:

There is no cat in the house and no artic fox in the basement

I can successfully capture no cat and no artic, but the last match is incorrect as the full match should be no artic fox. This is due to the OP: '+' in the pattern that matches a single custom entity instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?

Recommended answer

One solution is to use the Doc.retokenize method to merge the individual tokens of each multi-token entity into a single token:

import spacy
from spacy.matcher import Matcher

# Written against the spaCy v3 API; requires the en_core_web_sm model.
# (spaCy v2 used EntityRuler(nlp) and matcher.add(name, None, pattern).)
nlp = spacy.load("en_core_web_sm")

# Label each animal name (including the two-word "artic fox") as an "animal" entity.
animals = ["cat", "dog", "artic fox"]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "animal", "pattern": a} for a in animals])

doc = nlp("There is no cat in the house and no artic fox in the basement")

# Merge every entity span into a single token.
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

# Each entity is now one token, so the pattern needs no quantifier.
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "no"}, {"ENT_TYPE": "animal"}]
matcher.add("negated animal", [pattern])
matches = matcher(doc)

# Print the text of every matched span.
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

The output is now:

no cat
no artic fox
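
The question also asks how to prefer longer matches over shorter ones. Merging entities sidesteps that, but the preference itself can be expressed as a post-processing step. The sketch below is plain Python and only illustrates the idea on (start, end) token offsets; the helper name keep_longest and the raw match list are hypothetical, chosen for this example. In spaCy itself, spacy.util.filter_spans applies the same kind of longest-first filtering to Span objects, and in spaCy v3 Matcher.add accepts greedy="LONGEST"; both are mentioned as pointers rather than as what the answer above uses.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (start, end) token offsets, end exclusive


def keep_longest(matches: List[Match]) -> List[Match]:
    """Keep the longest matches, dropping any that overlap a kept one."""
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept: List[Match] = []
    taken = set()  # token indices already covered by a kept match
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


tokens = "There is no cat in the house and no artic fox in the basement".split()
# Hypothetical raw matcher output: "no cat", "no artic", "no artic fox".
raw = [(2, 4), (8, 10), (8, 11)]
for start, end in keep_longest(raw):
    print(" ".join(tokens[start:end]))
```

Filtering after matching keeps the document's tokenization untouched, which may matter if later pipeline steps expect the original tokens.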