Question
I want the output to be ["good customer service","great ambience"],
but I am getting ["good customer","good customer service","great ambience"],
because the pattern also matches "good customer", and that phrase doesn't make any sense on its own. How can I remove these kinds of duplicates?
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)
# Create a pattern: an adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])  # spaCy v3 API; in v2 use matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
Answer
You can post-process the matches by grouping the tuples by start index and keeping only the one with the largest end index:
from itertools import groupby
# ...
matches = matcher(doc)
results = [max(list(group), key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']
groupby(matches, lambda prop: prop[1]) groups the matches by their start index, here producing two groups: [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and [(5488211386492616699, 4, 6)]. max(list(group), key=lambda x: x[2]) then keeps the match in each group whose end index (the third tuple value) is largest, so the shorter "good customer" match (end index 2) is discarded in favor of "good customer service" (end index 3). Note that groupby only groups consecutive items with equal keys, which works here because the Matcher returns matches ordered by start index.
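The grouping logic can be verified without loading a spaCy model by running it on raw (match_id, start, end) tuples in the same format the Matcher returns (a minimal sketch; the tuples mirror the ones shown above):

```python
from itertools import groupby

# Match tuples in spaCy's (match_id, start, end) format, already
# sorted by start index, as the Matcher returns them.
matches = [
    (5488211386492616699, 0, 2),  # "good customer"
    (5488211386492616699, 0, 3),  # "good customer service"
    (5488211386492616699, 4, 6),  # "great ambience"
]

# Group by start index, then keep only the longest match per group.
results = [
    max(list(group), key=lambda x: x[2])
    for key, group in groupby(matches, lambda prop: prop[1])
]

print(results)
# => [(5488211386492616699, 0, 3), (5488211386492616699, 4, 6)]
```

The two overlapping matches starting at token 0 collapse into the single longest one, which is exactly the de-duplication the answer describes.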
This concludes the article on the Matcher returning some duplicates; hopefully the answer above helps.