Problem description
I have a data frame in which I need to find all rows that match any of the entries in terms. My code is:
import re
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']
# create the data frame
df = pd.DataFrame({'Match_text': texts})
# create the pattern
pat = r'\b(?:{})\b'.format('|'.join(terms))
# use str.contains to keep only the rows that match
df = df[df['Match_text'].str.contains(pat)]
# compile the pattern
p = re.compile(pat)
# find all matches in each remaining row
results = [p.findall(text) for text in df.Match_text.tolist()]
df['results'] = results
The output is:
Match_text results
0 foo abc [foo]
3 baz 45 [baz]
6 foo baz [foo, baz]
Here, foo baz should also match row 6, along with foo and baz. I need to get the rows for all matches that are in terms.
Recommended answer
Longer alternatives should come before shorter ones, because regex alternation tries the alternatives from left to right and stops at the first one that matches. So you need to sort the keywords by length in descending order:
pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))
The resulting pattern is \b(?:foo baz|foo|baz)\b. At each position it first tries to match foo baz, then foo, then baz. If foo baz is found, that match is returned and the search for the next match resumes from the end of it, so the foo and baz already consumed by that match are not matched again.
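To see the effect of the ordering, here is a minimal sketch that applies the sorted pattern to the texts from the question using only the re module. The names ordered and matched, and the use of re.escape, are additions for illustration and are not part of the original answer; re.escape is only a precaution in case a term ever contains regex metacharacters.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']

# Longest terms first, so 'foo baz' is tried before 'foo' and 'baz'.
ordered = sorted(terms, key=len, reverse=True)

# re.escape is an added precaution (assumption: terms are literal strings,
# not regex fragments).
pat = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, ordered))))

# Keep only the texts that produce at least one match.
matched = {}
for text in texts:
    found = pat.findall(text)
    if found:
        matched[text] = found

for text, found in matched.items():
    print(text, found)
```

With this ordering, the last text yields a single match, foo baz, instead of the separate foo and baz matches shown in the question's output.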