How to get all matches with str.contains in a Python regex?

Problem description

I have a data frame in which I need to find every row that matches any of the entries in terms. My code is:

import re
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']
# create df
df = pd.DataFrame({'Match_text': texts})
# create pattern
pat = r'\b(?:{})\b'.format('|'.join(terms))
# use str.contains to keep only the matching rows (copy to avoid SettingWithCopyWarning)
df = df[df['Match_text'].str.contains(pat)].copy()

# compile the pattern
p = re.compile(pat)

# collect every match of the pattern in each remaining row
results = [p.findall(text) for text in df.Match_text.tolist()]
df['results'] = results

The output is:

Match_text  results
0   foo abc     [foo]
3   baz 45      [baz]
6   foo baz     [foo, baz]

Here, the term foo baz should also be reported for row 6, alongside foo and baz. I need each row to list all of the entries in terms that match it.

Recommended answer

Longer alternatives should come before shorter ones, so sort the keywords by length in descending order:

pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))
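
To see what the reordering changes, here is a minimal sketch that runs both versions of the pattern over the question's sample texts. It uses only the re module and a plain dict instead of a DataFrame; the helper name find_matches is made up for this illustration and is not part of the original answer.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']

# Pattern as written in the question (terms in their original order)
unsorted_pat = r'\b(?:{})\b'.format('|'.join(terms))
# Pattern from the answer (longest terms first)
sorted_pat = r'\b(?:{})\b'.format('|'.join(sorted(terms, key=len, reverse=True)))

def find_matches(pat, texts):
    """Hypothetical helper: return {text: matches} for texts with at least one match."""
    compiled = re.compile(pat)
    found = {}
    for text in texts:
        matches = compiled.findall(text)
        if matches:
            found[text] = matches
    return found

before = find_matches(unsorted_pat, texts)
after = find_matches(sorted_pat, texts)

print(sorted_pat)
print(before['foo baz'])  # matches found with the question's ordering
print(after['foo baz'])   # matches found with the longest-first ordering
```

If the terms could contain regex metacharacters, wrapping each one in re.escape before joining is a common precaution; the sample terms here do not need it.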

The resulting pattern is \b(?:foo baz|foo|baz)\b. At each position the engine tries foo baz first, then foo, then baz. Once foo baz matches, that match is returned and the search for the next match resumes from the end of it, so the foo and baz inside an already matched foo baz are not reported again.
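
The left-to-right behaviour can be checked on a single string. The sketch below also includes matching_terms, a hypothetical helper that is not part of the original answer: because one alternation never reports overlapping matches, it tests each term separately, which may be closer to what the question asks for if foo, baz, and foo baz should all be listed for the same row.

```python
import re

text = 'foo baz'

# Alternatives are tried left to right; the first one that succeeds is returned.
first = re.search(r'\b(?:foo|foo baz)\b', text).group()
second = re.search(r'\b(?:foo baz|foo)\b', text).group()
print(first)
print(second)

def matching_terms(text, terms):
    """Hypothetical helper: list every term that occurs in text as whole words."""
    return [t for t in terms
            if re.search(r'\b{}\b'.format(re.escape(t)), text)]

all_terms = matching_terms(text, ['foo', 'baz', 'foo baz'])
print(all_terms)
```

Testing terms one at a time costs one search per term per row, so this is a sketch for small term lists rather than a tuned solution.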

Remember that the regex engine is eager: it returns the first alternative that matches, not the longest one.

10-23 09:48