Problem description
I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:
str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)
which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"]
However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".
This is why I tried to solve this problem with regular expressions, like this:
import regex  # third-party module; the standard-library re offers the same calls used here
str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if regex.search(r"\b" + regex.escape(x) + r"\b", str,
regex.IGNORECASE) is not None])
print(temp)
which results in ["TEST", "MATCH", "MULTIPLE"], the correct solution. Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.
I would like to know if there is a way to either make the first solution work or make the second solution run faster. Thanks in advance.
value can also contain numbers, or a short phrase like "test1 test2".
Recommended answer
It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:
- Generate a single pattern matching all values. This way you would only need to search the string once (instead of once per value).
- Skip escaping values unless they contain special characters (like '^' or '*').
- Assign the result directly to temp, avoiding unnecessary copying with temp.extend().
import regex
# 'str' is a built-in name, so use 'string' instead
string = 'This is a Test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))
# unique matches, lowercased
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))
# arrange the results as they appear in `values`
temp = [x.upper() for x in values if x.lower() in matches]
print(temp) # ['TEST', 'MULTIPLE', 'MATCH']
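Since the check has to run over roughly 1 million strings, it may also help to build and compile the combined pattern once and reuse it for every string. The following is only a sketch under a few assumptions: it uses the standard-library re module instead of regex, the helper names build_matcher and match_values are hypothetical, and the sample strings are made up for illustration. Sorting the values longest-first is an extra precaution, not part of the answer above.

```python
import re

def build_matcher(values):
    """Compile one case-insensitive, whole-word pattern covering all values."""
    # Longest first, so a phrase is tried before a shorter value it starts with.
    ordered = sorted(values, key=len, reverse=True)
    alternation = '|'.join(re.escape(v) for v in ordered)
    return re.compile(r'\b(?:{})\b'.format(alternation), re.IGNORECASE)

def match_values(matcher, values, text):
    """Return the upper-cased values found in text, in the order of `values`."""
    found = {m.lower() for m in matcher.findall(text)}
    return [v.upper() for v in values if v.lower() in found]

values = ['test', 'match', 'multiple', 'ring', 'test1 test2']
matcher = build_matcher(values)  # compiled once, reused for every string

# Made-up sample input standing in for the real data set.
strings = [
    'This is a test string from which I want to match multiple substrings',
    'Nothing relevant here, only a string',
    'A phrase such as Test1 Test2 is matched as a whole',
]
results = [match_values(matcher, values, s) for s in strings]
for result in results:
    print(result)
```

Two caveats with this sketch: findall() returns non-overlapping matches, so a value that only occurs inside a longer matched phrase is not reported separately, and \b only behaves as a word boundary when a value starts and ends with a word character.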