Python: Matching multiple substrings in a string

Question

I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)

which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"]

However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".

This is why I tried to solve this problem with regular expressions, like this:

import regex

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value
             if regex.search(r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE) is not None])
print(temp)

which results in ["TEST", "MATCH", "MULTIPLE"], the correct solution. Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.

I would like to know if there is a way to either make the first solution work or make the second solution run faster. Thanks in advance.

value can also contain numbers, or a short phrase like "test1 test2"

Recommended answer

It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:

Generate a single pattern matching all values. This way you would only need to search the string once (instead of once per value).
Skip escaping values unless they contain special characters (like '^' or '*').
Assign the result directly to temp, avoiding unnecessary copying with temp.extend().

import regex

# 'str' is a built-in name, so use 'string' instead
string = 'This is a Test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))

# unique matches, lowercased
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))

# arrange the results as they appear in `values`
temp = [x.upper() for x in values if x.lower() in matches]
print(temp)  # ['TEST', 'MULTIPLE', 'MATCH']
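
Since the question involves roughly a million strings, the same idea can be taken one step further by compiling the combined pattern once and reusing it for every string. The following is only a sketch under a few assumptions: it uses the standard-library `re` module instead of the third-party `regex` module (the calls used here exist in both), and the helper names `build_matcher` and `find_values` as well as the sample strings are made up for illustration.

```python
import re

def build_matcher(values):
    """Compile one case-insensitive, whole-word pattern covering every value."""
    # Longest values first, so a phrase such as 'test1 test2' is tried before 'test1'.
    ordered = sorted(values, key=len, reverse=True)
    alternation = '|'.join(map(re.escape, ordered))
    return re.compile(r'\b(?:{})\b'.format(alternation), re.IGNORECASE)

def find_values(pattern, values, text):
    """Return the upper-cased values found in text, in the order of `values`."""
    found = {m.lower() for m in pattern.findall(text)}
    return [v.upper() for v in values if v.lower() in found]

values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = build_matcher(values)  # compiled once, outside the loop

# Hypothetical sample data standing in for the real list of strings
strings = [
    'This is a Test string from which I want to match multiple substrings',
    'Nothing relevant here',
    'A ring and test2',
]
results = [find_values(pattern, values, s) for s in strings]
print(results)
```

Note that `findall` returns non-overlapping matches, so a value that occurs only inside a longer matched phrase is not reported separately. Whether this is fast enough depends on the data, so it is worth timing on a sample first.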

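As for making the first solution work without regular expressions, one possible approach is to normalize the text so that every word is surrounded by single spaces and then use the plain `in` test on space-padded values. This is a sketch, not a drop-in replacement: the helper names `normalize` and `match_whole_words` are hypothetical, and the word boundaries differ slightly from `\b` (here every non-alphanumeric character, including the underscore, counts as a separator).

```python
def normalize(text):
    """Lower-case text, turn non-alphanumeric characters into spaces, pad with spaces."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    # Collapse runs of whitespace so that words are separated by exactly one space.
    return ' ' + ' '.join(cleaned.split()) + ' '

def match_whole_words(values, text):
    """Return upper-cased values that occur in text as whole words or whole phrases."""
    padded = normalize(text)
    return [v.upper() for v in values if normalize(v) in padded]

text = 'This is a test string from which I want to match multiple substrings'
values = ['test', 'match', 'multiple', 'ring', 'test string']
print(match_whole_words(values, text))
```

Here "ring" is rejected because the padded search looks for " ring ", which does not occur inside " string ", while the phrase "test string" is still found.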