问题描述
我需要基于unicode字符串构建一个re模式(例如,我有 word,并且我需要类似^ word | word的内容)。但是,单词可以包含特殊的re字符。要按原样匹配单词,我需要在unicode字符串中转义特殊的re字符。基本的re.escape()函数可完成ascii字符串的工作。
I need to build an re pattern based on the unicode string (e.g. I have "word", and I need something like ^"word"| "word"). However the "word" can contain special re characters. To match the "word" as it is, I need to escape special re characters in unicode string. The basic re.escape() function does the job for ascii strings. How can I do this for unicode?
推荐答案
re.escape()
在每个不是ASCII字母数字的字符之前插入反斜杠。实际上,这实际上可能会导致插入许多不必要的反斜杠,但是,Python会忽略不会启动公认的转义序列的反斜杠,因此不会造成很大的危害(可能会降低性能)。
re.escape()
inserts a backslash before every character that's not an ASCII alphanumeric. This may in fact lead to a multitude of unnecessary backslashes to be inserted, however, Python ignores backslashes that don't start a recognized escape sequence, so there is no big harm done (except possibly some performance penalty).
但是如果您想构建更严格的 escape()
,则可以:
But if you want to build a stricter escape()
, you can:
def escape(s):
return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)
仅涉及实际的正则表达式元字符。我当然希望我不会错过任何一个:)
which only touches the actual regex metacharacters. I sure hope I didn't miss any :)
这篇关于如何对正则表达式转义unicode字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!