如何对正则表达式转义unicode字符串

如何对正则表达式转义unicode字符串

本文介绍了如何对正则表达式转义unicode字符串?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要基于unicode字符串构建一个re模式(例如,我有 word,并且我需要类似^ word | word的内容)。但是,单词可以包含特殊的re字符。要按原样匹配单词,我需要在unicode字符串中转义特殊的re字符。基本的re.escape()函数可完成ascii字符串的工作。

I need to build an re pattern based on the unicode string (e.g. I have "word", and I need something like ^"word"| "word"). However the "word" can contain special re characters. To match the "word" as it is, I need to escape special re characters in unicode string. The basic re.escape() function does the job for ascii strings. How can I do this for unicode?

推荐答案

re.escape()在每个不是ASCII字母数字的字符之前插入反斜杠。实际上,这实际上可能会导致插入许多不必要的反斜杠,但是,Python会忽略不会启动公认的转义序列的反斜杠,因此不会造成很大的危害(可能会降低性能)。

re.escape() inserts a backslash before every character that's not an ASCII alphanumeric. This may in fact lead to a multitude of unnecessary backslashes to be inserted, however, Python ignores backslashes that don't start a recognized escape sequence, so there is no big harm done (except possibly some performance penalty).

但是如果您想构建更严格的 escape(),则可以:

But if you want to build a stricter escape(), you can:

def escape(s):
    return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)

仅涉及实际的正则表达式元字符。我当然希望我不会错过任何一个:)

which only touches the actual regex metacharacters. I sure hope I didn't miss any :)

这篇关于如何对正则表达式转义unicode字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-12 12:13