Problem description
I am confused by the backslash in regular expressions. Within a regex, \ has a special meaning; for example, \d matches a decimal digit. If you add a backslash in front of the backslash, this special meaning is lost. In the regex HOWTO one can read:
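Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.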
So print(re.search('\d', '\d')) gives None, because \d matches any decimal digit but there is none in \d.
I would now expect print(re.search('\\d', '\d')) to match \d, but the answer is still None.
Only print(re.search('\\\d', '\d')) gives the output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Can anyone explain this?
Recommended answer
The confusion comes from the backslash character \ being used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, and so on. To get an actual \ character, you escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this (recent Python versions emit a warning for unrecognized escapes). Instead, always escape your \ characters by doubling them, i.e. \\.
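As a small illustration of this first level, the sketch below checks how many characters a few literals actually contain once Python has processed them. The variable names are only for illustration.

```python
# Python processes each literal before any other code (such as re) sees it.
newline = '\n'      # one character: a line feed
tab = '\t'          # one character: a tab
backslash = '\\'    # one character: a single backslash
two_chars = '\\d'   # two characters: a backslash followed by 'd'

print(len(newline), len(tab), len(backslash), len(two_chars))  # 1 1 1 2
print(two_chars)  # \d
```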
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and you print that aggregate, Python will enclose the string in single quotes and include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
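A short sketch of the difference, using the same example string: print() shows the actual characters, while a list (or repr()) shows the escaped, quoted form.

```python
s = 'a\\b\tc'

print(s)        # the actual characters: a, backslash, b, tab, c
print([s])      # inside a list the string is shown via repr(): ['a\\b\tc']
print(repr(s))  # 'a\\b\tc'
print(len(s))   # 5
```

The doubled backslash in the last two forms is only how the string is displayed; the string itself still contains one backslash.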
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want the re module to match a literal \, you need to pass \\ to re, which means you need to write \\\\ in your quoted Python string. The Python string will end up containing \\, and the re module will treat this as a single literal \ character.
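Applied to the question, a minimal sketch looks like this. It writes every backslash doubled, so the third pattern from the question ('\\\d', which Python turns into the three characters \\d) appears here as '\\\\d'.

```python
import re

text = '\\d'  # the two-character string: backslash, d

# The pattern string is backslash + d, i.e. the regex \d (any digit).
# There is no digit in text, so nothing matches.
print(re.search('\\d', text))        # None

# The pattern string is \\d (three characters): the regex
# "a literal backslash, then d".
m = re.search('\\\\d', text)
print(m.span(), m.group())           # (0, 2) \d
```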
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
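With raw strings, the pattern from the question needs only the regex-level escaping. The following is a sketch of that idea, not the only way to write it:

```python
import re

# A raw string leaves backslashes alone, so these are the same three characters.
print(r'a\b' == 'a\\b')   # True

pattern = r'\\d'   # regex: a literal backslash followed by d
text = r'\d'       # the two characters: backslash, d

m = re.search(pattern, text)
print(m.group() == text)  # True
```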