Problem description
Python treats \uxxxx as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered that in Python 2.7 the standard re module does not treat \uxxxx in a pattern as a Unicode character. Example:
import re
codepoint = 2014  # Say I got this dynamically from somewhere (used here as the hex digits of U+2014)
test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert pattern[-5:] == "2014$"  # Ends with an escape sequence for U+2014
assert re.search(pattern, test) is None  # No match: re does not decode \u2014 (bad)
assert re.search(pattern, "u2014") is not None  # Matches the literal text "u2014" (bad)
Obviously, if you are able to specify your regex pattern as a string literal, you get the same effect as if the regex engine itself understood \uxxxx escapes:
import re
test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert pattern[:-1] == u"\u2014"  # The pattern holds the actual character U+2014, followed by $
assert re.search(pattern, test) is not None
But what if you need to construct your pattern dynamically?
Recommended answer
Use the unichr() function to create Unicode characters from a code point:
pattern = u"%s$" % unichr(codepoint)  # codepoint must be the integer value, e.g. 0x2014
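As a slightly fuller sketch of the same idea: the helper below (build_end_pattern is a hypothetical name, not part of re) assumes the code point arrives either as an integer or as a string of hex digits, and it passes the character through re.escape in case it happens to be a regex metacharacter. The chr fallback is only there so the snippet also runs on Python 3, where unichr no longer exists.

```python
import re

try:
    to_char = unichr  # Python 2
except NameError:
    to_char = chr     # Python 3: chr() already returns a Unicode character

def build_end_pattern(codepoint):
    """Return a pattern matching the given code point at the end of a string.

    Hypothetical helper: accepts an int (0x2014) or hex digits ("2014").
    """
    if not isinstance(codepoint, int):
        codepoint = int(codepoint, 16)  # interpret the digits as hexadecimal
    # re.escape guards against code points that are regex metacharacters.
    return re.escape(to_char(codepoint)) + u"$"

test = u"This string ends with \u2014"
pattern = build_end_pattern("2014")
assert re.search(pattern, test) is not None  # matches the real U+2014
assert re.search(pattern, u"u2014") is None  # no longer matches literal "u2014"
```

Converting the digits with int(..., 16) matters here: passing the decimal integer 2014 straight to unichr() would produce U+07DE rather than U+2014.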