Problem description
I want to run a Python source file that contains Unicode (UTF-8) characters in the source. I know this can be done by adding the comment # -*- coding: utf-8 -*- at the beginning of the file. However, I would like to do it without using that method.
One way I could think of is to write the Unicode strings in escaped form. For example (source updated to include a Unicode comment),
# Printing naïve and 男孩
def fxn():
    print 'naïve'
    print '男孩'
fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
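I have two questions regarding the above method.
- How do I convert the first code snippet into the equivalent that follows it, using Python? That is, only the Unicode sequences should be rewritten in escaped form.
- Is the method foolproof, given that only Unicode (UTF-8) characters are used? Is there anything that can go wrong?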
Recommended answer
If you only use byte strings and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data, and the coding declaration is not needed to interpret them (although it is really strange that you don't want to use it; it's just a comment). The coding declaration lets Python know the encoding of the source file so that it can decode Unicode strings (u'xxxxx') correctly. If you have no Unicode strings, it doesn't matter. Be aware, though, that under PEP 263 Python 2.5 and later raise a SyntaxError for non-ASCII bytes in a source file that declares no encoding, unless the file starts with a UTF-8 byte order mark. Without the comment, the file therefore has to be saved as UTF-8 with a BOM.
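As a small illustration of why the source encoding matters only when bytes have to be turned into characters, the sketch below decodes the same bytes under two different encodings. It uses only built-in methods, the variable names are illustrative, and the source is kept pure ASCII so it needs no coding declaration itself.

```python
# The UTF-8 bytes of "naive" with a diaeresis on the i, as a byte string
# would hold them when the source file is saved as UTF-8.
raw = b'na\xc3\xafve'

# Decoding depends on which encoding is assumed.
as_utf8 = raw.decode('utf-8')      # 0xC3 0xAF is read as one character
as_latin1 = raw.decode('latin-1')  # each byte is read as its own character

print('%d %d' % (len(as_utf8), len(as_latin1)))  # prints: 5 6
```

The byte string itself is identical in both cases; only the decoded text differs, which is the step a Unicode literal depends on.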
As for your questions, there is no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters directly in your byte strings.
FYI, that won't work in Python 3, because byte string literals cannot contain non-ASCII characters in that version.
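The Python 3 restriction can be checked without saving a file, by compiling a one-line source string. This is only a sketch: the source text and the names are made up for the demonstration, and the result described assumes a Python 3 interpreter.

```python
# Source text containing a bytes literal with a non-ASCII character.
# The character is written as \u00ef so this example itself stays ASCII.
source = u"data = b'na\u00efve'\n"

try:
    compile(source, '<example>', 'exec')
    accepted = True
except SyntaxError:
    # Python 3 rejects non-ASCII characters inside a bytes literal.
    accepted = False

print(accepted)
```

Under Python 3 this prints False. Python 2 accepts the same literal and stores the character's encoded bytes.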
That said, here's some code that converts your example as requested. It reads the source, assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters and passes each one through a conversion function to generate the replacement. This should be safe, since in Python 2 non-ASCII characters can only appear in string literals and comments. Python 3, however, allows non-ASCII characters in identifiers, so this wouldn't work there.
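import io
import re

def escape(m):
    # Encode the matched character as UTF-8 and emit one \xNN escape per byte (Python 2).
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)

# Read the original source as UTF-8 text.
with io.open('sample.py', encoding='utf8') as f:
    content = f.read()

# Replace every non-ASCII character with its escaped UTF-8 bytes.
new_content = re.sub(r'[^\x00-\x7f]', escape, content)

with io.open('sample_new.py', 'w', encoding='utf8') as f:
    f.write(new_content)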
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
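For experimenting without touching files, the same idea can be sketched as a function that works on a string in memory. This is a variant, not the code above: the names escape_non_ascii and sample are hypothetical, and it iterates over a bytearray so the helper also runs on Python 3. The escaped output is still only meaningful as Python 2 byte-string source.

```python
import re

def escape(match):
    # bytearray yields integers on both Python 2 and Python 3,
    # so each UTF-8 byte can be formatted directly as \xNN.
    encoded = bytearray(match.group(0).encode('utf-8'))
    return ''.join('\\x{:02x}'.format(byte) for byte in encoded)

def escape_non_ascii(text):
    # Replace every character outside the ASCII range with its escaped UTF-8 bytes.
    return re.sub(r'[^\x00-\x7f]', escape, text)

# Two lines of Python 2 source, written with \u escapes to keep this example ASCII.
sample = u"print 'na\u00efve'\nprint '\u7537\u5b69'\n"
converted = escape_non_ascii(sample)
print(converted)
```

The converted text contains only ASCII characters, so it can be saved in any ASCII-compatible encoding without a coding declaration.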