If I specify the character encoding (as suggested by PEP 263) in the "magic line" or shebang of a python module like
# -*- coding: utf-8 -*-
can I retrieve this encoding from within that module?
(Working on Windows 7 x64 with Python 2.7.9)
I tried (without success) to retrieve the default encoding or shebang
# -*- coding: utf-8 -*-
import sys
from shebang import shebang
print "sys.getdefaultencoding():", sys.getdefaultencoding()
print "shebang:", shebang( __file__.rstrip("oc"))
will yield:
(same for iso-8859-1)
I'd borrow the Python 3 tokenize.detect_encoding()
function in Python 2, adjusted a little to match Python 2 expectations. I've changed the function signature to accept a filename and dropped the inclusion of the lines read so far; you don't need those for your use case:
import re
from codecs import lookup, BOM_UTF8
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')
def _get_normal_name(orig_enc):
    """Imitates get_normal_name in tokenizer.c."""
    # Only care about the first 12 characters.
    enc = orig_enc[:12].lower().replace("_", "-")
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"
    if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
       enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
        return "iso-8859-1"
    return orig_enc

def detect_encoding(filename):
    bom_found = False
    encoding = None
    default = 'ascii'

    def find_cookie(line):
        match = cookie_re.match(line)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        try:
            codec = lookup(encoding)
        except LookupError:
            # This behaviour mimics the Python interpreter
            raise SyntaxError(
                "unknown encoding for {!r}: {}".format(
                    filename, encoding))

        if bom_found:
            if encoding != 'utf-8':
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    'encoding problem for {!r}: utf-8'.format(filename))
            encoding += '-sig'
        return encoding

    with open(filename, 'rb') as fileobj:
        first = next(fileobj, '')
        if first.startswith(BOM_UTF8):
            bom_found = True
            first = first[3:]
            default = 'utf-8-sig'
        if not first:
            return default

        encoding = find_cookie(first)
        if encoding:
            return encoding
        if not blank_re.match(first):
            return default

        second = next(fileobj, '')
        if not second:
            return default

        return find_cookie(second) or default
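To see what the cookie regex actually accepts, here is a small standalone sketch; the pattern is copied from the code above, and the sample lines are illustrative. Because the pattern implements PEP 263's cookie syntax, both Emacs-style and Vim-style declarations match:

```python
import re

# Same pattern as cookie_re above (PEP 263 coding-cookie syntax)
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

samples = [
    '# -*- coding: utf-8 -*-',            # Emacs-style cookie
    '# vim: set fileencoding=latin-1 :',  # Vim-style cookie
    'import this',                        # no cookie at all
]
for line in samples:
    match = cookie_re.match(line)
    print(repr(line), '->', match.group(1) if match else None)
```

Note that the match only extracts the raw name; normalizing aliases such as `latin-1` to `iso-8859-1` is the job of `_get_normal_name` above.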
Like the original function, the function above reads at most two lines from the source file, and raises a SyntaxError
exception if the encoding named in the cookie is invalid, or is not UTF-8 while a UTF-8 BOM is present.
Demo:
>>> import tempfile
>>> def test(contents):
... with tempfile.NamedTemporaryFile() as f:
... f.write(contents)
... f.flush()
... return detect_encoding(f.name)
...
>>> test('# -*- coding: utf-8 -*-\n')
'utf-8'
>>> test('#!/bin/env python\n# -*- coding: latin-1 -*-\n')
'iso-8859-1'
>>> test('import this\n')
'ascii'
>>> import codecs
>>> test(codecs.BOM_UTF8 + 'import this\n')
'utf-8-sig'
>>> test(codecs.BOM_UTF8 + '# encoding: latin-1\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in test
File "<string>", line 37, in detect_encoding
File "<string>", line 24, in find_cookie
SyntaxError: encoding problem for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpxsqH8L': utf-8
>>> test('# encoding: foobarbaz\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in test
File "<string>", line 37, in detect_encoding
File "<string>", line 18, in find_cookie
SyntaxError: unknown encoding for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpHiHdG3': foobarbaz
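As a closing note for readers on Python 3: the standard library exposes this machinery directly. `tokenize.detect_encoding()` takes a readline callable yielding bytes and returns the normalized encoding together with the raw lines it consumed (and `tokenize.open()` opens a source file already decoded with the detected encoding). A minimal sketch:

```python
import io
import tokenize

source = b'#!/bin/env python\n# -*- coding: latin-1 -*-\nprint("hi")\n'

# detect_encoding() expects a readline callable over bytes
encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # 'iso-8859-1' -- latin-1 is normalized, as above
print(consumed)  # the raw line(s) read while looking for the cookie
```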