This article explains how to retrieve the encoding specified in a module's magic line / shebang from within the module itself. It should serve as a useful reference for anyone tackling the same problem.

Problem Description


If I specify the character encoding (as suggested by PEP 263) in the "magic line" or shebang of a Python module like

# -*- coding: utf-8 -*-

can I retrieve this encoding from within that module?

(Working on Windows 7 x64 with Python 2.7.9)


I tried (without success) to retrieve the default encoding or shebang

# -*- coding: utf-8 -*-

import sys
from shebang import shebang

print "sys.getdefaultencoding():", sys.getdefaultencoding()
print "shebang:", shebang( __file__.rstrip("oc"))

will yield:

(same for iso-8859-1)

Solution

I'd borrow the Python 3 tokenize.detect_encoding() function in Python 2, adjusted a little to match Python 2 expectations. I've changed the function signature to accept a filename and dropped the inclusion of the lines read so far; you don't need those for your use case:

import re
from codecs import lookup, BOM_UTF8

cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

def _get_normal_name(orig_enc):
    """Imitates get_normal_name in tokenizer.c."""
    # Only care about the first 12 characters.
    enc = orig_enc[:12].lower().replace("_", "-")
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"
    if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
       enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
        return "iso-8859-1"
    return orig_enc

def detect_encoding(filename):
    bom_found = False
    encoding = None
    default = 'ascii'

    def find_cookie(line):
        match = cookie_re.match(line)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        try:
            codec = lookup(encoding)
        except LookupError:
            # This behaviour mimics the Python interpreter
            raise SyntaxError(
                "unknown encoding for {!r}: {}".format(
                    filename, encoding))

        if bom_found:
            if encoding != 'utf-8':
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    'encoding problem for {!r}: utf-8'.format(filename))
            encoding += '-sig'
        return encoding

    with open(filename, 'rb') as fileobj:
        first = next(fileobj, '')
        if first.startswith(BOM_UTF8):
            bom_found = True
            first = first[3:]
            default = 'utf-8-sig'
        if not first:
            return default

        encoding = find_cookie(first)
        if encoding:
            return encoding
        if not blank_re.match(first):
            return default

        second = next(fileobj, '')

    if not second:
        return default
    return find_cookie(second) or default

Like the original function, the above function reads two lines max from the source file, and will raise a SyntaxError exception if the encoding in the cookie is invalid or is not UTF-8 while a UTF-8 BOM is present.
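As an aside, on Python 3 the same detection is available directly from the standard library as tokenize.detect_encoding(), so nothing needs to be copied there. A minimal sketch, Python 3 only and therefore not applicable to the 2.7.9 setup from the question ('some_module.py' is just a placeholder path):

import tokenize

# Python 3: detect_encoding() reads at most the first two lines of the
# file and returns the declared (or default) encoding together with the
# raw lines it consumed.
with open('some_module.py', 'rb') as source:
    encoding, consumed_lines = tokenize.detect_encoding(source.readline)
print(encoding)  # e.g. 'utf-8'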

Demo:

>>> import tempfile
>>> def test(contents):
...     with tempfile.NamedTemporaryFile() as f:
...         f.write(contents)
...         f.flush()
...         return detect_encoding(f.name)
...
>>> test('# -*- coding: utf-8 -*-\n')
'utf-8'
>>> test('#!/bin/env python\n# -*- coding: latin-1 -*-\n')
'iso-8859-1'
>>> test('import this\n')
'ascii'
>>> import codecs
>>> test(codecs.BOM_UTF8 + 'import this\n')
'utf-8-sig'
>>> test(codecs.BOM_UTF8 + '# encoding: latin-1\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in test
  File "<string>", line 37, in detect_encoding
  File "<string>", line 24, in find_cookie
SyntaxError: encoding problem for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpxsqH8L': utf-8
>>> test('# encoding: foobarbaz\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in test
  File "<string>", line 37, in detect_encoding
  File "<string>", line 18, in find_cookie
SyntaxError: unknown encoding for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpHiHdG3': foobarbaz
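
To tie this back to the original question of doing the lookup from within the module itself, here is a minimal sketch assuming the detect_encoding() function above is defined in (or importable by) that module; the rstrip('co') trick from the question's own snippet turns a compiled .pyc/.pyo path back into the .py source path:

# -*- coding: utf-8 -*-
# Hypothetical self-check: a module probing its own coding cookie,
# assuming detect_encoding() from the answer above is in scope.

if __name__ == '__main__':
    source_path = __file__.rstrip('co')  # .pyc / .pyo -> .py
    print "this module declares:", detect_encoding(source_path)
    # expected: this module declares: utf-8

Whether __file__ points at the .py file or the compiled .pyc depends on how the module was loaded, hence the defensive rstrip(), exactly as in the snippet from the question.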

That concludes this article on retrieving the encoding specified in the magic line / shebang from within the module. We hope the answer above proves helpful.
