How do I programmatically find the list of codecs known to Python?

Problem description

I know that I can do the following:

>>> import encodings, pprint
>>> pprint.pprint(sorted(encodings.aliases.aliases.values()))
['ascii',
 'base64_codec',
 'big5',
 'big5hkscs',
 'bz2_codec',
 'cp037',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp424',
 'cp437',
 'cp500',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp932',
 'cp949',
 'cp950',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_jp',
 'euc_kr',
 'gb18030',
 'gb2312',
 'gbk',
 'hex_codec',
 'hp_roman8',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'johab',
 'koi8_r',
 'latin_1',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'mbcs',
 'ptcp154',
 'quopri_codec',
 'rot_13',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'tactis',
 'tis_620',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_7',
 'utf_8',
 'uu_codec',
 'zlib_codec']

I also know for sure that this is not a complete list, since it includes only encodings for which an alias exists (e.g. "cp737" is missing), and at least some pseudo-encodings are missing (e.g. "string_escape").
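For instance, a name that is absent from the alias table can still resolve on demand. This quick check (my own illustration, assuming a standard CPython build) shows "cp737" loading even though no alias points to it:

```python
import codecs
import encodings

# "cp737" never appears among the alias targets...
assert 'cp737' not in encodings.aliases.aliases.values()

# ...yet the codec loads fine once it is requested by name
info = codecs.lookup('cp737')
print(info.name)  # cp737
```

codecs.lookup triggers exactly the on-demand module search described in the solution below.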

As the title of the question says: how can I programmatically get a list of all codecs/encodings known to Python?

If not programmatically: is there a complete list available online?

Solution

I don't think the complete list is stored anywhere in the Python standard library. Instead, encodings are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, it looks like the encoding string is first normalized, and then the encodings package is searched for a submodule whose name matches the normalized encoding.
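As a small illustration of that lookup path (my own sketch, not part of the original answer), encodings.normalize_encoding shows the name munging, and codecs.lookup resolves spelling variants to one canonical CodecInfo:

```python
import codecs
import encodings

# Non-alphanumeric characters are collapsed to underscores, which is how
# 'UTF-8' ends up matching the encodings.utf_8 submodule
print(encodings.normalize_encoding('UTF-8'))  # UTF_8

# codecs.lookup performs the full search (lower-casing included), so both
# spellings resolve to the same codec
assert codecs.lookup('UTF-8').name == codecs.lookup('utf8').name == 'utf-8'
```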

The following uses pkgutil to list all the submodules of the encodings package, and then adds to them the names listed in encodings.aliases.aliases.

Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that is not found by the module search above, so I try to generate the complete list by taking the union of the two sets.

import encodings
import os
import pkgutil

modnames = {modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')}
aliases = set(encodings.aliases.aliases.values())

print(modnames - aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])

print(aliases - modnames)
# set(['tactis'])

codec_names = modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
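One caveat worth noting (my addition, not from the original answer): modnames also picks up helper submodules such as aliases that are not codecs at all, and some entries ('mbcs', 'tactis') only resolve on certain platforms. A defensive follow-up is to keep only the names that codecs.lookup can actually load:

```python
import codecs
import encodings
import os
import pkgutil

# Same candidate set as above: submodules of the encodings package
# plus every alias target
candidates = {modname for _, modname, _ in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')}
candidates |= set(encodings.aliases.aliases.values())

usable = set()
for name in candidates:
    try:
        codecs.lookup(name)  # raises LookupError for non-codecs like 'aliases'
    except LookupError:
        continue
    usable.add(name)

print(sorted(usable))
```

The resulting set is platform-dependent ('mbcs' only survives the filter on Windows, for example), which is arguably the point: these are the codecs your interpreter can actually use.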
