Problem description
I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to read and rewrite them.
It seems to me that the appropriate way to do this is to create a custom codec using the standard codecs library. Unfortunately its documentation is both colossal and entirely bereft of examples. Google turns up a few, but only for Python 2, and I'm using Python 3.
I'm looking for a minimal example of how to use the codecs library to implement a custom character encoding.
Recommended answer
You asked for minimal!
- Write an encode function and a decode function.
- Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
- Use codecs.register to register a function that returns the above CodecInfo object.
Here is an example that converts the lowercase letters a-z to 0-25 in order.
import codecs
import string
from typing import Optional, Tuple

# prepare map from numbers (as strings) to letters
_encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}

# prepare inverse map (byte value -> number string)
_decode_table = {ord(v): k for k, v in _encode_table.items()}


def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    # example encoder that converts digit characters to letters
    # it works one character at a time, so only 0-9 (a-j) round-trip cleanly
    # errors is accepted for API compatibility but ignored in this minimal example
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
    return b''.join(_encode_table[x] for x in text), len(text)


def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # example decoder that converts letters to numbers
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
    return ''.join(_decode_table[x] for x in binary), len(binary)


def custom_search_function(encoding_name: str) -> Optional[codecs.CodecInfo]:
    # the lookup passes the name in lower case; return None for any other encoding
    if encoding_name.lower() != 'reasons':
        return None
    return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')


def main():
    # register your custom codec
    # note that CodecInfo.name is used later
    codecs.register(custom_search_function)

    binary = b'abcdefg'
    # decode letters to numbers
    text = codecs.decode(binary, encoding='Reasons')
    print(text)
    # encode numbers to letters
    binary2 = codecs.encode(text, encoding='Reasons')
    print(binary2)
    # encode(decode(...)) should be an identity function
    assert binary == binary2


if __name__ == '__main__':
    main()
Running this prints:
$ python codec_example.py
0123456
b'abcdefg'
See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function
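... decodes the object input and returns a tuple (output object, length consumed).
and the encode function
... encodes the object input and returns a tuple (output object, length consumed).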
Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
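As a rough illustration of the error-handling and incremental parts, here is a sketch under stated assumptions rather than a complete codec. The table (bytes 0x00-0x19 for 'A'-'Z', 0x1A for a space), the codec name demo_gametext, and all function and class names are made up for this example. Only the 'strict', 'ignore' and 'replace' error modes are handled; any other mode falls back to strict behaviour.

```python
import codecs
import string

# Hypothetical table for illustration: 0x00-0x19 -> 'A'-'Z', 0x1A -> ' '
_DECODE = {i: ch for i, ch in enumerate(string.ascii_uppercase + ' ')}
_ENCODE = {ch: i for i, ch in _DECODE.items()}
_NAME = 'demo_gametext'  # made-up codec name


def demo_encode(text, errors='strict'):
    out = bytearray()
    for pos, ch in enumerate(text):
        if ch in _ENCODE:
            out.append(_ENCODE[ch])
        elif errors == 'ignore':
            continue
        elif errors == 'replace':
            out.append(_ENCODE[' '])  # substitute a space
        else:
            # 'strict' and any unsupported mode raise
            raise UnicodeEncodeError(_NAME, text, pos, pos + 1,
                                     'character not in table')
    return bytes(out), len(text)


def demo_decode(data, errors='strict'):
    data = bytes(data)
    chars = []
    for pos, byte in enumerate(data):
        if byte in _DECODE:
            chars.append(_DECODE[byte])
        elif errors == 'ignore':
            continue
        elif errors == 'replace':
            chars.append('\ufffd')  # Unicode replacement character
        else:
            raise UnicodeDecodeError(_NAME, data, pos, pos + 1,
                                     'byte not in table')
    return ''.join(chars), len(data)


# One byte per character, so no state has to be kept between chunks.
class DemoIncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return demo_encode(input, self.errors)[0]


class DemoIncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return demo_decode(input, self.errors)[0]


def demo_search(name):
    # Only answer for our own name; return None for everything else.
    if name != _NAME:
        return None
    return codecs.CodecInfo(
        demo_encode, demo_decode,
        incrementalencoder=DemoIncrementalEncoder,
        incrementaldecoder=DemoIncrementalDecoder,
        name=_NAME,
    )


codecs.register(demo_search)

raw = bytes([7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3])
print(raw.decode(_NAME))                                # HELLO WORLD
print('HI!'.encode(_NAME, errors='replace'))            # '!' becomes a space
print(''.join(codecs.iterdecode([raw[:3], raw[3:]], _NAME)))  # chunked input
```

Because each character is exactly one byte here, the incremental classes can simply delegate to the stateless functions. A multi-byte encoding would need to buffer incomplete sequences between calls.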
P.S. Instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons').
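One caveat worth sketching: codecs.open wraps the file in stream reader and writer classes, so the registered CodecInfo has to supply them, and a codec built from only an encode and a decode function is not enough for it. The example below is a minimal sketch under assumptions. The table (bytes 0x00-0x19 for 'a'-'z'), the codec name demo_letters, and every function and class name are invented for illustration, and error handling is left out. Recent Python releases may also emit a DeprecationWarning for codecs.open.

```python
import codecs
import os
import tempfile

# Hypothetical one-byte table for illustration: 0x00-0x19 <-> 'a'-'z'
_TO_CHAR = {i: chr(ord('a') + i) for i in range(26)}
_TO_BYTE = {ch: i for i, ch in _TO_CHAR.items()}
_NAME = 'demo_letters'  # made-up codec name


def letters_encode(text, errors='strict'):
    # Strict-only sketch: a character outside the table raises KeyError.
    return bytes(_TO_BYTE[ch] for ch in text), len(text)


def letters_decode(data, errors='strict'):
    return ''.join(_TO_CHAR[b] for b in bytes(data)), len(data)


# codecs.open needs these two classes; they delegate to the functions above.
class LettersStreamWriter(codecs.StreamWriter):
    def encode(self, input, errors='strict'):
        return letters_encode(input, errors)


class LettersStreamReader(codecs.StreamReader):
    def decode(self, input, errors='strict'):
        return letters_decode(input, errors)


def letters_search(name):
    if name != _NAME:
        return None
    return codecs.CodecInfo(
        letters_encode, letters_decode,
        streamreader=LettersStreamReader,
        streamwriter=LettersStreamWriter,
        name=_NAME,
    )


codecs.register(letters_search)

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# Write text through the codec, then inspect the raw bytes on disk.
with codecs.open(path, 'w', encoding=_NAME) as f:
    f.write('hello')
with open(path, 'rb') as f:
    raw = f.read()

# Read the file back as text through the same codec.
with codecs.open(path, 'r', encoding=_NAME) as f:
    text = f.read()

print(raw, text)
```

The built-in open(..., encoding=...) looks codecs up in the same registry, but it relies on the incremental encoder and decoder classes rather than the stream classes.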