Question
The closest contenders I have found so far are yEnc (2% overhead) and ASCII85 (25% overhead). yEnc seems to have some issues, mainly because it uses an 8-bit character set. That leads to another thought: is there a binary-to-text encoding based on the UTF-8 character set?
Recommended answer
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass all characters through equally. For example, you may require pure ASCII text, whose printable characters range from 0x20 to 0x7E. That gives you 95 characters to play with, and each one can theoretically encode log2(95) ~= 6.57 bits. It's easy to define a transform that comes pretty close.
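As a rough illustration of the arithmetic above, the following sketch computes the theoretical bits per character and the resulting size overhead for a few alphabet sizes. The function names are made up for this example, and the overhead figure assumes every output character occupies one 8-bit byte on the wire.

```python
import math


def bits_per_char(alphabet_size: int) -> float:
    """Theoretical information carried by one character of the alphabet."""
    return math.log2(alphabet_size)


def overhead_percent(alphabet_size: int) -> float:
    """Extra output size relative to raw bytes.

    Assumption: each output character is stored as one 8-bit byte.
    """
    return (8 / bits_per_char(alphabet_size) - 1) * 100


if __name__ == "__main__":
    # 64 = Base64-sized alphabet, 85 = ASCII85-sized, 94/95 = printable ASCII
    for n in (64, 85, 94, 95):
        print(f"{n:3d} chars: {bits_per_char(n):.2f} bits/char, "
              f"{overhead_percent(n):.1f}% overhead")
```

These are upper bounds on what an encoding over that alphabet can achieve; a practical scheme will usually land slightly below them.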
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It can transport the values 0x01-0x7F with only about 14% overhead. 0x00 is technically valid UTF-8, but many text channels mishandle it, so it is safer to avoid. Anything from 0x80 upward expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 127 unique characters. If you don't need delimiters, you can transmit about 6.99 bits per character.
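To see the expansion concretely, here is a small check that uses Python's built-in UTF-8 encoder and treats each byte value as a Unicode code point. The helper name utf8_len is made up for this example.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes Python's UTF-8 encoder emits for one code point."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for cp in (0x00, 0x01, 0x7F, 0x80, 0xFF):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {utf8_len(cp)} byte(s): {encoded.hex()}")
```

Values up to 0x7F stay at one byte each, while 0x80 through 0xFF take two bytes, which is why only the 7-bit range is worth using on such a channel.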
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A + B = 95 (total number of symbols)
2A + B = 128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:

Raw      Encoded
000000   0000000
000001   0000001
...
100000   0100000
1000010  0100001
1000011  0100010
...
1111110  1011101
1111111  1011110
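The symbol table can also be generated rather than written out by hand. The sketch below solves the two equations for a given alphabet size and builds the raw-to-encoded mapping; build_table is a name invented for this example, and the layout (6-bit symbols first, 7-bit symbols after them) simply follows the table in this answer.

```python
def build_table(n_symbols: int = 95):
    """Return (a, b, table) for an alphabet of n_symbols characters.

    table maps each raw bit string to its output code in 0..n_symbols-1.
    """
    short_bits = n_symbols.bit_length() - 1   # 6 when n_symbols is 95
    long_bits = short_bits + 1                # 7 when n_symbols is 95
    # Solve  a + b = n_symbols  and  2a + b = 2**long_bits
    a = 2 ** long_bits - n_symbols            # number of short symbols
    b = n_symbols - a                         # number of long symbols
    table = {}
    for code in range(a):
        # Short symbols: the raw value equals the output code
        table[format(code, f"0{short_bits}b")] = code
    for code in range(a, n_symbols):
        # Long symbols: the raw value is offset by a so the set stays prefix-free
        table[format(code + a, f"0{long_bits}b")] = code
    return a, b, table


if __name__ == "__main__":
    a, b, table = build_table()
    print(f"A={a}, B={b}")
    for raw, code in table.items():
        print(f"{raw:<8} {code:07b}")
```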
To encode, first shift off 6 bits of input. If those six bits are greater than or equal to 100001, shift off one more bit. Then look up the corresponding 7-bit output code, translate it to fit the output space, and send it. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
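The encode and decode steps can be sketched as follows, working on bit strings for readability rather than speed. Several details here are assumptions not covered by the answer: codes 0..94 are mapped onto printable ASCII 0x20..0x7E, the final symbol is completed with zero padding, and the original byte count is passed to the decoder out of band so the padding can be dropped.

```python
SHORT, LONG, N = 6, 7, 95
A = 2 ** LONG - N   # 33 symbols carry 6 bits; the other 62 carry 7 bits


def encode(data: bytes) -> str:
    total_bits = len(data) * 8
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * LONG                  # assumption: zero-pad the final symbol
    out, pos = [], 0
    while pos < total_bits:
        value = int(bits[pos:pos + SHORT], 2)
        if value >= A:                  # 100001 or above: take one more bit
            value = int(bits[pos:pos + LONG], 2) - A
            pos += LONG
        else:
            pos += SHORT
        out.append(chr(0x20 + value))   # assumption: output space is 0x20..0x7E
    return "".join(out)


def decode(text: str, n_bytes: int) -> bytes:
    chunks = []
    for ch in text:
        code = ord(ch) - 0x20
        if code < A:                    # below 0100001: a 6-bit symbol
            chunks.append(f"{code:06b}")
        else:                           # otherwise a 7-bit symbol
            chunks.append(f"{code + A:07b}")
    bits = "".join(chunks)[: n_bytes * 8]   # drop the padding bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))


if __name__ == "__main__":
    payload = bytes(range(256))
    text = encode(payload)
    assert decode(text, len(payload)) == payload
    print(len(payload), "bytes ->", len(text), "characters")
```

A real format would need its own convention for signalling the length or the end of data; passing n_bytes separately just keeps the sketch short.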
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
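For a sense of how close this comes to the log2(95) figure quoted earlier, the expected number of input bits consumed per output character can be worked out exactly. The sketch assumes the input bits are independent and uniformly random, so a 6-bit symbol is chosen whenever the next six bits fall below 100001; the function name is made up for this example.

```python
from fractions import Fraction
import math


def expected_bits_per_char(a: int = 33, b: int = 62, short_bits: int = 6) -> Fraction:
    """Average input bits consumed per output character for uniform random input."""
    p_short = Fraction(a, 2 ** short_bits)        # chance of emitting a 6-bit symbol
    p_long = Fraction(b, 2 ** (short_bits + 1))   # chance of emitting a 7-bit symbol
    assert p_short + p_long == 1                  # the two cases cover every input
    return short_bits * p_short + (short_bits + 1) * p_long


if __name__ == "__main__":
    rate = expected_bits_per_char()
    print(f"expected: {float(rate):.3f} bits/char, "
          f"overhead {(8 / float(rate) - 1) * 100:.1f}%")
    print(f"log2(95): {math.log2(95):.3f} bits/char")
```

Under these assumptions the scheme averages 415/64, or about 6.48 bits per character, slightly under the log2(95) ~= 6.57 ceiling.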