Problem description
Why do the following two decoding methods return different results?
>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']
Is this a bug or expected behavior? I'm using Python 2.7.13.
Recommended answer
This is expected behavior. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it doesn't promise a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
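To make that guarantee concrete, here is a small sketch using the chunks from the question. It uses bytes literals so it runs on both Python 2.7 and Python 3; the variable names are only illustrative.

```python
import codecs

# Encoded input chunks, including empty ones.
chunks = [b'', b'', b'a', b'']

decoded_chunks = list(codecs.iterdecode(chunks, 'utf-8'))

# The number of output chunks need not match the number of input chunks,
# but joining the outputs matches decoding the joined input.
joined_output = u''.join(decoded_chunks)
joined_input_decoded = b''.join(chunks).decode('utf-8')

print(len(chunks), len(decoded_chunks))
print(joined_output == joined_input_decoded)
```

The chunk counts differ, while the joined results compare equal, which is the only property iterdecode promises.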
If you look at the source code, you'll see that it explicitly discards empty output chunks:
def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.

    Decodes the input strings from the iterator using an IncrementalDecoder.

    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
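If you do need one output per input chunk, one option is to write your own loop around the incremental decoder and skip the emptiness check. The following is only a sketch: iterdecode_keep_empty is a hypothetical helper, not part of the codecs module.

```python
import codecs

def iterdecode_keep_empty(iterator, encoding, errors='strict'):
    """Hypothetical variant of iterdecode that yields one output per input chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in iterator:
        # Yield even when the result is empty, so positions stay aligned.
        yield decoder.decode(chunk)
    # Flush buffered state; in 'strict' mode this raises on a truncated character.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
print(result)
```

Note that the alignment is only positional. If a character is split across two chunks, the first chunk still produces an empty string and the second produces the whole character.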
Be aware that the reason iterdecode exists, and the reason you wouldn't just call decode on each chunk yourself, is that the decoding process is stateful. The UTF-8 encoded form of a single character might be split across multiple chunks. Other codecs might have really weird stateful behavior, such as a byte sequence that inverts the case of all following characters until that byte sequence appears again.
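A minimal sketch of the split-character case, assuming a two-byte UTF-8 character (U+00E9) that is cut between its bytes; the variable names are illustrative:

```python
import codecs

# U+00E9 encodes to two bytes in UTF-8; split them across two chunks.
encoded = u'\u00e9'.encode('utf-8')
chunks = [encoded[:1], encoded[1:]]

# Decoding each chunk independently fails on the incomplete sequence.
try:
    [chunk.decode('utf-8') for chunk in chunks]
    independent_ok = True
except UnicodeDecodeError:
    independent_ok = False

# The incremental decoder behind iterdecode buffers the partial bytes
# until the rest of the character arrives.
incremental = list(codecs.iterdecode(chunks, 'utf-8'))

print(independent_ok)
print(incremental == [u'\u00e9'])
```

Here the per-chunk decode raises UnicodeDecodeError, while iterdecode returns the complete character as a single output chunk.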