python: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

Problem description

I'm trying to write a script that generates random Unicode by creating random UTF-8 encoded byte strings and then decoding them to Unicode. It works fine for a single byte, but fails with two bytes.

For instance, if I run the following in a python shell:

>>> a = str()
>>> a += chr(0xc0) + chr(0xaf)
>>> print a.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

According to the UTF-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, since 0xc0 starts with the bits 110 and 0xaf starts with the bits 10.


Here's my python script:

import random

class GenFuzz(object):
    def unicode(self):
        '''returns a random (astral) utf encoded byte string'''
        num_bytes = random.randint(1,4)
        if num_bytes == 1:
            return self.gen_utf8(num_bytes, 0x00, 0x7F)
        elif num_bytes == 2:
            return self.gen_utf8(num_bytes, 0xC0, 0xDF)
        elif num_bytes == 3:
            return self.gen_utf8(num_bytes, 0xE0, 0xEF)
        elif num_bytes == 4:
            return self.gen_utf8(num_bytes, 0xF0, 0xF7)

    def gen_utf8(self, num_bytes, start_val, end_val):
        byte_str = list()
        byte_str.append(random.randrange(start_val, end_val)) # start byte
        for i in range(0,num_bytes-1):
            byte_str.append(random.randrange(0x80,0xBF)) # trailing bytes
        a = str()
        for b in byte_str:
            a += chr(b)
        ret = a.decode('utf-8') # raises UnicodeDecodeError for some byte combinations
        return ret

if __name__ == "__main__":
    g = GenFuzz()
    print g.gen_utf8(2,0xC0,0xDF)

Solution

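This is indeed invalid UTF-8. A two-byte sequence may only encode code points in the range U+0080 to U+07FF, inclusive. The sequence C0 AF would carry the value U+002F ('/'), which lies below that range and must be encoded as the single byte 0x2F; such "overlong" encodings are forbidden. Read the Wikipedia article more closely, and you will see the same thing. As a result, the byte 0xc0 can never appear in valid UTF-8. The same is true of 0xc1.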
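
The point above can be checked directly. The following is a minimal sketch written for Python 3 (the question's code is Python 2); the helper names two_byte_lead_range and is_valid_utf8 are made up for illustration. It extracts the payload bits that C0 AF would carry, shows which lead bytes the two-byte range actually produces, and asks Python's strict decoder about both sequences.

```python
# Python 3 sketch; helper names are hypothetical, chosen for this example.

def two_byte_lead_range():
    """Return the lead bytes produced by encoding U+0080 and U+07FF."""
    lowest = chr(0x80).encode("utf-8")    # smallest two-byte code point
    highest = chr(0x7FF).encode("utf-8")  # largest two-byte code point
    return lowest[0], highest[0]


def is_valid_utf8(data):
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Payload of a two-byte sequence: 5 bits from the lead byte (110xxxxx)
# followed by 6 bits from the continuation byte (10xxxxxx).
overlong_value = ((0xC0 & 0x1F) << 6) | (0xAF & 0x3F)

print(hex(overlong_value))                      # 0x2f, the code point of '/'
print([hex(b) for b in two_byte_lead_range()])  # ['0xc2', '0xdf']
print(is_valid_utf8(b"\xc0\xaf"))               # False (overlong encoding)
print(is_valid_utf8(b"\xc2\xaf"))               # True (U+00AF)
```

So the valid lead bytes for two-byte sequences start at 0xC2, not 0xC0.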

Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has led to security vulnerabilities in the past.
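
Returning to the generator in the question: picking lead and continuation bytes independently runs into several further restrictions besides C0 and C1. Lead bytes above 0xF4 are not valid, the UTF-16 surrogate range U+D800 to U+DFFF is excluded, and random.randrange never returns its upper bound, so 0xBF and the last lead byte of each range are never produced. One way to sidestep all of this, sketched below for Python 3, is to pick a random code point in the range that matches the desired length and let the encoder produce the bytes. This is a different technique from the one in the question, and the names RANGES and random_code_point are assumptions made for this example.

```python
import random

# Code point ranges for each encoded length in UTF-8 (inclusive bounds).
RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}


def random_code_point(num_bytes, rng=random):
    """Pick a random code point whose UTF-8 encoding has num_bytes bytes."""
    low, high = RANGES[num_bytes]
    while True:
        code_point = rng.randint(low, high)  # randint includes both bounds
        # Surrogates cannot be encoded as UTF-8, so draw again.
        if not 0xD800 <= code_point <= 0xDFFF:
            return code_point


def gen_utf8(num_bytes, rng=random):
    """Return a random, valid UTF-8 byte string of exactly num_bytes bytes."""
    return chr(random_code_point(num_bytes, rng)).encode("utf-8")


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        data = gen_utf8(n)
        print(n, data, repr(data.decode("utf-8")))
```

Because every result comes from the encoder, decoding it again cannot fail. If the goal is to fuzz a decoder with invalid input as well, the byte-level approach from the question would still be needed for that part.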

