python: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

Problem description

I'm trying to write a script that generates random Unicode by creating random UTF-8 encoded byte strings and then decoding them to Unicode. It works fine for a single byte, but it fails with two bytes.

For instance, if I run the following in a Python shell:

>>> a = str()
>>> a += chr(0xc0) + chr(0xaf)
>>> print a.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

According to the UTF-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, since 0xc0 starts with the bits 110 and 0xaf starts with the bits 10.

Here's my Python script:

import random

class GenFuzz(object):
    def unicode(self):
        '''returns a random (astral) utf encoded byte string'''
        num_bytes = random.randint(1, 4)
        # Pick the start-byte range that matches the sequence length.
        if num_bytes == 1:
            return self.gen_utf8(num_bytes, 0x00, 0x7F)
        elif num_bytes == 2:
            return self.gen_utf8(num_bytes, 0xC0, 0xDF)
        elif num_bytes == 3:
            return self.gen_utf8(num_bytes, 0xE0, 0xEF)
        elif num_bytes == 4:
            return self.gen_utf8(num_bytes, 0xF0, 0xF7)

    def gen_utf8(self, num_bytes, start_val, end_val):
        byte_str = list()
        # Note: randrange excludes its upper bound.
        byte_str.append(random.randrange(start_val, end_val)) # start byte
        for i in range(0, num_bytes - 1):
            byte_str.append(random.randrange(0x80, 0xBF)) # trailing bytes
        # Join the byte values into a Python 2 byte string, then decode it.
        a = str()
        for b in byte_str:
            a += chr(b)
        ret = a.decode('utf-8')
        return ret

if __name__ == "__main__":
    g = GenFuzz()
    print g.gen_utf8(2, 0xC0, 0xDF)

Solution

This is, indeed, invalid UTF-8. In UTF-8, only code points in the range U+0080 to U+07FF, inclusive, can be encoded using two bytes. Read the Wikipedia article more closely, and you will see the same thing. A two-byte sequence starting with 0xc0 would carry a code point below U+0080, which must use the one-byte form instead, so the byte 0xc0 can never appear in valid UTF-8. The same is true of 0xc1.
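
To make the arithmetic concrete, here is a minimal sketch. It assumes Python 3 (the question's session is Python 2), so `bytes([...])` stands in for concatenating `chr(...)` values. The helper names `naive_two_byte_decode` and `accepted_two_byte_leads` are made up for this illustration and are not part of any library.

```python
def naive_two_byte_decode(lead, trail):
    """Hypothetical helper: combine the payload bits of a two-byte
    sequence without checking that the result needs two bytes."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# 0xC0 contributes five zero payload bits, so only the trail byte matters.
code_point = naive_two_byte_decode(0xC0, 0xAF)
print(hex(code_point), repr(chr(code_point)))  # 0x2f '/'

# U+002F lies below U+0080, so its only valid encoding is the single byte 0x2F.
print(code_point < 0x80)  # True


def accepted_two_byte_leads():
    """Return the lead bytes in 0xC0..0xDF for which Python's strict
    decoder accepts every continuation byte 0x80..0xBF."""
    leads = []
    for lead in range(0xC0, 0xE0):
        try:
            for trail in range(0x80, 0xC0):
                bytes([lead, trail]).decode("utf-8")
        except UnicodeDecodeError:
            continue  # this lead byte is rejected
        leads.append(lead)
    return leads


leads = accepted_two_byte_leads()
print(hex(leads[0]), hex(leads[-1]))  # 0xc2 0xdf
```

Under these assumptions, the usable start bytes for two-byte sequences run from 0xC2 to 0xDF, not from 0xC0.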

Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has led to security vulnerabilities in the past.
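
For the original goal of producing random text of a given encoded length, one possible approach is to pick a random code point first and let Python encode it, rather than assembling bytes by hand. The sketch below is a suggestion, not the asker's method. It assumes Python 3, and the names `random_code_point` and `random_utf8` are hypothetical. Because the encoder never emits overlong forms, only the surrogate range has to be skipped explicitly.

```python
import random

# Code point ranges whose UTF-8 encoding takes 1, 2, 3, or 4 bytes.
RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}


def random_code_point(num_bytes, rng=random):
    """Pick a random code point that encodes to num_bytes bytes in UTF-8."""
    low, high = RANGES[num_bytes]
    while True:
        cp = rng.randint(low, high)
        # Surrogates U+D800..U+DFFF cannot be encoded in UTF-8, so retry.
        if not 0xD800 <= cp <= 0xDFFF:
            return cp


def random_utf8(num_bytes, rng=random):
    """Return the UTF-8 bytes of a random code point of the given length."""
    return chr(random_code_point(num_bytes, rng)).encode("utf-8")


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    for n in range(1, 5):
        data = random_utf8(n, rng)
        # Print hex so the output does not depend on the terminal encoding.
        print(n, data.hex())
```

Calling `random_utf8(n).decode("utf-8")` then yields a one-character string, which covers what `gen_utf8` was meant to return.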
