I'm trying to write a script that generates random Unicode by creating random UTF-8 encoded strings and then decoding those to Unicode. It works fine for a single byte, but with two bytes it fails.
For instance, if I run the following in a Python shell:
    >>> a = str()
    >>> a += chr(0xc0) + chr(0xaf)
    >>> print a.decode('utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
According to the UTF-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, as 0xc0 starts with 110 and 0xaf starts with 10.
Here's my Python script:

    import random

    class GenFuzz(object):
        def unicode(self):
            '''returns a random (astral) utf encoded byte string'''
            num_bytes = random.randint(1, 4)
            if num_bytes == 1:
                return self.gen_utf8(num_bytes, 0x00, 0x7F)
            elif num_bytes == 2:
                return self.gen_utf8(num_bytes, 0xC0, 0xDF)
            elif num_bytes == 3:
                return self.gen_utf8(num_bytes, 0xE0, 0xEF)
            elif num_bytes == 4:
                return self.gen_utf8(num_bytes, 0xF0, 0xF7)

        def gen_utf8(self, num_bytes, start_val, end_val):
            byte_str = list()
            byte_str.append(random.randrange(start_val, end_val)) # start byte
            for i in range(0, num_bytes-1):
                byte_str.append(random.randrange(0x80, 0xBF)) # trailing bytes
            a = str()
            sum = int()
            for b in byte_str:
                a += chr(b)
            ret = a.decode('utf-8')
            return ret

    if __name__ == "__main__":
        g = GenFuzz()
        print g.gen_utf8(2, 0xC0, 0xDF)

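This is, indeed, invalid UTF-8. In UTF-8, only code points in the range U+0080 to U+07FF, inclusive, can be encoded using two bytes. Read the Wikipedia article more closely, and you will see the same thing. A two-byte sequence starting with 0xc0 or 0xc1 could only carry a code point below U+0080 (0xc0 0xaf would stand for U+002F, "/"), and such code points must be encoded as a single byte instead. As a result, the byte 0xc0 may not appear in UTF-8, ever. The same is true of 0xc1.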
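To make the arithmetic concrete, here is a small sketch. It is written for Python 3 rather than the Python 2 used in the question, and `naive_two_byte_decode` is a hypothetical helper, not a library function. It combines the payload bits of a two-byte sequence without any validity checks, which shows where the lead bytes 0xc0 and 0xc1 would land:

```python
def naive_two_byte_decode(lead, trail):
    """Combine the payload bits of a 2-byte sequence WITHOUT validity checks.

    Hypothetical helper for illustration only; a conforming decoder must
    reject overlong forms instead of computing a code point for them.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


print(hex(naive_two_byte_decode(0xC0, 0xAF)))  # 0x2f, i.e. "/"
print(hex(naive_two_byte_decode(0xC1, 0xBF)))  # 0x7f, the largest value reachable from 0xc1
print(hex(naive_two_byte_decode(0xC2, 0x80)))  # 0x80, the first real two-byte code point

# Python's strict decoder refuses the overlong form outright.
try:
    bytes([0xC0, 0xAF]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Everything reachable from 0xc0 or 0xc1 falls at or below U+007F, so the smallest usable two-byte lead byte is 0xc2.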
Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has led to security vulnerabilities in the past.
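For the original goal of generating random Unicode, one way to avoid invalid byte sequences altogether is to pick a random code point and let the codec do the encoding, rather than assembling bytes by hand. The following is only a sketch under stated assumptions: it targets Python 3, and `random_unicode_char` and its `CODE_POINT_RANGES` table are names invented here, not part of the question's `GenFuzz` class. Longer sequences have restrictions similar to the two-byte case (surrogates U+D800 to U+DFFF and anything above U+10FFFF are not encodable), which is why working from code points is simpler.

```python
import random

# Code point range covered by each UTF-8 encoded length (inclusive bounds).
CODE_POINT_RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}


def random_unicode_char(num_bytes, rng=random):
    """Return one random character whose UTF-8 encoding is num_bytes long."""
    low, high = CODE_POINT_RANGES[num_bytes]
    while True:
        code_point = rng.randint(low, high)  # randint includes both ends
        # Surrogates are not valid scalar values and cannot be encoded.
        if not 0xD800 <= code_point <= 0xDFFF:
            return chr(code_point)


if __name__ == "__main__":
    ch = random_unicode_char(2)
    print(repr(ch), ch.encode("utf-8"))
```

Note that `random.randrange(a, b)` in the question excludes `b`, so the upper bytes of each range (for example 0xBF as a trailing byte) were never generated; `randint` is used here because it is inclusive.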