python - urllib.request.urlopen returns bytes, but I cannot decode them

This question already has answers here:

urllib2 opener providing wrong charset

(2 answers)

Closed 5 years ago.

I am trying to parse a web page using urlopen() from urllib.request, for example:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried to decode it, like this:

html = urlopen(req).read().decode("utf-8")

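However, this raises an error:

After some research I found a related answer that reads the charset from the response to decide how to decode. However, this page's response does not specify a charset, and when I inspected the page in the Chrome Web Inspector, its head contained the following line: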

<meta charset="utf-8">

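So why can't it be decoded as utf-8? And how can I parse the page successfully?

The site URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the images to disk.

Note that I am using Python 3.5.1. I also noticed that everything I wrote above works fine in my other scraping programs.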

Best answer

The content is gzip-compressed. You need to decompress it first:

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
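
Some background on why the plain decode fails: every gzip stream begins with the two bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, so .decode("utf-8") fails on the second byte of the compressed data. The snippet above assumes the body is always gzipped. As a rough sketch for scrapers that hit both compressed and uncompressed pages, the check can be made conditional. The helper name maybe_gunzip is hypothetical (it is not part of urllib or gzip), and the sample data is made up for illustration:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(raw):
    """Hypothetical helper: decompress raw only if it looks like gzip data."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Made-up sample bodies standing in for urlopen(req).read()
compressed = gzip.compress('<meta charset="utf-8">'.encode("utf-8"))
plain = b"plain text"

print(maybe_gunzip(compressed).decode("utf-8"))
print(maybe_gunzip(plain).decode("utf-8"))
```

With a real response this would be used as maybe_gunzip(urlopen(req).read()).decode("utf-8"). Checking the response's Content-Encoding header instead of the magic bytes is another reasonable option.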

If you use requests, it decompresses the content for you automatically:

import requests
html = requests.get(url).text  # => str, not bytes
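
requests also picks a text encoding for .text based on the response headers. If you stay with urllib, a similar effect can be approximated by reading the charset from the Content-Type header and falling back to a default when the server does not send one, as is the case for the page in the question. This is only a sketch: pick_charset and decode_body are hypothetical helper names, and the headers object is built by hand here to stand in for urlopen(req).headers, which is an email.message.Message subclass:

```python
from email.message import Message


def pick_charset(headers, default="utf-8"):
    """Hypothetical helper: charset from Content-Type, or a default."""
    # get_content_charset() returns None when no charset is declared
    return headers.get_content_charset() or default


def decode_body(raw, headers):
    """Hypothetical helper: decode bytes using the declared charset."""
    return raw.decode(pick_charset(headers), errors="replace")


# Hand-built headers standing in for urlopen(req).headers
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

print(pick_charset(headers))
print(decode_body("café".encode("latin-1"), headers))
```

For a compressed response, the body has to be decompressed before it is passed to a helper like decode_body.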

A similar question about "python - urllib.request.urlopen returns bytes, but I cannot decode them" can be found on Stack Overflow: https://stackoverflow.com/questions/35122232/