Get HTML using Python requests

This article walks through how to get a page's HTML with Python's requests module; the question and answer below may be a useful reference if you run into the same problem.

Problem Description

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:

>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

Instead of the basic html that is the source for this page, I get:

>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...

>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?

Solution

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects it is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
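
You can see the truncated header block from requests itself. As a minimal sketch (assuming the server still misbehaves in the same way), only the headers before the stray doctype line should survive in r.headers on Python 3:

>>> import requests
>>> from pprint import pprint
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> pprint(dict(r.headers))   # expect only Date and Server; Content-Encoding was dropped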

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it. Or you could, if the stream weren't itself rather incomplete.
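
If you did want to decode the raw bytes yourself, zlib's decompressobj can unwrap a gzip stream and, unlike gzip.decompress, hands back whatever it managed to inflate even when the stream is truncated. A rough sketch, not part of the original answer:

import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip wrapper around the deflate data
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
html = decompressor.decompress(r.content)   # returns as much as could be inflated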

The work-around is to tell the server not to bother with compression:

headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

and an uncompressed response is returned.
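
Put together, a self-contained version of the workaround might look like this; the final print is just an illustrative sanity check, not part of the original answer:

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

# Ask for an uncompressed body so the broken header block no longer matters
r = requests.get(url, headers={'Accept-Encoding': 'identity'})
print(r.text[:60])   # should start with the page's doctype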

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so on Python 2 requests decodes the content for you, as expected.
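
If you need code that behaves the same on both Python versions, one defensive option (a sketch under the assumption that the body may or may not have been decoded already) is to check for the gzip magic bytes yourself:

import zlib

body = r.content
if body[:2] == b'\x1f\x8b':   # gzip magic number: requests did not decode it
    body = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(body)
html = body.decode(r.encoding or 'utf-8', errors='replace')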

That concludes this look at getting HTML using Python requests; hopefully the answer above is helpful.
