This article covers the case where Beautiful Soup emits a warning and then an error partway through a script, and how to deal with it. It should be a useful reference for anyone running into the same problem.

Problem description


I am iterating through every Wikipedia page that deals with a date (January 1, January 2, ..., December 31). On each page, I am taking out the names of people who have a birthday on that day. However, halfway through my code (April 27), I receive this warning:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Then, I get an error right away:

Traceback (most recent call last):
    File "wikipedia.py", line 29, in <module>
        section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'


Basically, I can't figure out why, after I get all the way to April 27, it decides to throw this warning and error. Here is the April 27 page: http://en.wikipedia.org/wiki/April_27


From what I can tell, nothing on that page is different in a way that would cause this. There is still a span with id="Births".


Here's my code where I call all that stuff:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site,headers=hdr)    
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    section = soup.find('span', id='Births').parent
    births = section.find_next('ul').find_all('li')

    for x in births:
        #All the regex and parsing, don't think it's necessary to show


The error is thrown on the line that reads:

section = soup.find('span', id='Births').parent


I do have a lot of information by the time I get to April 27 (8 lists of ~35,000 elements each), but I don't think that would be the issue. If anyone has any ideas, I'd appreciate it. Thanks.

Recommended answer


It looks like the Wikipedia server is providing that page gzipped:

>>> page.info().get('Content-Encoding')
'gzip'


It's not supposed to do that without an Accept-Encoding header in your request, but, well, that's life when working with other people's servers.


There are a lot of sources out there showing how to work with gzipped data - here's one: http://www.diveintopython.net/http_web_services/gzip_compression.html


And here's another: Does python urllib2 will automaticly uncompress gzip data from fetch webpage (http://stackoverflow.com/questions/3947120/does-python-urllib2-will-automaticly-uncompress-gzip-data-from-fetch-webpage)
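
Not part of the original answer, but here is a minimal sketch of one way to apply it to the code from the question, assuming Python 2 with urllib2, the standard-library gzip and StringIO modules, and BeautifulSoup 4 (the variable names follow the question's code):

import gzip
import urllib2
from StringIO import StringIO

from bs4 import BeautifulSoup

# Fetch one of the date pages (April 27 is the one that failed in the question)
site = "http://en.wikipedia.org/wiki/April_27"
req = urllib2.Request(site, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(req)
data = page.read()

# urllib2 does not decompress gzipped responses, so do it by hand when needed
if page.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO(data)).read()

soup = BeautifulSoup(data)

# Guard against find() returning None instead of crashing with AttributeError
births_span = soup.find('span', id='Births')
if births_span is not None:
    births = births_span.parent.find_next('ul').find_all('li')
    print len(births), "list items under the Births heading"
else:
    print "No span with id='Births' found on", site

The None check is optional, but it turns a hard crash into a message you can act on while the rest of the parsing stays the same.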

That's all for this article on Beautiful Soup emitting a warning and then an error partway through the code. I hope the recommended answer is helpful, and thank you for your support!
