This article covers the case where Beautiful Soup emits a warning and then an error partway through a script, and how to deal with it. It should be a useful reference for anyone running into the same problem.

Problem description


I am iterating through every Wikipedia page that deals with a date (January 1, January 2, ..., December 31). On each page, I am taking out the names of people who have a birthday on that day. However, halfway through my code (April 27), I receive this warning:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Then, I get an error right away:

Traceback (most recent call last):
    File "wikipedia.py", line 29, in <module>
        section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'


Basically, I can't figure out why, after I get all the way to April 27, it decides to throw this warning and error. Here is the April 27 page: http://en.wikipedia.org/wiki/April_27


From what I can tell, nothing on that page is different in a way that would cause this. There is still a span with id="Births".


Here's my code where I call all that stuff:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site,headers=hdr)    
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    section = soup.find('span', id='Births').parent
    births = section.find_next('ul').find_all('li')

    for x in births:
        #All the regex and parsing, don't think it's necessary to show


The error is thrown on the line that reads:

section = soup.find('span', id='Births').parent


I do have a lot of information by the time I get to April 27 (8 lists of ~35,000 elements each), but I don't think that would be the issue. If anyone has any ideas, I'd appreciate it. Thanks.

Recommended answer


It looks like the Wikipedia server is providing that page gzipped:

>>> page.info().get('Content-Encoding')
'gzip'


It's not supposed to do that without an Accept-Encoding header in your request, but, well, that's life when working with other people's servers.


There are a lot of sources out there showing how to work with gzipped data - here's one: http://www.diveintopython.net/http_web_services/gzip_compression.html


And here's another: Does python urllib2 will automaticly uncompress gzip data from fetch webpage (http://stackoverflow.com/questions/3947120/does-python-urllib2-will-automaticly-uncompress-gzip-data-from-fetch-webpage)
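
Not part of the original answer, but here is a minimal sketch of one way to apply it to the code from the question, assuming Python 2 with urllib2, the standard-library gzip and StringIO modules, and BeautifulSoup 4 (the variable names follow the question's code):

import gzip
import urllib2
from StringIO import StringIO

from bs4 import BeautifulSoup

# Fetch one of the date pages (April 27 is the one that failed in the question)
site = "http://en.wikipedia.org/wiki/April_27"
req = urllib2.Request(site, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(req)
data = page.read()

# urllib2 does not decompress gzipped responses, so do it by hand when needed
if page.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO(data)).read()

soup = BeautifulSoup(data)

# Guard against find() returning None instead of crashing with AttributeError
births_span = soup.find('span', id='Births')
if births_span is not None:
    births = births_span.parent.find_next('ul').find_all('li')
    print len(births), "list items under the Births heading"
else:
    print "No span with id='Births' found on", site

The None check is optional, but it turns a hard crash into a message you can act on while the rest of the parsing stays the same.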

That's all for this article on Beautiful Soup emitting a warning and then an error partway through the code. I hope the recommended answer is helpful, and thank you for your support!
