Problem Description
I have an iterator which is supposed to run for several days. I want errors to be caught and reported, and then I want the iterator to continue. Or the whole process can start over.
Here's the function:
def get_units(self, scraper):
    units = scraper.get_units()
    i = 0
    while True:
        try:
            unit = units.next()
        except StopIteration:
            if i == 0:
                log.error("Scraper returned 0 units", {'scraper': scraper})
            break
        except:
            traceback.print_exc()
            log.warning("Exception occurred in get_units", extra={'scraper': scraper, 'iteration': i})
        else:
            yield unit
            i += 1
Because scraper could be one of many variants of code, it can't be trusted and I don't want to handle the errors there. But when an error occurs in units.next(), the whole thing stops. I suspect that's because an iterator throws a StopIteration when one of its iterations fails.
Here's the output (only the last lines)
[2012-11-29 14:11:12 /home/amcat/amcat/scraping/scraper.py:135 DEBUG] Scraping unit <Element div at 0x4258c710>
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article Counter-Strike: Global Offensive Update Released
Traceback (most recent call last):
File "/home/amcat/amcat/scraping/controller.py", line 101, in get_units
unit = units.next()
File "/home/amcat/amcat/scraping/scraper.py", line 114, in get_units
for unit in self._get_units():
File "/home/amcat/scraping/games/steamcommunity.py", line 90, in _get_units
app_doc = self.getdoc(url,urlencode(form))
File "/home/amcat/amcat/scraping/scraper.py", line 231, in getdoc
return self.opener.getdoc(url, encoding)
File "/home/amcat/amcat/scraping/htmltools.py", line 54, in getdoc
response = self.opener.open(url, encoding)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
[2012-11-29 14:11:14 /home/amcat/amcat/scraping/controller.py:110 WARNING] Exception occurred in get_units
...code ends...
So how can I prevent the iteration from stopping when an error occurs?
EDIT: here's the code within get_units()
def get_units(self):
    """
    Split the scraping job into a number of 'units' that can be processed independently
    of each other.

    @return: a sequence of arbitrary objects to be passed to scrape_unit
    """
    self._initialize()
    for unit in self._get_units():
        yield unit
And here's a simplified _get_units():
INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document
    for a in doc.cssselect("div.discussion a"):
        link = a.get('href')
        yield link
EDIT: question followup: Alter each for-loop in a function to have error handling executed automatically after each failed iteration
Recommended Answer

StopIteration is raised by the next() method of a generator when there is no next item anymore. It has nothing to do with errors inside the generator/iterator.
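A quick illustration of the normal protocol (a minimal sketch):

it = iter([1, 2])
it.next()  # -> 1
it.next()  # -> 2
it.next()  # raises StopIteration: the iterator is simply exhausted; no error occurred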
Another thing to note is that, depending on the type of your iterator, it might not be able to resume after an exception. If the iterator is an object with a next method, it will work. However, if it's actually a generator, it won't.
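For example, a hand-written iterator object keeps its position across a failed call, so iteration can pick up where it left off (a minimal sketch; the ResumableUnits name and the failure condition are invented for illustration):

class ResumableUnits(object):
    """An iterator object with a next() method: its position survives
    a raised exception, so the following call simply continues."""
    def __init__(self, urls):
        self.urls = urls
        self.pos = 0

    def __iter__(self):
        return self

    def next(self):  # would be named __next__ in Python 3
        if self.pos >= len(self.urls):
            raise StopIteration
        url = self.urls[self.pos]
        self.pos += 1
        if url == "bad":
            raise ValueError("failed on %s" % url)
        return url

units = ResumableUnits(["a", "bad", "c"])
units.next()  # -> 'a'
# units.next() raises ValueError here...
# ...yet one more units.next() still returns 'c': the iterator resumed.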
As far as I can tell, this is the only reason why your iteration doesn't continue after an error from units.next(). I.e. units.next() fails, and the next time you call it, it's not able to resume and it says it's done by throwing a StopIteration exception.
Basically you'd have to show us the code inside scraper.get_units() for us to understand why the loop is not able to continue after an error inside a single iteration. If get_units() is implemented as a generator function, it's clear. If not, it might be something else that's preventing it from resuming.
UPDATE: explaining what a generator function is:
class Scraper(object):
    def get_units(self):
        for i in some_stuff:
            bla = do_some_processing()
            bla *= 2  # random stuff
            yield bla
Now, when you call Scraper().get_units(), instead of running the entire function, it returns a generator object. Calling next() on it will take the execution to the first yield. And so on. Now if an error occurs ANYWHERE inside get_units, it will be tainted, so to speak, and the next time you call next(), it will raise StopIteration, just as if it had run out of items to give you.
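A minimal demonstration of that "tainted" behavior (the names are illustrative):

def units():
    yield "a"
    raise ValueError("boom")  # error inside the generator body
    yield "b"                 # never reached

g = units()
g.next()  # -> 'a'
try:
    g.next()  # raises ValueError
except ValueError:
    pass
g.next()  # raises StopIteration, not 'b': the generator is permanently finished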
Reading http://www.dabeaz.com/generators/ (and http://www.dabeaz.com/coroutines/) is strongly recommended.
UPDATE2: A possible solution https://gist.github.com/4175802
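The gist itself isn't reproduced here; one fix consistent with the explanation above is to catch exceptions inside the generator body, so they never propagate out and kill the generator. A sketch based on the question's simplified _get_units() (assuming the per-unit work is what fails; an error in the index fetch itself would still end the generator):

import traceback

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # a failure here still terminates the generator
    for a in doc.cssselect("div.discussion a"):
        try:
            link = a.get('href')
            # ... any other per-unit work that might fail ...
        except Exception:
            traceback.print_exc()
            continue  # skip the broken unit; the generator stays alive
        yield link

With errors handled before they can escape the generator, the outer while True loop in the question no longer sees a premature StopIteration.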