This article looks at how to catch an error that occurs inside a generator and then continue iterating; it may be a useful reference if you run into the same problem.

Problem description

I have an iterator which is supposed to run for several days. I want errors to be caught and reported, and then I want the iterator to continue. Or the whole process can start over.

Here's the function:

def get_units(self, scraper):
    units = scraper.get_units()
    i = 0
    while True:
        try:
            unit = units.next()
        except StopIteration:
            if i == 0:
                log.error("Scraper returned 0 units", {'scraper': scraper})
            break
        except:
            traceback.print_exc()
            log.warning("Exception occurred in get_units", extra={'scraper': scraper, 'iteration': i})
        else:
            yield unit
        i += 1

Because scraper could be one of many variants of code, it can't be trusted and I don't want to handle the errors there.

But when an error occurs in units.next(), the whole thing stops. I suspect it's because the iterator throws a StopIteration when one of its iterations fails.

Here's the output (only the last lines)

[2012-11-29 14:11:12 /home/amcat/amcat/scraping/scraper.py:135 DEBUG] Scraping unit <Element div at 0x4258c710>
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article Counter-Strike: Global Offensive Update Released
Traceback (most recent call last):
  File "/home/amcat/amcat/scraping/controller.py", line 101, in get_units
    unit = units.next()
  File "/home/amcat/amcat/scraping/scraper.py", line 114, in get_units
    for unit in self._get_units():
  File "/home/amcat/scraping/games/steamcommunity.py", line 90, in _get_units
    app_doc = self.getdoc(url,urlencode(form))
  File "/home/amcat/amcat/scraping/scraper.py", line 231, in getdoc
    return self.opener.getdoc(url, encoding)
  File "/home/amcat/amcat/scraping/htmltools.py", line 54, in getdoc
    response = self.opener.open(url, encoding)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
[2012-11-29 14:11:14 /home/amcat/amcat/scraping/controller.py:110 WARNING] Exception occurred in get_units

...code ends...

So how can I prevent the iteration from stopping when an error occurs?

EDIT: here's the code within get_units()

def get_units(self):
    """
    Split the scraping job into a number of 'units' that can be processed independently
    of each other.

    @return: a sequence of arbitrary objects to be passed to scrape_unit
    """
    self._initialize()
    for unit in self._get_units():
        yield unit

And here's a simplified _get_units():

INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document

    for a in doc.cssselect("div.discussion a"):
        link = a.get('href')
        yield link

EDIT: follow-up question: Alter each for-loop in a function to have error handling executed automatically after each failed iteration

Solution

StopIteration is raised by the next() method of a generator when there are no more items. It has nothing to do with errors inside the generator/iterator.
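
For instance, exhausting any iterator ends with StopIteration even when nothing has gone wrong (a minimal illustration, not from the original answer):

it = iter([1, 2])
next(it)  # 1
next(it)  # 2
next(it)  # raises StopIteration: the iterator is simply out of items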

Another thing to note is that, depending on the type of your iterator, it might not be able to resume after an exception. If the iterator is an object with a next method, it will work. However, if it's actually a generator, it won't.
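
To illustrate the difference, here is a small self-contained sketch (the Flaky class and flaky_gen function are hypothetical, not from the question's codebase; it runs on Python 2.7 or 3 via the next() builtin):

class Flaky(object):
    """An iterator object with a next method: it can resume after raising."""
    def __init__(self):
        self.i = 0
    def __iter__(self):
        return self
    def next(self):  # Python 2 iterator protocol
        self.i += 1
        if self.i == 2:
            raise ValueError("transient failure")  # only the 2nd step fails
        if self.i > 4:
            raise StopIteration
        return self.i
    __next__ = next  # Python 3 iterator protocol

def flaky_gen():
    """A generator: after an uncaught error it is finished for good."""
    for i in [1, 2, 3, 4]:
        if i == 2:
            raise ValueError("transient failure")
        yield i

for it in (Flaky(), flaky_gen()):
    results = []
    while True:
        try:
            results.append(next(it))
        except StopIteration:
            break
        except ValueError:
            results.append('error')
    print(results)

# prints [1, 'error', 3, 4]  -- the object resumes after the error
# prints [1, 'error']        -- the generator raises StopIteration forever after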

As far as I can tell, this is the only reason why your iteration doesn't continue after an error from units.next(). I.e. units.next() fails, and the next time you call it, it's not able to resume and it says it's done by throwing a StopIteration exception.

Basically you'd have to show us the code inside scraper.get_units() for us to understand why the loop is not able to continue after an error inside a single iteration. If get_units() is implemented as a generator function, it's clear. If not, it might be something else that's preventing it from resuming.

UPDATE: explaining what a generator function is:

class Scraper(object):
    def get_units(self):
        for i in some_stuff:
            bla = do_some_processing()
            bla *= 2  # random stuff
            yield bla

Now, when you call Scraper().get_units(), instead of running the entire function, it returns a generator object. Calling next() on it runs the body up to the first yield, and so on. Now if an error occurs ANYWHERE inside get_units, the generator is "tainted", so to speak, and the next time you call next(), it will raise StopIteration, just as if it had run out of items to give you.
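
A tiny demonstration of that "tainted" behaviour (again a hypothetical example, not code from the question):

def g():
    yield 1
    raise ValueError("boom")
    yield 2  # never reached

it = g()
print(next(it))  # prints 1
try:
    next(it)  # the ValueError propagates out of the generator here
except ValueError:
    pass
try:
    next(it)
except StopIteration:
    print("generator is finished; the second yield is never reached")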

Reading http://www.dabeaz.com/generators/ (and http://www.dabeaz.com/coroutines/) is strongly recommended.

UPDATE2: A possible solution https://gist.github.com/4175802
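
The gist's contents aren't reproduced here, but one fix consistent with this diagnosis is to move the error handling inside the generator's own loop, so the generator never reaches the "tainted" state and simply moves on to the next item. A possible sketch (not necessarily what the linked gist does), reusing the question's simplified _get_units() along with its module-level log and imported traceback; the placeholder comment marks where the failing per-unit work, such as the getdoc call from the traceback, would go:

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document
    for a in doc.cssselect("div.discussion a"):
        try:
            link = a.get('href')
            # ... per-unit work that might raise (e.g. further getdoc calls) ...
            yield link
        except Exception:
            # Handled inside the generator's loop, so the for-loop continues
            # with the next <a> instead of killing the whole multi-day run.
            traceback.print_exc()
            log.warning("Skipping a unit after an error")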
