I have a spider that fetches URLs from a Redis list.
I want to shut the spider down gracefully when there are no more URLs to crawl. I tried raising the CloseSpider
exception, but it never seems to be reached:
def start_requests(self):
    while True:
        item = json.loads(self.__pop_queue())
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except ValueError:
            continue
Even though I am raising the CloseSpider exception, I still get the following error:
root@355e42916706:/scrapper# scrapy crawl general -a country=my -a log=file
2017-07-17 12:05:13 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/scrapper/scrapper/spiders/GeneralSpider.py", line 20, in start_requests
item = json.loads(self.__pop_queue())
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
I also tried catching the TypeError in the same function, but that didn't work either.
Is there a recommended way to handle this?
Thanks
Best Answer
You need to check whether self.__pop_queue() actually returned something before handing it to json.loads() (or catch the TypeError around that call). In your version, json.loads() runs before the empty check, so when the queue is exhausted and __pop_queue() returns None, the TypeError is raised before the CloseSpider line is ever reached. For example:
def start_requests(self):
    while True:
        item = self.__pop_queue()
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            item = json.loads(item)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except (ValueError, TypeError):  # just in case 'item' is not a string or buffer
            continue
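Your __pop_queue() isn't shown, but for reference, here is a minimal sketch of what such a helper often looks like with redis-py. The class name, list key, and connection settings below are assumptions, not your actual code:

import redis
import scrapy

class GeneralSpider(scrapy.Spider):
    name = 'general'
    redis_key = 'general:start_urls'  # hypothetical Redis list key

    def __init__(self, *args, **kwargs):
        super(GeneralSpider, self).__init__(*args, **kwargs)
        # Connection settings are placeholders; adjust to your setup.
        self.__db = redis.StrictRedis(host='localhost', port=6379)

    def __pop_queue(self):
        # LPOP returns the next element of the list, or None when the
        # list is empty -- which is exactly why the empty check must
        # happen before json.loads() is called.
        return self.__db.lpop(self.redis_key)

With a helper like this, the pattern above works because lpop signals an empty queue with None, which the if not item check then turns into CloseSpider instead of letting json.loads() blow up with a TypeError.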
A similar question about gracefully closing a Scrapy spider when there are no URLs to crawl can be found on Stack Overflow: https://stackoverflow.com/questions/45143947/