How do I get the failed URLs in Scrapy?

Question

I'm new to Scrapy, and it's an amazing crawler framework!

In my project I sent more than 90,000 requests, and some of them failed. I set the log level to INFO, so I can see some statistics but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report, for example one that lists the failed URLs? Thanks!

Recommended answer

Yes, this is possible.

The code below adds a failed_urls list to a basic spider class and appends a URL to it if the response status for that URL is 404 (this would need to be extended to cover other error statuses as required).
Next, I added a handler that joins the list into a single string and adds it to the spider's stats when the spider is closed.
Based on your comments, it's possible to track Twisted errors, and some of the other answers give examples of how to handle that particular use case.
The code has been updated to work with Scrapy 1.8. All credit for this should go to Juliano Mendieta, since all I did was add his suggested edits and confirm that the spider worked as intended.

from scrapy import Spider, signals

class MySpider(Spider):
    # Let 404 responses reach parse() instead of being filtered out.
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run handle_spider_closed when the spider finishes.
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        # Store the collected URLs as one comma-separated string in the stats.
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    # Note: process_exception is a downloader middleware hook, so Scrapy does not
    # call it when it is defined on a spider. The downloader/exception_* stats in
    # the example output come from Scrapy's built-in downloader stats middleware.
    def process_exception(self, request, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
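
The spider above only records 404 responses that reach parse(). One way to also catch other error statuses and download errors is to give each request an errback. The following is a minimal sketch of that idea, not part of the original answer: the mixin name FailedUrlErrbackMixin, the method record_failure and the attribute failed_by_reason are made up for illustration. It assumes that Scrapy attaches the originating request to the failure, and that a filtered non-2xx response arrives as an error carrying the response on failure.value.

```python
from collections import defaultdict


class FailedUrlErrbackMixin:
    """Hypothetical mixin for a scrapy.Spider subclass (illustrative name)."""

    def record_failure(self, failure):
        """Errback: remember the URL and a short reason for a failed request."""
        # Created lazily so the mixin does not need its own __init__.
        if not hasattr(self, "failed_by_reason"):
            self.failed_by_reason = defaultdict(list)

        # An HTTP error (a filtered non-2xx response) carries the response on
        # failure.value; download errors such as DNS failures do not.
        response = getattr(failure.value, "response", None)
        if response is not None:
            url, reason = response.url, "http_%d" % response.status
        else:
            # Assumption: Scrapy attaches the originating request to the failure.
            url, reason = failure.request.url, failure.type.__name__

        self.failed_by_reason[reason].append(url)
```

To use it, the spider would inherit from both the mixin and Spider and pass errback=self.record_failure to every Request it yields. Requests generated from start_urls get no errback by default, so they would have to be built by hand. Statuses listed in handle_httpstatus_list still go to the normal callback rather than the errback.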

Example output (note that the downloader/exception_count* stats only appear if exceptions are actually thrown; I simulated them by running the spider after turning off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
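
The downloader/exception_* counters in the output above say how many exceptions occurred, but not which URLs triggered them. A downloader middleware is the place where Scrapy calls process_exception, so a small one can record those URLs in the stats. This is a sketch under assumptions rather than the original author's code: the class name FailedUrlMiddleware and the stat key failed_exception_urls are invented here.

```python
class FailedUrlMiddleware:
    """Hypothetical downloader middleware (class name and stat key are illustrative)."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds the middleware through this hook and passes the crawler.
        return cls(crawler.stats)

    def process_exception(self, request, exception, spider):
        # Called when downloading `request` raised `exception`.
        ex_class = "%s.%s" % (type(exception).__module__, type(exception).__name__)
        failed = self.stats.get_value("failed_exception_urls", [])
        failed.append("%s (%s)" % (request.url, ex_class))
        self.stats.set_value("failed_exception_urls", failed)
        # Returning None lets Scrapy continue its normal handling, such as retries.
        return None
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, using whatever module path the class lives at in your project. Its position relative to Scrapy's retry middleware affects whether it sees every failed attempt or only those that were not retried, so that is worth checking against your Scrapy version.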
