This article walks through how to deal with a memory leak in Scrapy. It should be a useful reference for anyone hitting the same problem; follow along below.

Problem description

I wrote the following code to scrape for email addresses (for testing purposes):

import scrapy
# note: the scrapy.contrib paths below are from older Scrapy releases;
# newer versions expose these as scrapy.spiders and scrapy.linkextractors
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from crawler.items import EmailItem

class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    # follow every extracted link and run process_item on each response
    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        # collect every email-looking string on the page
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items

Unfortunately, it seems that references to the Requests are not released properly: according to the scrapy telnet console, the number of live Requests grows by about 5k/s. After roughly 3 minutes and 10k scraped pages, my system starts swapping (8GB RAM). Does anyone have an idea what is wrong? I already tried removing the refer field and "copying" the string using

emailitem['email'] = ''.join(email)

without success. After scraping, the items are saved into a BerkeleyDB that counts their occurrences (using pipelines), so the references should be gone after that.
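The pipeline itself is not shown in the post. Purely for illustration, a minimal sketch of such a counting pipeline might look like the code below; the class name, the database file name, and the use of the bsddb3 package are assumptions, not part of the original code:

# Hypothetical sketch only; not taken from the original post.
# Assumes the bsddb3 package; class and file names are made up for illustration.
import bsddb3

class EmailCountPipeline(object):
    def open_spider(self, spider):
        # open (or create) a Berkeley DB hash file holding the counters
        self.db = bsddb3.hashopen('email_counts.bdb', 'c')

    def process_item(self, item, spider):
        key = item['email'].encode('utf-8')
        try:
            count = int(self.db[key]) + 1
        except KeyError:
            count = 1
        self.db[key] = str(count).encode('utf-8')
        return item

    def close_spider(self, spider):
        self.db.close()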

What would be the difference between returning a list of items and yielding each item separately?
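For reference, the yielding variant being asked about would look roughly like this (same names as the spider above; only the return style changes), so items reach the pipelines one at a time instead of as one list per page:

    def process_item(self, response):
        refer = response.url
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            yield emailitem  # hand each item to the pipelines as soon as it is built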

After quite a while of debugging I found out that the Requests are not freed, so I end up with:

$> nc localhost 6023
>>> prefs()
Live References
Request 10344   oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>

which is in fact the start URL. Does anybody know what the problem is? Where is the reference to the Request object that I am missing?

Edit 2:

After running for ~12 hours on a server (with 64GB RAM), the RAM used is ~16GB (measured with ps, even if ps is not the right tool for this). The problem is that the number of crawled pages is dropping significantly, and the number of scraped items has stayed at 0 for hours:

INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)

I did the objgraph analysis, which results in the following graph (thanks @Artur Gaspar):

[objgraph back-reference graph for the oldest live Request]

It seems that I cannot influence it?
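For anyone reproducing this analysis, such a back-reference graph can be produced from the telnet console with something along these lines (assuming objgraph and Graphviz are installed; the output filename is arbitrary):

# run inside the Scrapy telnet console (nc localhost 6023)
import objgraph
from scrapy.utils.trackref import get_oldest

r = get_oldest('Request')  # oldest live Request tracked by scrapy.utils.trackref
# draw the chain of objects that still hold a reference to it
objgraph.show_backrefs([r], max_depth=5, filename='request_backrefs.png')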

Recommended answer

The final answer for me was to use a disk-based queue in conjunction with a working directory passed as a runtime parameter.

This means adding the following to settings.py:

# Note: in newer Scrapy versions these classes live in the scrapy.squeues module.
DEPTH_PRIORITY = 1                                          # breadth-first crawl order
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'  # keep pending requests on disk
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

Afterwards, starting the crawler with the following command line makes the state persistent in the given directory:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}

See the Scrapy docs for details.
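With a JOBDIR set, the scheduler serializes pending requests to disk instead of holding them all in memory. The job directory ends up looking roughly like this (names as documented for Scrapy's job persistence; details can vary between versions):

crawls/{spidername}/
    requests.queue/   # the on-disk FIFO queues of pending requests
    requests.seen     # request fingerprints used by the duplicate filter
    spider.state      # persisted spider state, if the spider uses it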

The additional benefit of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days, occupying ~15GB of memory (file cache for the disk FIFO queues).
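For completeness, pausing and resuming works as described in the Scrapy docs: stop the crawl with a single Ctrl-C (or SIGTERM), wait for the clean shutdown, then re-run the exact same command to pick up where it left off:

# press Ctrl-C once and wait for the spider to finish shutting down,
# then resume later with the very same command:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}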

That concludes this article on memory leaks in Scrapy. We hope the recommended answer is helpful; thanks for your support!
