How to specify different process settings for two different spiders in CrawlerProcess (Scrapy)?

Problem Description

I have two spiders which I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output; but if the second spider finishes crawling first, it does not wait for the first spider to complete. How can I fix this?

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    # per-spider output and log files
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name + '.json'
    setting['LOG_FILE'] = spider_name + '.log'
    # a new CrawlerProcess is created on every iteration
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")

Recommended Answer

According to the Scrapy docs, using a single CrawlerProcess for multiple spiders should look like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)   # register both crawls before starting the reactor
process.crawl(Spider2)
process.start()          # blocks until both spiders have finished

Setting settings on a per-spider basis can be done using the custom_settings spider attribute:
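For example, a minimal sketch of the docs-style example above with per-spider feed settings (the spider names, URLs, and file names here are illustrative, not taken from the original question):

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://example.com']
    # applied when the Crawler for this spider is created,
    # so each spider gets its own feed
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider1.json',
    }

    def parse(self, response):
        yield {'url': response.url}

class Spider2(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://example.org']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider2.json',
    }

    def parse(self, response):
        yield {'url': response.url}

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()

(On Scrapy 2.1 and later the FEEDS setting supersedes FEED_URI/FEED_FORMAT, e.g. 'FEEDS': {'spider1.json': {'format': 'json'}}.)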

However, Scrapy has a group of modules whose settings can't be applied on a per-spider basis (only per CrawlerProcess): modules that use logging, the SpiderLoader, and Twisted-reactor-related settings are already initialized before Scrapy reads a spider's custom_settings.
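In practice that means process-wide settings such as logging have to be passed to CrawlerProcess itself rather than to custom_settings; a minimal sketch (the file name is illustrative):

from scrapy.crawler import CrawlerProcess

# logging is configured when CrawlerProcess is constructed, so LOG_* values
# placed in a spider's custom_settings arrive too late to take effect
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
    'LOG_FILE': 'crawl.log',  # one log file shared by all spiders in this process
})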

When you call scrapy crawl .... from the command-line tool, you in fact create a single CrawlerProcess for the single spider defined in the command arguments.

"The process terminates as soon as the second spider completes execution."

If these are the same spider versions you previously launched with scrapy crawl ..., this is not expected.
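Putting the two points together, one way to get a separate feed per spider while still running them in parallel in a single CrawlerProcess is the sketch below. It reuses the spider_loader loop from the question and gives up per-spider log files, since LOG_FILE is process-wide; assigning custom_settings this way also overwrites anything the spider class already defines there.

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
settings['LOG_LEVEL'] = 'INFO'   # process-wide; a single log for all spiders
process = CrawlerProcess(settings)

for spider_name in process.spider_loader.list():
    spidercls = process.spider_loader.load(spider_name)
    # per-spider feed settings; read when crawl() creates the Crawler
    spidercls.custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': spider_name + '.json',
    }
    process.crawl(spidercls)

process.start()  # blocks until every registered spider has finished
print("Completed")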

