Problem Description
I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output. However, if the second spider finishes crawling first, it does not wait for the first spider to complete. How can I fix this?
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name + '.json'
    setting['LOG_FILE'] = spider_name + '.log'
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")
Recommended Answer
According to the Scrapy docs, using a single CrawlerProcess for multiple spiders should look like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()
Settings on a per-spider basis can be applied using the custom_settings spider attribute.
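A minimal sketch of that approach for the case in the question follows. The spider names, start URLs, and parse callbacks are placeholders, and FEED_FORMAT/FEED_URI mirror the settings used in the question (newer Scrapy versions prefer the FEEDS setting instead):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Spider1(scrapy.Spider):
    # Placeholder name and start URL, for illustration only.
    name = 'spider1'
    start_urls = ['http://example.com']

    # Per-spider feed settings: each spider writes its own JSON file.
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider1.json',
    }

    def parse(self, response):
        yield {'url': response.url}

class Spider2(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://example.org']

    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider2.json',
    }

    def parse(self, response):
        yield {'url': response.url}

# Both crawls are scheduled on the same process before start() is called,
# so start() blocks until both spiders have finished.
process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()

Because there is only one process and one reactor, the order in which the spiders finish no longer matters.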
Scrapy has a group of modules that can't be set on a per-spider basis (only per CrawlerProcess): modules that use Logging, SpiderLoader, and Twisted reactor related settings are already initialized before Scrapy reads a spider's custom_settings.
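To illustrate the distinction, here is a sketch of how such process-wide settings (e.g. LOG_LEVEL, LOG_FILE) can be set; the shared log file name is only an assumption for the example:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Logging-related settings take effect when the CrawlerProcess is created,
# so they must be set here; putting them in a spider's custom_settings
# would be too late.
settings = get_project_settings()
settings.set('LOG_LEVEL', 'INFO')
settings.set('LOG_FILE', 'all_spiders.log')  # assumed shared log file name

process = CrawlerProcess(settings)
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)
process.start()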
When you call scrapy crawl .... from the command-line tool, you in fact create a single CrawlerProcess for the single spider defined in the command arguments.
"the process terminates as soon as the second spider completes execution"
If you used the same spider versions previously launched by scrapy crawl..., this behavior is not expected.