Question
I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
But how do I schedule all spiders in a project at once?
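One way to do this without touching the project code at all is to drive scrapyd's documented JSON API directly: listspiders.json enumerates a project's spiders, and schedule.json queues each one. A rough standard-library sketch, assuming the default scrapyd address and a placeholder project name:

```python
import json
import urllib.parse
import urllib.request

SCRAPYD = 'http://localhost:6800'  # assumed default scrapyd address


def parse_spider_list(json_text):
    """Extract spider names from a listspiders.json response body."""
    payload = json.loads(json_text)
    if payload.get('status') != 'ok':
        raise RuntimeError('scrapyd error: %s' % payload)
    return payload['spiders']


def build_schedule_request(project, spider, base=SCRAPYD):
    """Build the POST request that schedule.json expects."""
    data = urllib.parse.urlencode({'project': project, 'spider': spider})
    return urllib.request.Request(base + '/schedule.json', data.encode('ascii'))


def schedule_all(project, base=SCRAPYD):
    """Queue every spider registered for the given project."""
    url = '%s/listspiders.json?project=%s' % (base, project)
    with urllib.request.urlopen(url) as resp:
        spiders = parse_spider_list(resp.read().decode('utf-8'))
    for name in spiders:
        urllib.request.urlopen(build_schedule_request(project, name))

# schedule_all('myproject')  # queues everything in one call
```

This is the equivalent of running the curl command above once per spider, just scripted.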
All help greatly appreciated!
Accepted Answer
My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.
您的项目名称/commands/allcrawl.py:
# Note: this is the original answer's code; it targets Python 2 and an
# older Scrapy release (scrapy.command, scrapy.log, urllib/urllib2).
from scrapy.command import ScrapyCommand
import urllib
import urllib2

from scrapy import log


class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        for s in self.crawler.spiders.list():
            # POST one project/spider pair per spider to scrapyd
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response)
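On Python 3 and current Scrapy the moving parts have been renamed: urllib and urllib2 merged into urllib.parse and urllib.request, the command base class now lives at scrapy.commands.ScrapyCommand, and scrapy.log gave way to the standard logging module. A minimal sketch of just the request-building step under Python 3 (spider and project names are placeholders):

```python
import urllib.parse
import urllib.request

# Python 3 equivalent of the urllib.urlencode / urllib2.Request pair above.
values = {'project': 'YOUR_PROJECT_NAME', 'spider': 'spider2'}
data = urllib.parse.urlencode(values).encode('ascii')  # POST bodies must be bytes
req = urllib.request.Request('http://localhost:6800/schedule.json', data)
# urllib.request.urlopen(req) would then submit the job to scrapyd
```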
Make sure to include the following in your settings.py
COMMANDS_MODULE = 'YOURPROJECTNAME.commands'
Then from the command line (in your project directory) you can simply type
scrapy allcrawl