Problem description
I am using the Scrapy framework to have spiders crawl some webpages. Basically, what I want is to scrape web pages and save them to a database. I have one spider per webpage, but I'm having trouble running these spiders so that one spider starts crawling only after another has finished. How can that be achieved? Is scrapyd the solution?
Recommended answer
scrapyd is indeed a good way to go. The max_proc or max_proc_per_cpu setting can be used to restrict the number of parallel spiders; you then schedule the spiders through scrapyd's REST API, for example:
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
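For reference, here is a minimal sketch of the relevant part of scrapyd.conf; the exact file location and values are assumptions to adapt to your deployment. With max_proc set to 1, scrapyd keeps scheduled jobs in its queue and runs only one crawl process at a time, which gives the "one spider after another" behaviour asked about:

[scrapyd]
# run at most one crawl process at a time; queued jobs wait their turn
max_proc = 1

With that in place, you can schedule each spider back to back and scrapyd will work through the queue sequentially, e.g. (the spider names below are hypothetical):

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider1
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2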