This article explains how to run more than one spider, one at a time. It should be a useful reference for anyone facing the same problem.

Problem Description

I am using the Scrapy framework to make spiders crawl some webpages. Basically, what I want is to scrape web pages and save them to a database. I have one spider per webpage. But I am having trouble running those spiders so that one spider starts crawling only after another spider finishes crawling. How can that be achieved? Is scrapyd the solution?

Recommended Answer

scrapyd is indeed a good way to go. The max_proc or max_proc_per_cpu configuration can be used to restrict the number of parallel spiders; you can then schedule spiders through the scrapyd REST API, for example:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
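As a minimal sketch, assuming a default scrapyd installation, the scrapyd.conf settings below limit scrapyd to a single running spider process, so queued jobs execute one after another:

[scrapyd]
# Run at most one spider process at a time so queued jobs execute sequentially.
max_proc         = 1
max_proc_per_cpu = 1

You can then queue every spider from a short script instead of calling curl by hand. The project and spider names below are placeholders for illustration; with max_proc = 1, each queued job starts only after the previous one finishes:

import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"   # default scrapyd endpoint
PROJECT = "myproject"                                  # placeholder project name
SPIDERS = ["somespider", "anotherspider"]              # placeholder spider names, one per webpage

for spider in SPIDERS:
    # Each POST queues one crawl job; scrapyd runs the queued jobs one by one.
    response = requests.post(SCRAPYD_URL, data={"project": PROJECT, "spider": spider})
    print(spider, response.json().get("status"))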

That concludes this article on running more than one spider, one at a time. We hope the recommended answer is helpful, and thank you for your support!
