Problem description
I have a website for which my crawler needs to follow a sequence. So, for example, it needs to go through a1, b1, c1 before it starts on a2, and so on. Each of a, b and c is handled by a different parse function, and the corresponding URLs are created in Request objects and yielded. The following roughly illustrates the code I'm using:
from scrapy import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):
    def parse(self, response):
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()
However, I find that the sequence of crawls seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange, since I thought Scrapy was supposed to guarantee depth-first order.
The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c as soon as it can, before five of the level-b pages get crawled. How can this be achieved?
Depth-first searching is exactly what you are describing: search as deep as possible before moving on to b.
To change Scrapy to do breadth-first searching (a1, b1, c1, a2, etc.), change these settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
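Note that these module paths come from older Scrapy releases; in current versions the queue classes live in the scrapy.squeues module (plural) rather than scrapy.squeue. As for what DEPTH_PRIORITY does: Scrapy's depth middleware shifts each request's priority by its depth, which is roughly the effect modeled in this simplified standalone sketch (not Scrapy's actual middleware code):

```python
def adjusted_priority(priority: int, depth: int, depth_priority: int) -> int:
    # Scrapy's DepthMiddleware subtracts depth * DEPTH_PRIORITY from a
    # request's priority. With DEPTH_PRIORITY = 1, deeper requests end up
    # with LOWER priority, so the scheduler tends toward breadth-first;
    # a negative value biases the crawl toward depth-first instead.
    return priority - depth * depth_priority

# With DEPTH_PRIORITY = 1, a depth-2 request sinks below a depth-1 one:
print(adjusted_priority(0, 1, 1))  # -1
print(adjusted_priority(0, 2, 1))  # -2
```

This is why the FIFO queue settings are paired with DEPTH_PRIORITY = 1: together they make shallower requests come out of the scheduler first.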
*Found in the docs.scrapy.org FAQ.
This concludes the article on "Scrapy doesn't seem to be doing DFO"; we hope the answer above is helpful.