python - How to set a depth limit for a crawler

I am using this spider to crawl a page and download its images:

import re
from urllib.parse import urljoin  # Python 3 home of urljoin (was the urlparse module)

import scrapy
# scrapy.contrib.spiders / scrapy.contrib.linkextractors are the pre-1.0 import paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from imgur.items import ImgurItem


class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['some.page']

    start_urls = ['some.page']

    # Follow every link and send each followed page to parse_imgur
    rules = [Rule(LinkExtractor(allow=['.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = 'a'

        # Collect .jpg paths and resolve them against the page URL
        relative_urls = re.findall(r'= "([^"]+\.jpg)', response.text)
        image['image_urls'] = [urljoin(response.url, url) for url in relative_urls]

        return image

But I have two problems here. The first is that I cannot limit the crawl depth to one when I run the spider with "-s DEPTH_LIMIT=1":

scrapy crawl imgur -s DEPTH_LIMIT=1
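
As a complement to the command-line flag, here is a minimal sketch of setting the same limit in the project's settings module. The file name settings.py assumes the default Scrapy project layout, which the question does not show. With DEPTH_LIMIT at 1, responses for the start URLs count as depth 0, so only links found on those pages are still requested.

```python
# settings.py (assumed: default Scrapy project layout)

# Maximum depth to crawl; 0 (the default) means no limit.
DEPTH_LIMIT = 1

# Optional: record the number of requests seen at each depth in the crawl stats.
DEPTH_STATS_VERBOSE = True
```

The same value can also go in a custom_settings = {'DEPTH_LIMIT': 1} class attribute on the spider, which overrides the project settings for that spider only.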

The second problem is that I get the images from every page of the site except the main page:

I don't get any images from that page.

Edit.

@Javitronxo

Like this:

def parse(self, response):
    image = ImgurItem()
    image['title'] = 'a'

    # Collect .jpg paths and resolve them against the page URL
    relative_urls = re.findall(r'= "([^"]+\.jpg)', response.text)
    image['image_urls'] = [urljoin(response.url, url) for url in relative_urls]

    return image

I don't get any images that way.

Best answer

Since your code has the following rule:

rules = [Rule(LinkExtractor(allow=['.*']), 'parse_imgur')]

the spider extracts every link from the page, so it ends up following all of them.

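If you only want to scrape the images on the main page, I suggest removing the rule and changing the method signature so that it overrides the default parse: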

def parse(self, response):

This way the spider will scrape the images on the pages listed in start_urls, return the item, and finish.

For "python - How to set a depth limit for a crawler", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35130610/