Scrapy does not crawl all pages

This article looks at how to deal with Scrapy not crawling all pages; hopefully it is a useful reference for anyone hitting the same problem.

Problem description

Here is my working code:

from scrapy.item import Item, Field

class Test2Item(Item):
    title = Field()

from scrapy.http import Request
from scrapy.conf import settings
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
    DOWNLOAD_DELAY = 2

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        # Take the ad title from the page and trim surrounding whitespace.
        i['title'] = hxs.select('//div[@class="innerbox"]/h1/text()').extract()[0].strip(' \t\n\r')
        return i

It can scrape only 10 or 15 records, always a random number! I can't manage to get all the pages that match a pattern like http://www.khmer24.com/ad/any-words/67-anynumber.html

I really suspect that Scrapy finished crawling because of duplicate requests. It has been suggested to use dont_filter = True; however, I have no idea where to put it in my code.

I'm a newbie to Scrapy and really need help.

Recommended answer

1."他们建议使用 dont_filter = True 但是,我不知道将它放在我的代码中的什么位置."

1."They have suggested to use dont_filter = True however, I have no idea of where to put it in my code."

This argument belongs to Request. BaseSpider, which CrawlSpider inherits from (scrapy/spider.py), already sets it to True by default for the requests it builds from start_urls.
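For what it's worth, a minimal sketch of where that argument lives (this mirrors the default behaviour described above, so you normally don't need to add it yourself): dont_filter is a keyword argument of Request, and BaseSpider passes it when building the requests for start_urls, roughly like this:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    # Hypothetical spider, only to show where dont_filter goes.
    name = 'example'
    start_urls = ['http://www.khmer24.com/']

    def make_requests_from_url(self, url):
        # dont_filter=True tells the scheduler's duplicate filter not to
        # drop this request even if the URL has been seen before.
        return Request(url, dont_filter=True)

In other words, the duplicate filter is unlikely to be what stops the crawl here; the limiting factor is how many matching links the spider can discover, which point 2 below addresses.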

2. "It can scrape only 10 or 15 records."

Reason: the start_urls is not that good. In this case the spider starts crawling at http://www.khmer24.com/; let's assume it finds 10 URLs to follow that match the pattern. The spider then goes on to crawl those 10 URLs, but because those pages contain very few links that match the pattern, it gets only a few new URLs to follow (or even none), and the crawl stops.

Possible solution: the reason given above just restates icecrime's point, and so does the solution.

  • Use the 'All ads' page as start_urls. (You could also keep the home page as start_urls and use the new rules; a revised spider along those lines is sketched after the rules below.)

Rules:

rules = (
    # Extract links matching the "ad/any-words/67-anynumber.html" pattern
    # and parse them with the spider's parse_item method (do not follow them).
    # This rule is listed first: when several rules match the same link,
    # CrawlSpider uses the first one, so the ad pages still reach parse_item.
    Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),

    # Extract all remaining links and follow them
    # (no callback means follow=True by default;
    #  if "allow" is not given, the extractor matches every link).
    Rule(SgmlLinkExtractor()),
)
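Putting this together, here is a minimal sketch of the revised spider, keeping the question's item and selectors. The home page is used as the start URL because the 'All ads' page URL is not given in the question, and the import path for Test2Item is an assumption that should match your own project layout:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from test2.items import Test2Item  # assumed module path; adjust to your project

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    # Entry point: the home page; swap in the 'All ads' listing URL if you prefer.
    start_urls = ['http://www.khmer24.com/']

    rules = (
        # Ad detail pages: parse with parse_item (listed first so this rule
        # wins whenever both extractors match the same link).
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),
        # Everything else: just follow, so category and listing pages keep
        # feeding new links into the crawl.
        Rule(SgmlLinkExtractor()),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        i['title'] = hxs.select('//div[@class="innerbox"]/h1/text()').extract()[0].strip(' \t\n\r')
        return i

With the catch-all rule in place, the crawl is no longer limited to whatever ad links happen to appear on the start page, so the number of scraped records should grow well beyond 10 or 15.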

References: SgmlLinkExtractor, CrawlSpider example

That concludes this article on Scrapy not crawling all pages; hopefully the recommended answer above is helpful.
