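I am trying to scrape different pieces of information from several pages of a website.
Up to the sixteenth page, everything works: the pages are crawled and scraped, and the information is stored in my database. After the sixteenth page, however, it stops scraping but keeps crawling.
I checked the website, and there are more than 470 pages with information. The HTML tags are the same, so I don't understand why it stops scraping.
My code: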

def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' %page_count
        url_list.append(url)
        page_count += 1
    return url_list

class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        pass

Shell output:
>     2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.active.com/running?page=15>
>     {
>      'nom_evenement': 'Enniscrone 10k run & 5k run/walk',
>      }
>     2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
>     --------------------------------------------------
>                     SCRAPING DES ELEMENTS EVENTS
>     --------------------------------------------------
>     2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
>     2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
>     --------------------------------------------------
>                     SCRAPING DES ELEMENTS EVENTS
>     --------------------------------------------------
>     2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
>     --------------------------------------------------
>                     SCRAPING DES ELEMENTS EVENTS
>     --------------------------------------------------
>     2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)

Best answer

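This is probably because only 17 pages contain the content you are looking for, while you are instructing Scrapy to visit all 480 pages of the form https://www.active.com/running?page=NNN. A better approach is to check, on each page you visit, whether there is a next page, and only in that case yield a Request for the next page.
So I would refactor your code to something like this (untested):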

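import scrapy
from scrapy.loader import ItemLoader
# In recent Scrapy releases the processors live in the itemloaders package
# (older releases exposed them as scrapy.loader.processors).
from itemloaders.processors import Join, MapCompose

# ActiveItem and `string` come from the asker's project and are not shown here.

class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    base_url = 'https://www.active.com/running'
    start_urls = [base_url]

    def parse(self, response):
        # one item per event link found on the current page
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        # check for next page link; request the next page only if one exists
        if response.xpath('//a[contains(@class, "next-page")]'):
            next_page = response.meta.get('page_number', 1) + 1
            # base_url is a class attribute, so it must be reached through self
            next_page_url = '{}?page={}'.format(self.base_url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})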
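
As a possible variation on the code above, the next page URL could be derived from the URL of the current response instead of carrying a counter in `meta`. The sketch below uses only the standard library; the helper name `next_page_url` and its `default_page` parameter are hypothetical and not part of Scrapy or of the original answer, and it assumes the site keeps paginating through a `page` query parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, default_page=1):
    """Return current_url with its `page` query parameter incremented by one.

    A URL without a `page` parameter is treated as page `default_page`
    (an assumption about how the site numbers its first page).
    """
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    page = int(query.get("page", [default_page])[0])
    query["page"] = [str(page + 1)]
    # doseq=True because parse_qs stores every value as a list
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

Inside `parse`, this would presumably be used as `yield scrapy.Request(next_page_url(response.url), callback=self.parse)`, still guarded by the check for a next-page link so that the spider stops on the last page.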

Regarding "python - Scrapy stops scraping but continues to crawl", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48406852/

10-10 17:05