I'm following an incomplete tutorial here. I believe I have the same code as the tutorial, but my scraper only scrapes the first page, then logs the following message about my first Request to the next page, and finishes. Is my second yield statement in the wrong place?


  DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET https://newyork.craigslist.org/search/egr?s=120>

  2017-05-20 18:21:31 [scrapy.core.engine] INFO: Closing spider (finished)


Here is my code:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["https://newyork.craigslist.org/search/egr"]
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first("")
            absolute_url = response.urljoin(relative_url)

            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        # scrape all pages
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        next_page_absolute_url = response.urljoin(next_page_relative_url)

        yield Request(next_page_absolute_url, callback=self.parse)

Best Answer

OK, so I figured it out. I had to change this line:

allowed_domains = ["https://newyork.craigslist.org/search/egr"]


to this:

allowed_domains = ["newyork.craigslist.org"]


It works now.
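
For context, Scrapy's OffsiteMiddleware compares each request's hostname against the entries in allowed_domains, which are expected to be bare domain names. A full URL like "https://newyork.craigslist.org/search/egr" never matches a hostname, so every follow-up Request was dropped as offsite. Putting the fix together, here is a minimal sketch of the corrected spider; the None check before the next-page Request is an extra guard I've added (it is not in the original code), since extract_first() returns None on the last results page:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    # Bare domain only: OffsiteMiddleware matches request hostnames
    # against these entries, so a full URL would never match.
    allowed_domains = ["newyork.craigslist.org"]
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        for job in response.xpath('//p[@class="result-info"]'):
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            absolute_url = response.urljoin(job.xpath('a/@href').extract_first(""))
            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        # Extra guard (assumption, not in the original): the last results
        # page has no "next" button, so extract_first() returns None there.
        next_page = response.xpath('//a[@class="button next"]/@href').extract_first()
        if next_page is not None:
            yield Request(response.urljoin(next_page), callback=self.parse)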

Regarding python - Scrapy spider not scraping past the first page, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44088922/
