本文介绍了信息:抓取 0 页(以 0 页/分钟),抓取 0 个项目(以 0 个项目/分钟)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚开始学习 Python 和 Scrapy.我的第一个项目是抓取包含网络安全信息的网站上的信息.但是当我使用 cmd 运行它时,它说抓取了 0 页(以 0 页/分钟),抓取了 0 个项目(以 0 个项目/分钟)"并且似乎没有任何结果.如果有人能解决我的问题,我将不胜感激.

I just began to learn Python and Scrapy.My first project is to crawl information on a website containing web security information. But when I run that using cmd, it says that "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and nothing seems to come out. I'd be grateful if someone kind could solve my problem.

我的代码:

import scrapy

class SapoSpider(scrapy.Spider):
name = "imo"
allowed_domains = ["imovirtual.com"]
start_urls = ["https://www.imovirtual.com/arrendar/apartamento/lisboa/"]

def parse(self,response):
    subpage_links = []
    for i in response.css('div.offer-item-details'):
        youritem = {
        'preco':i.css('span.offer-item title::text').extract_first(),
        'autor':i.css('li.offer-item-price::text').extract(),
        'data':i.css('li.offer-item-area::text').extract(),
        'data_2':i.css('li.offer-item-price-perm::text').extract()
        }
        subpage_link = i.css('header[class=offer-item-header] a::attr(href)').extract()
        subpage_links.extend(subpage_link)

        for subpage_link in subpage_links:
            yield scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item':youritem})

def parse_subpage(self,response):
    for j in response.css('header[class=offer-item-header] a::attr(href)'):
        youritem = response.meta.get('item')
        youritem['info'] = j.css(' ul.dotted-list, li.h4::text').extract()
        yield youritem

推荐答案

有两件事需要纠正以使其正常工作:

There are two things to correct to make it work:

  • 您需要使用要存储结果的路径定义 FEED_URI 设置

  • You need to define FEED_URI setting with the path you want to store the result

你需要在parse_subpage中使用response,因为逻辑如下:scrapy downloads "https://www.imovirtual.com/arrendar/apartamento/lisboa/" 并给出对parse的响应,您提取广告网址并要求scrapy下载每个页面并将下载的页面提供给parse_subpage.所以responseinparse_subpage`对应这个https://www.imovirtual.com/anuncio/t0-totalmente-remodelado-localizacao-excelente-IDGBAY.html#913474cdaa 例如

You need to use response in parse_subpage because the logic is the following: scrapy downloads "https://www.imovirtual.com/arrendar/apartamento/lisboa/" and gives the response toparse, you extract ads url and you ask scrapy to download each pages and give the downloaded pages toparse_subpage. Soresponseinparse_subpage` corresponds to this https://www.imovirtual.com/anuncio/t0-totalmente-remodelado-localizacao-excelente-IDGBAY.html#913474cdaa for example

这应该有效:

import scrapy


class SapoSpider(scrapy.Spider):
    name = "imo"
    allowed_domains = ["imovirtual.com"]
    start_urls = ["https://www.imovirtual.com/arrendar/apartamento/lisboa/"]
    custom_settings = {
        'FEED_URI': './output.json'
    }
    def parse(self,response):
        subpage_links = []
        for i in response.css('div.offer-item-details'):
            youritem = {
            'preco':i.css('span.offer-item title::text').extract_first(),
            'autor':i.css('li.offer-item-price::text').extract(),
            'data':i.css('li.offer-item-area::text').extract(),
            'data_2':i.css('li.offer-item-price-perm::text').extract()
            }
            subpage_link = i.css('header[class=offer-item-header] a::attr(href)').extract()
            subpage_links.extend(subpage_link)

            for subpage_link in subpage_links:
                yield scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item':youritem})

    def parse_subpage(self,response):
        youritem = response.meta.get('item')
        youritem['info'] = response.css(' ul.dotted-list, li.h4::text').extract()
        yield youritem

这篇关于信息:抓取 0 页(以 0 页/分钟),抓取 0 个项目(以 0 个项目/分钟)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-11 12:32