Why aren't my input/output processors in Scrapy working?

Problem description

I am trying to follow this tutorial.

I want my desc field to be a single string, normalized to single spaces and in uppercase.

dmoz_spider.py

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

I tried declaring input and output processors on the field, following http://doc.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors

items.py

import scrapy
from scrapy.loader.processors import MapCompose, Join

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field(
        input_processor=MapCompose(
            lambda x: ' '.join(x.split()),
            lambda x: x.upper()
        ),
        output_processor=Join()
    )
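As a plain-Python sketch of what these processors are intended to do (no Scrapy involved; the sample strings are made up), MapCompose applies each function to every extracted string in turn, and Join() then concatenates the results with single spaces:

```python
# Illustrative stand-ins for the declared processors (not Scrapy code).
raw = ['  Text\r\n\t processing ', 'in   Python ']  # made-up extracted values

# Input-processor step: collapse whitespace, then uppercase, per value.
mapped = [' '.join(s.split()).upper() for s in raw]

# Output-processor step: join the collected values with a single space.
desc = ' '.join(mapped)
print(desc)  # TEXT PROCESSING IN PYTHON
```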

However, my output still turns out like this:

{'desc': ['\r\n\t\r\n                                ',
          ' \r\n'
          '\t\t\t\r\n'
          '                                - By David Mertz; Addison Wesley. '
          'Book in progress, full text, ASCII format. Asks for feedback. '
          '[author website, Gnosis Software, Inc.]\r\n'
          '                                \r\n'
          '                                ',
          '\r\n                                '],
 'link': ['http://gnosis.cx/TPiP/'],
 'title': ['Text Processing in Python']}

What am I doing wrong?

I'm using Python 3.5.1 and Scrapy 1.1.0.

I put up my entire code here: https://github.com/prashcr/scrapy_tutorial, so that you can try and modify it as you wish.

Recommended answer

I suspect the documentation is misleading/wrong (or possibly out of date), because, according to the source code, the input_processor field attribute is read only inside an ItemLoader instance, which means that you need to use an Item Loader anyway.

You can use the built-in one and leave your DmozItem definition as is:

from scrapy.loader import ItemLoader

class DmozSpider(scrapy.Spider):
    # ...

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = ItemLoader(DmozItem(), selector=sel)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()

This way, the input_processor and output_processor item field arguments are taken into account and the processors are applied.
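To illustrate the order of operations, here is a hypothetical miniature of the loader's processor flow (plain Python, not Scrapy's actual implementation; the sample strings are made up): the input processor runs on each value as it is added via add_xpath, and the output processor runs once over the collected values when load_item() is called.

```python
# Hypothetical miniature of ItemLoader's processor flow (not Scrapy code).

def mini_load(raw_values, input_proc, output_proc):
    collected = [input_proc(v) for v in raw_values]  # add_xpath stage
    return output_proc(collected)                    # load_item() stage

desc = mini_load(
    ['\r\n\t  A Byte of\r\n Python  ', '  free   book '],
    input_proc=lambda s: ' '.join(s.split()).upper(),
    output_proc=lambda vals: ' '.join(vals),
)
print(desc)  # A BYTE OF PYTHON FREE BOOK
```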

Or you can define the processors inside a custom Item Loader instead of the Item class:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class MyItemLoader(ItemLoader):
    # Field-specific processors: <field>_in and <field>_out attributes
    desc_in = MapCompose(
        lambda x: ' '.join(x.split()),
        lambda x: x.upper()
    )
    desc_out = Join()

And use it to load items in your spider:

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        loader = MyItemLoader(DmozItem(), selector=sel)
        loader.add_xpath('title', 'a/text()')
        loader.add_xpath('link', 'a/@href')
        loader.add_xpath('desc', 'text()')
        yield loader.load_item()

