How to tidy up Scrapy's CSV output when using the Files item pipeline

This article looks at how to tidy up Scrapy's CSV output when using the Files item pipeline. The question and answer below should be a useful reference if you are facing the same problem.

Problem description

After a lot of help from the SO community I have a Scrapy crawler which saves the webpages of the site it crawls, but I'd like to clean up the CSV file that gets created with --output.

A sample row currently looks like:

"[{'url': 'http://example.com/page', 'path': 'full/hashedfile', 'checksum': 'checksumvalue'}]",http://example.com/page,2016-06-20 16:10:24.824000,http://example.com/page,My Example Page

How can I get the CSV file to contain the details for one file per line (with no extra url: entry), and have the path value include an extension such as .html or .txt?

My items.py is as follows:

class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    crawldate = scrapy.Field()
    pageurl = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()

My rules callback is:

def scrape_page(self, response):
    page_soup = BeautifulSoup(response.body, "html.parser")
    ScrapedPageTitle = page_soup.title.get_text()
    item = MycrawlerItem()
    item['title'] = ScrapedPageTitle
    item['crawldate'] = datetime.datetime.now()
    item['pageurl'] = response.url
    item['file_urls'] = [response.url]
    yield item

In the crawler log it shows:

2016-06-20 16:10:26 [scrapy] DEBUG: Scraped from <200 http://example.com/page>
{'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
 'file_urls': ['http://example.com/page'],
 'files': [{'checksum': 'checksumvalue',
            'path': 'full/hashedfile',
            'url': 'http://example.com/page'}],
 'pageurl': 'http://example.com/page',
 'title': u'My Example Page'}

The ideal structure for each CSV line would be:

crawldate, file_urls, files path, title
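Using the values from the log entry above, that would mean one row per downloaded file, roughly along these lines (the .html suffix standing in for whatever extension the file should get):

2016-06-20 16:10:24.824000,http://example.com/page,full/hashedfile.html,My Example Page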

Solution

If you want custom formats and such, you probably want to just use good ol' Scrapy item pipelines.

In the pipeline methods process_item or close_spider you can write your item to a file, for example:

def process_item(self, item, spider):
    if not getattr(spider, 'csv', False):
        return item
    with open('{}.csv'.format(spider.name), 'a') as f:
        writer = csv.writer(f)
        writer.writerow([item['crawldate'], item['title']])
    return item

This will write out a <spider_name>.csv file every time you run the spider with the csv flag, i.e. scrapy crawl twitter -a csv=True
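Note that a custom item pipeline only runs once it is enabled in the project's settings.py. A minimal sketch, assuming the process_item method above lives in a class called CsvWriterPipeline inside mycrawler/pipelines.py (the class name, module path, priority numbers, and FILES_STORE value are illustrative, not part of the original answer):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'mycrawler.pipelines.CsvWriterPipeline': 300,
}
FILES_STORE = 'downloaded_files'  # directory where FilesPipeline saves the fetched pages

Keeping FilesPipeline at a lower number than the CSV pipeline means the pages are downloaded and item['files'] is filled in before the CSV row gets written.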

You can make this more efficient if you open the file in the pipeline's open_spider method and close it in close_spider, but otherwise it's the same thing.
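As a rough sketch of that suggestion, combined with the one-row-per-file layout asked for above, a pipeline could hold a single file handle open for the whole crawl and write one row per entry in item['files'] (the CsvExportPipeline name and the exact columns are assumptions for illustration; Python 3 is assumed for the newline='' argument):

import csv


class CsvExportPipeline(object):

    def open_spider(self, spider):
        # Open the output file once per crawl instead of once per item.
        self.file = open('{}.csv'.format(spider.name), 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['crawldate', 'file_urls', 'files path', 'title'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # FilesPipeline stores one dict per downloaded file in item['files'],
        # so emit one CSV row per entry rather than dumping the whole list
        # into a single cell.
        for file_info in item.get('files', []):
            self.writer.writerow([
                item['crawldate'],
                file_info['url'],
                file_info['path'],
                item['title'],
            ])
        return item

For item['files'] to be populated by the time process_item runs, this pipeline has to come after FilesPipeline in ITEM_PIPELINES, as in the settings sketch above.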

That's all for how to tidy up Scrapy's CSV output when using the Files item pipeline. Hopefully the answer above is helpful.
