This article explains how to split Scrapy's large CSV output into smaller files. It should be a useful reference for anyone facing the same problem; follow along below!
Problem Description
Is it possible to make Scrapy write to CSV files with no more than 5000 rows in each one? How can I give it a custom naming scheme? Am I supposed to modify CsvItemExporter?
Recommended Answer
Try this pipeline:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import datetime

from scrapy.exporters import CsvItemExporter


class MyPipeline(object):
    def __init__(self, stats):
        self.stats = stats
        self.base_filename = "result/amazon_{}.csv"
        self.next_split = self.split_limit = 50000  # assuming you want to split 50000 items/csv
        self.create_exporter()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        # Stamp each file with the current date and time so successive
        # splits get distinct names (minute resolution)
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y%m%d%H%M")
        self.file = open(self.base_filename.format(datetime_stamp), 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # Once the scraped-item count reaches the next threshold, close
        # the current file and open a fresh exporter for the next chunk
        if self.stats.get_value('item_scraped_count', 0) >= self.next_split:
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter()
        self.exporter.export_item(item)
        return item
Don't forget to add the pipeline to your settings:
ITEM_PIPELINES = {
'myproject.pipelines.MyPipeline': 300,
}
That concludes this article on splitting Scrapy's large CSV files. We hope the recommended answer is helpful. Thanks for your support!