Problem description
After a lot of help from the SO community I have a Scrapy crawler which saves the webpage of the site it crawls, but I'd like to clean up the csv file that gets created with --output
A sample row currently looks like:
"[{'url': 'http://example.com/page', 'path': 'full/hashedfile', 'checksum': 'checksumvalue'}]",http://example.com/page,2016-06-20 16:10:24.824000,http://example.com/page,My Example Page
My items.py is as follows:
import scrapy

class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    crawldate = scrapy.Field()
    pageurl = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
My rules callback is
def scrape_page(self, response):
    page_soup = BeautifulSoup(response.body, "html.parser")
    ScrapedPageTitle = page_soup.title.get_text()
    item = MycrawlerItem()
    item['title'] = ScrapedPageTitle
    item['crawldate'] = datetime.datetime.now()
    item['pageurl'] = response.url
    # file_urls is what the files item pipeline reads; it then fills in the 'files' field
    item['file_urls'] = [response.url]
    yield item
In the crawler log it shows
2016-06-20 16:10:26 [scrapy] DEBUG: Scraped from <200 http://example.com/page>
{'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
'file_urls': ['http://example.com/page'],
'files': [{'checksum': 'checksumvalue',
'path': 'full/hashedfile',
'url': 'http://example.com/page'}],
'pageurl': 'http://example.com/page',
'title': u'My Example Page'}
The ideal structure for each csv line would be:
crawldate,file_url,file_path,title
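For instance, using the values from the log output above, a cleaned-up row might look something like this (the .html extension is just illustrative, since the current setup doesn't add one):

2016-06-20 16:10:24.824000,http://example.com/page,full/hashedfile.html,My Example Page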
If you want custom formats and such you probably want to just use good ol' scrapy item pipelines.
In the pipeline methods process_item or close_spider you can write your item to a file, like:
def process_item(self, item, spider):
    # only export when the spider is run with the csv flag
    if not getattr(spider, 'csv', False):
        return item
    # needs `import csv` at the top of pipelines.py
    with open('{}.csv'.format(spider.name), 'a') as f:
        writer = csv.writer(f)
        writer.writerow([item['crawldate'], item['title']])
    return item
This will write out a <spider_name>.csv file every time you run the spider with the csv flag, i.e. scrapy crawl twitter -a csv=True. You can make this more efficient if you open the file in the open_spider method and close it in close_spider, but it's the same thing otherwise.
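A minimal sketch of that more efficient variant, assuming the same csv flag and the same two columns as above (the CsvExportPipeline class name is just illustrative, not from the original answer):

import csv

class CsvExportPipeline(object):
    def open_spider(self, spider):
        # open the file once when the spider starts, only if the csv flag is set
        self.file = None
        if getattr(spider, 'csv', False):
            self.file = open('{}.csv'.format(spider.name), 'a')
            self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # reuse the already-open writer instead of reopening the file for every item
        if self.file:
            self.writer.writerow([item['crawldate'], item['title']])
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        if self.file:
            self.file.close()

Like any other pipeline it still has to be enabled via ITEM_PIPELINES in settings.py.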