Question
My Hacker News spider outputs all the results on one line, instead of one per line, as can be seen here.
Here is my code.
import scrapy
import string
import urlparse
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors import LinkExtractor
class HnItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    score = scrapy.Field()

class HnSpider(scrapy.Spider):
    name = 'hackernews'
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ["https://news.ycombinator.com/"]

    def parse(self, response):
        sel = response
        selector_list = response.xpath('.//table[@class="itemlist"]')
        for sel in selector_list:
            item = HnItem()
            item['title'] = sel.xpath('.//td[@class="title"]/text()').extract()
            item['link'] = sel.xpath('.//tr[@class="athing"]/td[3]/a/@href').extract()
            item['score'] = sel.xpath('.//td[@class="subtext"]/span/text()').extract()
            yield item
And here is my settings.py file:
BOT_NAME = 'hnews'
SPIDER_MODULES = ['hnews.spiders']
NEWSPIDER_MODULE = 'hnews.spiders'
USER_AGENT = 'hnews (+http://www.yourdomain.com)'
FEED_URI = '/used/scrapy/hnews/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'
I've tried to implement this among many other solutions, but no luck so far. I'm still very new at this, so please bear with me.
Answer
It is happening because your item pipeline is getting all the lists at once. For example, item['title'] gets a list of every title on the page; that single item is then passed to the item pipeline and written to the CSV file directly.
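Concretely, each .extract() call returns a list of every match on the page, so the spider yields one item shaped roughly like this (an illustrative sketch with made-up values, not actual output):

{'title': ['First story', 'Second story', ...],
 'link': ['http://example.com/1', 'http://example.com/2', ...],
 'score': ['100 points', '42 points', ...]}

The CSV feed exporter writes one row per item, which is why everything collapses onto a single line.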
The solution is to iterate over the lists and yield items to the pipeline one at a time. Here's the modified code:
import scrapy
from scrapy.selector import Selector

class HnItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    score = scrapy.Field()

class HnSpider(scrapy.Spider):
    name = 'hackernews'
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ["https://news.ycombinator.com/"]

    def parse(self, response):
        sel = Selector(response)
        # Each XPath returns a flat list; [:-2] trims the trailing
        # non-story entries (e.g. the "More" link) from the titles.
        title_list = sel.xpath('.//td[@class="title"]/a/text()').extract()[:-2]
        link_list = sel.xpath('.//tr[@class="athing"]/td[3]/a/@href').extract()
        score_list = sel.xpath('.//td[@class="subtext"]/span/text()').extract()
        # Build a fresh item per index so each story is yielded (and
        # written to the CSV) as its own row.
        for x in range(0, len(title_list)):
            item = HnItem()
            item['title'] = title_list[x]
            item['link'] = link_list[x]
            item['score'] = score_list[x]
            yield item
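As a small optional variant (not part of the original answer), Python's built-in zip() keeps the three lists aligned without manual indexing, and it stops at the shortest list, which avoids an IndexError if the lists come back with different lengths:

    def parse(self, response):
        sel = Selector(response)
        title_list = sel.xpath('.//td[@class="title"]/a/text()').extract()[:-2]
        link_list = sel.xpath('.//tr[@class="athing"]/td[3]/a/@href').extract()
        score_list = sel.xpath('.//td[@class="subtext"]/span/text()').extract()
        # zip() pairs the nth title with the nth link and the nth score.
        for title, link, score in zip(title_list, link_list, score_list):
            item = HnItem()
            item['title'] = title
            item['link'] = link
            item['score'] = score
            yield item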