Problem description
I have written a spider with Scrapy to collect reviews from a product page like this one. I only want product reviews back to a certain date (2 July 2016 in this case). As soon as a review's date is earlier than that date, I want to close the spider and return the list of items. The spider works well, but I cannot get it to close when the condition is met: if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually. Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import FlipkartProductReviewsItem
import re as r
import unicodedata
from datetime import datetime


class Freviewspider(CrawlSpider):
    name = "frs"
    allowed_domains = ["flipkart.com"]

    def __init__(self, *args, **kwargs):
        super(Freviewspider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nav_bar_next_prev"]')), callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//div[@class="fclear fk-review fk-position-relative line "]')
        items = []

        for i in titles:
            item = FlipkartProductReviewsItem()

            # x-paths:
            title_xpath = "div[2]/div[1]/strong/text()"
            review_xpath = "div[2]/p/span/text()"
            date_xpath = "div[1]/div[3]/text()"

            # field-values-extraction:
            item["date"] = (''.join(i.xpath(date_xpath).extract())).replace('\n ', '')
            item["title"] = (''.join(i.xpath(title_xpath).extract())).replace('\n ', '')

            review_list = i.xpath(review_xpath).extract()
            temp_list = []
            for element in review_list:
                temp_list.append(element.replace('\n ', '').replace('\n', ''))
            item["review"] = ' '.join(temp_list)

            xxx = datetime.strptime(item["date"], '%d %b %Y ')
            comp_date = datetime.strptime('02 Jul 2016 ', '%d %b %Y ')
            if xxx > comp_date:
                items.append(item)
            else:
                break

        return items
Recommended answer
To force the spider to close, raise the CloseSpider exception, as described in the Scrapy docs. Just be sure to yield your items before you raise the exception; a plain return would exit the callback before the raise is ever reached.
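As a rough illustration of that advice, the date check from the question could be moved into a generator that yields each recent review and raises CloseSpider at the first one on or before the cutoff. This is only a sketch under assumptions: the helper name `reviews_after_cutoff`, the reason string `"cutoff_date_reached"`, and the plain-dict reviews are made up for the example, and the `ImportError` fallback is a stand-in added so the snippet can be tried without Scrapy installed.

```python
from datetime import datetime

try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in (assumption) so the sketch can run without Scrapy installed.
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__(reason)
            self.reason = reason

DATE_FORMAT = "%d %b %Y"
CUTOFF = datetime.strptime("02 Jul 2016", DATE_FORMAT)


def reviews_after_cutoff(reviews, cutoff=CUTOFF):
    """Yield reviews newer than `cutoff`, then raise CloseSpider.

    `reviews` is assumed to be an iterable of dict-like items with a
    "date" string such as "05 Jul 2016", ordered newest first.
    """
    for review in reviews:
        review_date = datetime.strptime(review["date"].strip(), DATE_FORMAT)
        if review_date <= cutoff:
            # Reviews yielded before this point have already been handed over.
            raise CloseSpider("cutoff_date_reached")
        yield review
```

Inside the spider, the callback would build its items as before and then hand them over with `yield from reviews_after_cutoff(items)` instead of collecting them in a list and returning it. Because the items are yielded one by one, the ones produced before the exception are not lost.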