Problem Description
In this code I want to scrape the title, subtitle and data inside the links, but I'm having issues on pages beyond 1 and 2: only 1 item gets scraped. I also want to extract only those entries whose title is "delhivery".
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from delhivery.items import DelhiveryItem
class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2"]

    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhiveryItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span[@style="background-color:yellow"]/text()').extract()[0]
            #item['title'] = site.xpath('.//td[@class="complaint"]/a[text() = "%s Delivery Courier %s"]/text()').extract()[0]
            item['subtitle'] = site.xpath('.//td[@class="compl-text"]/div/b[1]/text()').extract()[0]
            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[0].strip()
            item['username'] = site.xpath('.//td[@class="small"]/a[2]/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)
            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        old_item['data'] = response.xpath('.//td[@style="padding-bottom:15px"]/div/text()').extract()[0]
        yield old_item
Accepted Answer
You need to change item['title'] to this:
item['title'] = ''.join(site.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract()[0])
Also change sites to the following, so that only the required links (the ones containing Delhivery) are extracted:
sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')
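Before wiring these XPaths into the spider, it can help to sanity-check them interactively with scrapy shell, which ships with Scrapy. A quick session against one of the search pages from the question might look like this (the exact matches depend on the live page markup):

scrapy shell "http://www.consumercomplaints.in/?search=delhivery&page=2"

# inside the shell: how many listing blocks mention Delhivery?
>>> len(response.xpath('//table//span[text()="Delhivery"]/ancestor::div'))
# and what title text does the proposed expression pull out?
>>> response.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract()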
So I understand now that you need to add a pagination rule to your code. It should be something like this; you just need to add your imports and write the new XPaths against the item's own page, such as this one:
class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Extracting pages, allowing only links with page=number to be extracted
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]', ), allow=('page=\d+', ), unique=True), follow=True),
        # Extract links of items on each page the spider gets from the first rule
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="complaint"]', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = DelhiveryItem()
        # Populate the item object here the same way you did before; this function will be called for each item link.
        # This means that you'll be extracting data from pages like this one:
        # http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html#c1880509
        item['title'] = response.xpath('<write xpath>').extract()[0]
        item['subtitle'] = response.xpath('<write xpath>').extract()[0]
        item['date'] = response.xpath('<write xpath>').extract()[0].strip()
        item['username'] = response.xpath('<write xpath>').extract()[0]
        item['link'] = response.url
        item['data'] = response.xpath('<write xpath>').extract()[0]
        yield item
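Note that the scrapy.contrib import paths and SgmlLinkExtractor used above belong to old Scrapy releases and were removed in later versions; on Scrapy 1.x and newer the same spider can be written with scrapy.linkextractors.LinkExtractor (which deduplicates links by default). A minimal sketch of the equivalent imports and rules, behaviour otherwise unchanged:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Follow pagination links (page=<number>) inside the pagelinks div
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]',), allow=(r'page=\d+',)), follow=True),
        # Extract the complaint links on each listing page
        Rule(LinkExtractor(restrict_xpaths=('//td[@class="complaint"]',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Same extraction logic as above; fill in the XPaths for the detail page.
        ...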
Also, I suggest that when you write an XPath you don't use any styling parameters; try to use @class or @id, and only fall back to @width, @style or other styling params if it's the only way.
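For example, a selector keyed on presentation breaks the moment the site tweaks its inline styles, while one keyed on a semantic attribute usually survives (the class name below is hypothetical, purely to illustrate the contrast):

# Brittle: depends on an exact inline style value
response.xpath('//td[@style="padding-bottom:15px"]/div/text()')

# More robust: anchored on a class name, if the page provides one (hypothetical here)
response.xpath('//td[@class="complaint-body"]/div/text()')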