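While working through the book 《python爬虫开发与项目实践》 (Python Crawler Development and Project Practice), I tried its CrawlSpider example and found that it scraped no data at all when I ran it. Here is the source code as I typed it: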
import scrapy
from UseScrapyProject.items import UsescrapyprojectItem
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy import Selector
from scrapy.utils.project import get_project_settings


class SpiderUserCrawlSpider(CrawlSpider):
    name = "secondSpider"
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://www.cnblogs.com/qiyeboy/default.html?page=1']
    links = LinkExtractor(allow="/qiyeboy/default.html?page=\d{1,}")
    # rules is an attribute that CrawlSpider adds on top of scrapy.Spider
    rules = (
        Rule(link_extractor=links, follow=True, callback="parse_item"),
    )

    # Callback for the Rule
    def parse_item(self, response):
        papers = response.xpath(".//*[@class='day']")
        # Extract data from each post
        for paper in papers:
            date = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()[0]
            title = paper.xpath(".//*[@class='postTitle']/a/text()").extract()[0]
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()[0]
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()[0]
            # Put the extracted data into a structured item
            item = UsescrapyprojectItem(date=date, title=title, content=content, url=url)
            request = scrapy.Request(url=url, callback=self.parse_body)
            request.meta['item'] = item  # stash the item on the request
            yield request
        next_page = Selector(response).re(u'<a href="(\S*)">下一页</a>')
        if next_page:  # if there is a next page
            # Return a request: url is the address of the next page and callback is
            # the callback function (here the next page's response goes back to
            # parse so that extraction continues)
            yield scrapy.Request(url=next_page[0], callback=self.parse)

    def parse_body(self, response):
        item = response.meta['item']
        body = response.xpath('.//*[@class="postBody"]')
        item['image_urls'] = body.xpath('.//img/@src').extract()
        yield item


if __name__ == '__main__':
    # settings = get_project_settings()  # first way to start the crawler
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    })
    process.settings = get_project_settings()
    process.crawl(SpiderUserCrawlSpider)
    # If there are several spiders, add more crawl() calls
    # process.crawl(SecondTestSpider)
    process.start()

That is the spider's source code. It never scraped any results, and when I stepped through it with breakpoints I found that execution never entered parse_item at all. I searched online for a long time. Many people pointed out that the callback must not be the default parse, but mine was already parse_item.
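I then tried renaming parse_item to something else, but that did not help either.
It was only when I saw someone in a forum thread mention that nothing gets scraped when no links are matched that I went back to check the regular expression in LinkExtractor(allow="/qiyeboy/default.html?page=\d{1,}").
Sure enough, when I tested it with Regex Match Tracer 2.1, the expression could not match the link http://www.cnblogs.com/qiyeboy/default.html?page=1.
On closer inspection, the question mark in the expression was not escaped. An unescaped ? is a quantifier that makes the preceding l optional, so the pattern never matches the literal ? in the URL.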
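The mismatch can also be reproduced without an external tool. The following is a minimal sketch using Python's standard re module. It assumes that the allow pattern is applied to the URL as a regex search, so plain re.search stands in for LinkExtractor here. The helper name matches and the third test URL are made up for illustration.

```python
import re

# Pattern as written in the spider: the "?" acts as a quantifier on "l".
UNESCAPED = r"/qiyeboy/default.html?page=\d{1,}"
# Same pattern with the question mark escaped, so it matches a literal "?".
ESCAPED = r"/qiyeboy/default.html\?page=\d{1,}"

# The start URL from the spider.
URL = "http://www.cnblogs.com/qiyeboy/default.html?page=1"


def matches(pattern: str, url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL (hypothetical helper)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    print(matches(UNESCAPED, URL))  # False: "?" in the URL is never consumed
    print(matches(ESCAPED, URL))    # True
    # A made-up URL without "?" shows what the unescaped pattern accepts instead.
    print(matches(UNESCAPED, "http://www.cnblogs.com/qiyeboy/default.htmpage=1"))  # True
```

The last call shows the pattern's real meaning: "default.htm", an optional "l", then "page=" immediately after, with no question mark in between.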
After changing it to
links = LinkExtractor(allow="/qiyeboy/default.html\?page=\d{1,}")
the spider crawled the data normally, saved it as JSON, and downloaded the images.
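One way to avoid forgetting an escape like this is to let re.escape handle the literal parts of the pattern. This is only a sketch, not something from the book: page_link_pattern and PATTERN are hypothetical names, and it also escapes the dot in "default.html", which the fixed pattern above leaves as a wildcard.

```python
import re


def page_link_pattern(path: str, param: str) -> str:
    """Build a regex for '<path>?<param>=<digits>' with literal parts escaped.

    Hypothetical helper: re.escape() protects regex metacharacters in the
    fixed text, and the "?" separator is escaped explicitly.
    """
    return re.escape(path) + r"\?" + re.escape(param) + r"=\d+"


# Pattern for the paginated list pages of the blog used in the spider.
PATTERN = page_link_pattern("/qiyeboy/default.html", "page")

if __name__ == "__main__":
    url = "http://www.cnblogs.com/qiyeboy/default.html?page=12"
    print(bool(re.search(PATTERN, url)))
```

The resulting string could then be passed as the allow argument, for example LinkExtractor(allow=PATTERN).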
04-23 05:58