Problem description
I have been trying to hone my Python skills by building scrapers, and recently switched from bs4 to Scrapy so that I can use its multithreading and download delay features. I have been able to make a basic scraper and output the data to CSV, but when I try to add a recursive feature I run into problems. I tried following the advice from "Scrapy Recursive download of Content" but keep getting the following error:
DEBUG: Retrying http://medford.craigslist.org%20%5Bu'/cto/4359874426.html'%5D> DNS lookup failed: address not found
This makes me think the way I am trying to join the links isn't working, as it's inserting characters into the URL, but I can't figure out how to fix it. Any advice?
Here is the code:
#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *

class PageSpider(BaseSpider):
    name = "cto"
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ), restrict_xpaths=('//p[@class="nextpage"]' ,))
        , callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for titles in titles:
            item = CraigslistSampleItem()
            item['title'] = titles.select("a/text()").extract()
            item['link'] = titles.select("a/@href").extract()
            url = "http://medford.craigslist.org %s" % item['link']
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
Recommended answer
It turns out that your code:
url = "http://medford.craigslist.org %s" % item['link']
generates:
http://medford.craigslist.org [u'/cto/4359874426.html']
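item['link'] is a list in your code, not the string you are expecting, because extract() always returns a list of matches even when there is only one. Formatting that list with %s inserts its repr (brackets, quotes and the u prefix) after the space, which is what ends up percent-encoded in the failing URL. You need to do this: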
url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
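As a minimal sketch of the same idea outside Scrapy, the snippet below reproduces the malformed URL from a list, decodes the percent-escapes seen in the error message, and then builds a valid URL with the standard library's urljoin. It assumes Python 3 (so the u prefix from the Python 2 log does not appear in the reproduced string), and the names BASE_URL and build_absolute_url are hypothetical helpers introduced here for illustration; they are not part of the original code or of Scrapy.

```python
from urllib.parse import unquote, urljoin

# Assumed base: the listing page used as start_urls in the question.
BASE_URL = "http://medford.craigslist.org/cto/"


def build_absolute_url(base_url, extracted):
    """Turn the list returned by .extract() into one absolute URL.

    Returns None when nothing was extracted, so the caller can skip the row.
    """
    if not extracted:
        return None
    # Use the first match only; urljoin resolves it against the base URL.
    return urljoin(base_url, extracted[0].strip())


# What .extract() hands back for a single href: a list, not a string.
links = ["/cto/4359874426.html"]

# Formatting the list itself embeds its repr, plus the stray space.
broken = "http://medford.craigslist.org %s" % links

# Decoding the tail of the URL from the error log shows the same shape.
decoded_tail = unquote("%20%5Bu'/cto/4359874426.html'%5D")

fixed = build_absolute_url(BASE_URL, links)

print(broken)
print(decoded_tail)
print(fixed)
```

Taking the first element rather than joining the whole list avoids concatenating several hrefs if the XPath ever matches more than one link. That is a design preference for this sketch, not something the answer above prescribes.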