Combining the base URL with an extracted href in Scrapy
Question
Below is my spider code:
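    # Imports assumed for the old Scrapy (0.x) / Python 2 API used in this code.
    import urlparse

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider


    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                # This is the line that fails: the base URL has no scheme.
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            print response, '------->'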
Here I am trying to combine the href link with the base link, but I get the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone tell me why I am getting this error, and how to join the base URL with the href link and yield a request?
Recommended answer
The error occurs because your base URL has no scheme, e.g. http://. Scrapy's Request refuses any URL that does not start with one.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
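To see why the scheme matters, here is a minimal sketch using only the standard library. The href value is an assumption (the question does not show the raw href), and the try/except import is only there so the snippet runs on both Python 3 and the Python 2 used in the question.

    try:
        from urllib.parse import urljoin, urlparse  # Python 3
    except ImportError:
        from urlparse import urljoin, urlparse  # Python 2, as in the question

    # Assumed example of what i[1:] might contain; not taken from the real page.
    href = "bookstore/detail/3271993?alt=Something+I+Had+To+Do"

    # Base without a scheme: the joined result has no scheme either.
    no_scheme = urljoin("www.domain.com/", href)

    # Base with a scheme: the joined result is a full absolute URL.
    with_scheme = urljoin("http://www.domain.com/", href)

    print(repr(urlparse(no_scheme).scheme))    # empty scheme, which Request rejects
    print(repr(urlparse(with_scheme).scheme))  # 'http'
    print(with_scheme)

urljoin never invents a scheme, so whatever the base lacks, the result lacks too.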
Or, even simpler: urlparse.urljoin(response.url, i[1:])
since urlparse.urljoin will work out the base URL from the response URL itself.
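One caveat when joining against response.url: whether to keep the i[1:] slice depends on what the hrefs look like. The sketch below assumes root-relative hrefs (starting with "/"), which the question does not confirm. In that case the href can be passed unchanged, while stripping the leading slash makes it relative to the current page's directory. The page URL is the start URL from the question; the href is a made-up example.

    try:
        from urllib.parse import urljoin  # Python 3
    except ImportError:
        from urlparse import urljoin  # Python 2, as in the question

    # response.url for the listing page in the question.
    page_url = "http://www.domain.com/bookstore/new"

    # Assumed root-relative href; the real markup is not shown in the question.
    href = "/bookstore/detail/3271993?alt=Something+I+Had+To+Do"

    # Passing the href as-is resolves it against the site root.
    full = urljoin(page_url, href)

    # Dropping the leading slash resolves it against /bookstore/ instead.
    stripped = urljoin(page_url, href[1:])

    print(full)      # http://www.domain.com/bookstore/detail/3271993?alt=Something+I+Had+To+Do
    print(stripped)  # http://www.domain.com/bookstore/bookstore/detail/3271993?alt=Something+I+Had+To+Do

Later Scrapy releases also provide response.urljoin(href), which performs a similar join against the response's URL, so the manual urlparse import may not be needed there.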