Problem description
I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist.
Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx>
The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197 doesn't, so the crawler can't find it. I've crawled many other websites and haven't had this problem anywhere else. Is there a way to stop this redirect?
Any help would be much appreciated. Thanks.
Update: here is my spider class
class Inon_Spider(BaseSpider):
    name = 'Inon'
    allowed_domains = ['www.shop.inonit.in']
    start_urls = ['http://www.shop.inonit.in/Products/Inonit-Gadget-Accessories-Mobile-Covers/-The-Red-Tag/Samsung-Note-2-Dead-Mau/pid-2656465.aspx']

    def parse(self, response):
        item = DealspiderItem()
        hxs = HtmlXPathSelector(response)
        title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
        price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
        prc = price[0].replace("Rs. ","")
        description = []
        item['price'] = prc
        item['title'] = title
        item['description'] = description
        item['url'] = response.url
        return item
Recommended answer
Yes, you can do this simply by adding meta values to the request, for example:
meta={'dont_redirect': True}
You can also handle specific response codes yourself instead of having them redirected, for example:
meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}
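Here 'dont_redirect' tells Scrapy not to follow the redirect, and 'handle_httpstatus_list' lets the listed status codes (302 in this case) reach your callback instead of being dropped as non-2xx responses. You can list as many HTTP status codes as you want handled this way.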
Example
yield Request('some url',
              meta={
                  'dont_redirect': True,
                  'handle_httpstatus_list': [302]
              },
              callback=self.some_call_back)
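If several requests in a spider need the same treatment, the meta dict can be built in one place. The following is only a minimal sketch: `no_redirect_meta` is a hypothetical helper name, not part of Scrapy, and it simply assembles the two meta keys shown above.

```python
def no_redirect_meta(*status_codes):
    """Build a meta dict that asks Scrapy not to follow redirects.

    status_codes are the HTTP status codes that should still be passed
    to the callback. This helper is hypothetical; it only assembles the
    'dont_redirect' and 'handle_httpstatus_list' keys from the answer.
    """
    meta = {'dont_redirect': True}
    if status_codes:
        # De-duplicate and sort so the resulting list is predictable.
        meta['handle_httpstatus_list'] = sorted(set(status_codes))
    return meta


# Intended use inside a spider (assumes Scrapy's Request is imported):
#     yield Request(url, meta=no_redirect_meta(301, 302),
#                   callback=self.some_call_back)
print(no_redirect_meta(301, 302))
```

Because the redirect is no longer followed, the callback receives the 302 response itself, so any parsing code should be prepared for a body that is not the product page.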