I want to scrape this website. I wrote a spider, but it only crawls the first page, i.e. the first 52 items.
I have tried this code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

a = []

from aqaq.items import aqaqItem
import os
import urlparse
import ast


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/",
    ]

    def parse(self, response):
        # ... Extract items in the page using extractors
        n = 3
        ct = 1
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="page"]')
        for site in sites:
            name = site.select('//div[@id="content"]/div[@class="l-pageWrapper"]/div[@class="l-main"]/div[@class="box box-bgcolor"]/section[@class="box-bd pan mtm"]/ul[@id="productsCatalog"]/li/a/@href').extract()
            print name
            print ct
            ct = ct + 1
            a.append(name)
        req = Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=" + str(n),
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, dont_filter=True)
        return req  # and your items
It shows the following output:
2013-10-31 09:22:42-0500 [jabong] DEBUG: Crawled (200) <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> (referer: http://www.jabong.com/women/clothing/womens-tops/)
2013-10-31 09:22:42-0500 [jabong] DEBUG: Filtered duplicate request: <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-10-31 09:22:42-0500 [jabong] INFO: Closing spider (finished)
2013-10-31 09:22:42-0500 [jabong] INFO: Dumping Scrapy stats:
When I add dont_filter=True, it never stops.

Best answer:
Yes, dont_filter has to be used here, because every time you scroll the page down, only the page GET parameter in the XHR request changes, i.e. it becomes http://www.jabong.com/women/clothing/womens-tops/?page=X.
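A quick way to confirm this is to replay that XHR request outside the browser and check that product markup comes back. The sketch below uses the requests library (not part of the original setup) and assumes the endpoint still answers these headers the way it did for the browser:

import requests

# Replay the XHR the page fires while scrolling; only the "page" parameter changes.
response = requests.get(
    "http://www.jabong.com/women/clothing/womens-tops/",
    params={"page": 2},
    headers={
        "Referer": "http://www.jabong.com/women/clothing/womens-tops/",
        "X-Requested-With": "XMLHttpRequest",
    },
)
print(response.status_code)
# If the markup is unchanged, the body should contain <li data-url="..."> product entries.
print(response.text[:300])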
Now you need to figure out how to stop crawling. That is actually quite simple: just check whether the next page has no products, and if so raise a CloseSpider exception.
Here is a complete code example that works for me (it stops at page 234):
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request


class Product(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/?page=1",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")

        if not products:
            raise CloseSpider("No more products!")

        for product in products:
            item = Product()
            item['brand'] = product.xpath(".//span[contains(@class, 'qa-brandName')]/text()").extract()[0].strip()
            item['title'] = product.xpath(".//span[contains(@class, 'qa-brandTitle')]/text()").extract()[0].strip()
            yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse,
                      dont_filter=True)
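For reference, assuming this lives in a regular Scrapy project, the same spider can also be launched from a plain Python script with Scrapy's CrawlerProcess (a sketch only; newer Scrapy versions expect scrapy.Spider rather than BaseSpider, and the feed settings here simply dump every yielded Product to a JSON file):

from scrapy.crawler import CrawlerProcess

# Run the spider in-process; the crawl keeps going until the
# CloseSpider("No more products!") exception stops it.
process = CrawlerProcess({
    "FEED_URI": "products.json",
    "FEED_FORMAT": "json",
})
process.crawl(aqaqspider)
process.start()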
About "javascript - How to scrape a website with infinite scrolling?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19709086/