Using Scrapy to scrape an ASP.NET website with JavaScript buttons and AJAX requests
Problem description
I have been trying to scrape some data from an ASP.NET website. The start page is: http://www.e3050.com/Items.aspx?cat=SON
First, I want to display 50 items per page (using the select element). Second, I want to paginate through the pages.
I tried the following code to get 50 items per page, but it didn't work:
start_urls = ["http://www.e3050.com/Items.aspx?cat=SON"]

def parse(self, response):
    requests = []
    hxs = HtmlXPathSelector(response)
    # Check if there's more than 1 page
    if len(hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()) > 0:
        # Get last page number
        last_page = hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()[0]
        i = 1
        # preparing requests for each page
        while i < (int(last_page) / 5) + 1:
            requests.append(Request("http://www.e3050.com/Items.aspx?cat=SON", callback=self.parse_product))
            i += 1
    # posting form data (50 items and next page button)
    requests.append(FormRequest.from_response(
        response,
        formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                  '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01'},
        callback=self.parse_product,
        dont_click=True
    ))
    for request in requests:
        yield request
Recommended answer
Here is a complete solution.
The parse method selects 50 products per page.
The page_rs_50 method handles the pagination.
# Written against the legacy Scrapy selector API (HtmlXPathSelector / .select());
# newer Scrapy releases use response.xpath() instead.
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# Spider class body (the class statement itself is not shown in the original):
start_urls = ['http://www.e3050.com/Items.aspx?cat=SON']
pro_urls = []  # all product URLs

def parse(self, response):  # select 50 products on each page
    yield FormRequest.from_response(
        response,
        formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                  'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)'},
        meta={'curr': 1, 'total': 0, 'flag': True},
        dont_click=True,
        callback=self.page_rs_50)

def page_rs_50(self, response):  # paginate the pages
    hxs = HtmlXPathSelector(response)
    curr = int(response.request.meta['curr'])
    total = int(response.request.meta['total'])
    flag = response.request.meta['flag']
    # collect the product links found on the current page
    self.pro_urls.extend(hxs.select(
        "//td[@class='name']//a[contains(@id,'ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_itemslv_ctrl')]/@href"
    ).extract())
    if flag:
        # first results page only: read the total number of pages
        # (converted to int so the comparison with curr below is numeric)
        total = int(hxs.select(
            "//span[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_pagesizeBtm']/text()").re(r'\d+')[0])
    if curr < total:
        curr += 1
        # post back the "next page" button together with the page state
        yield FormRequest.from_response(
            response,
            formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                      'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)',
                      'ctl00$ctl00$ScriptManager1': 'ctl00$ctl00$ScriptManager1|ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                      '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                      'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$hfVSFileName': hxs.select(
                          ".//input[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_hfVSFileName']/@value").extract()[0]},
            meta={'curr': curr, 'total': total, 'flag': False},
            dont_click=True,
            callback=self.page_rs_50)
    else:
        # last page reached: request every collected product page
        for pro in self.pro_urls:
            yield Request("http://www.e3050.com/%s" % pro,
                          callback=self.parse_product)

def parse_product(self, response):
    pass
    # TODO: implementation required for parsing
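
To make it clearer what the form posts in the answer above are doing, here is a minimal standalone sketch that uses only the Python standard library. It collects the hidden inputs of a page (such as __VIEWSTATE) and adds __EVENTTARGET plus any overrides, which is roughly the kind of payload FormRequest.from_response assembles from the page's form. The names HiddenInputCollector, build_postback, and SAMPLE_HTML, as well as the shortened field names in the sample page, are assumptions made for illustration and do not come from the real site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, overrides=None, script_manager=None):
    """Return the form fields for a postback triggered by event_target.

    script_manager is the (assumed) name of the ScriptManager field; when
    given, it is set to "<script_manager>|<event_target>", mirroring the
    payload used in the answer above.
    """
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)  # e.g. __VIEWSTATE, __EVENTVALIDATION
    payload["__EVENTTARGET"] = event_target
    payload.setdefault("__EVENTARGUMENT", "")
    if script_manager:
        payload[script_manager] = "%s|%s" % (script_manager, event_target)
    payload.update(overrides or {})  # e.g. the page-size dropdown value
    return payload


# Hypothetical, heavily shortened page used only for illustration.
SAMPLE_HTML = """
<form method="post" action="Items.aspx?cat=SON">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <select name="ctl00$pagesddl"><option value="10">10</option></select>
</form>
"""

if __name__ == "__main__":
    payload = build_postback(
        SAMPLE_HTML,
        event_target="ctl00$pager1$ctl00$ctl01",
        overrides={"ctl00$pagesddl": "50"},
        script_manager="ctl00$ScriptManager1",
    )
    for name in sorted(payload):
        print(name, "=", payload[name])
```

In a spider this step is normally left to FormRequest.from_response, as in the answer. The sketch is only meant to show why the state fields of the current page have to travel with each "next page" request, and why each request is built from the most recent response.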