I'm trying to scrape an ASP-powered site using ScraperWiki's tools.
I want to grab a list of BBSes in a particular area code from the BBSmates.com website. The site displays 20 BBS search results at a time, so I will have to do form submits in order to move from one page of results to the next.
This blog post helped me get started. I thought the following code would grab the final page of BBS listings for the 314 area code (page 79).
However, the response I get is the FIRST page.
import mechanize
url = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open(url)
html = response.read()
br.select_form(name='aspnetForm')
br['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$GridView1'
br['__EVENTARGUMENT'] = 'Page$79'
print(br.form)
response2 = br.submit()
html2 = response2.read()
print(html2)
The blog post I cited above mentions that in their case there was a problem with a SubmitControl, so I tried disabling the two SubmitControls on this form.
br.find_control("ctl00$cmdLogin").disabled = True
Disabling cmdLogin generated HTTP Error 500.
br.find_control("ctl00$ContentPlaceHolder1$Button1").disabled = True
Disabling ContentPlaceHolder1$Button1 didn't make any difference. The submit went through, but the page it returned was still page 1 of the search results.
It's worth noting that this site does NOT use "Page$Next."
Can anyone help me figure out what I need to do to get ASPX form submit to work?
You need to post the values the page gives (EVENTVALIDATION, VIEWSTATE, etc.).
This code will work (note that it uses the awesome Requests library rather than Mechanize):
import lxml.html
import requests
starturl = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
s = requests.session() # create a session object
r1 = s.get(starturl) #get page 1
html = r1.text
root = lxml.html.fromstring(html)
# pick up the hidden form field values that the page expects to get back
# find the __EVENTVALIDATION value
EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
# find the __VIEWSTATE value
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']
# build a dictionary to post to the site with the values we have collected.
# The __EVENTARGUMENT can be changed to fetch another result page (3, 4, 5, etc.)
payload = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
    '__EVENTARGUMENT': 'Page$25',
    '__EVENTVALIDATION': EVENTVALIDATION,
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEENCRYPTED': '',
    'ctl00$txtUsername': '',
    'ctl00$txtPassword': '',
    'ctl00$ContentPlaceHolder1$txtBBSName': '',
    'ctl00$ContentPlaceHolder1$txtSysop': '',
    'ctl00$ContentPlaceHolder1$txtSoftware': '',
    'ctl00$ContentPlaceHolder1$txtCity': '',
    'ctl00$ContentPlaceHolder1$txtState': '',
    'ctl00$ContentPlaceHolder1$txtCountry': '',
    'ctl00$ContentPlaceHolder1$txtZipCode': '',
    'ctl00$ContentPlaceHolder1$txtAreaCode': '314',
    'ctl00$ContentPlaceHolder1$txtPrefix': '',
    'ctl00$ContentPlaceHolder1$txtDescription': '',
    'ctl00$ContentPlaceHolder1$Activity': 'rdoBoth',
    'ctl00$ContentPlaceHolder1$drpRPP': '20',
}
# post it
r2 = s.post(starturl, data=payload)
# our response is now the result page requested in __EVENTARGUMENT
print(r2.text)
When you get to the end of the results (result page 21), you have to pick up the VIEWSTATE and EVENTVALIDATION values again (and do so every 20 pages).
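One way to avoid tracking when the values need refreshing is to re-read them from every response before asking for the next page, which is roughly what a browser does. The following is only a sketch under assumptions, not something tested against BBSmates: it is written for Python 3, it uses the standard library's `html.parser` instead of lxml, the names `hidden_inputs` and `fetch_pages` are made up for illustration, and it assumes the pages can be walked in order starting from page 1.

```python
from html.parser import HTMLParser

# __EVENTTARGET value taken from the payload in the answer
GRID = "ctl00$ContentPlaceHolder1$GridView1"


class _HiddenInputs(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        is_hidden = (a.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and a.get("name"):
            self.values[a["name"]] = a.get("value") or ""


def hidden_inputs(html):
    """Return the hidden form fields (__VIEWSTATE, __EVENTVALIDATION, ...)."""
    parser = _HiddenInputs()
    parser.feed(html)
    return parser.values


def fetch_pages(session, url, last_page, search_fields):
    """Yield (page_number, html) for pages 1..last_page, in order.

    `session` is anything with requests-style get()/post() methods;
    `search_fields` holds the ctl00$... search values from the payload.
    """
    html = session.get(url).text
    yield 1, html
    for page in range(2, last_page + 1):
        payload = dict(search_fields)
        # take fresh state from the page we just received
        payload.update(hidden_inputs(html))
        payload["__EVENTTARGET"] = GRID
        payload["__EVENTARGUMENT"] = "Page$%d" % page
        html = session.post(url, data=payload).text
        yield page, html
```

With Requests installed, this could be driven with something like `for n, html in fetch_pages(requests.session(), starturl, 79, search_fields): ...`, where `search_fields` is the payload dictionary without the four `__EVENT...` and `__VIEWSTATE` entries. Refreshing on every page costs nothing extra, since each response already contains the new values.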
Note that some of the values you post are empty and some carry values. The payload dictionary above shows the full list.
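A hedged sketch of how the empty and filled-in fields could be read from the page itself rather than typed by hand. It is written for Python 3 and uses only the standard library's `html.parser`; the class and function names are hypothetical, and it makes simplifying assumptions: every `<option>` has a `value` attribute, and `<textarea>` and multi-select controls are ignored. It has not been run against the real site.

```python
from html.parser import HTMLParser


class _FormDefaults(HTMLParser):
    """Collect the name/value pairs a browser would send for an untouched form."""

    # input types that are not sent unless the user activates them
    SKIPPED = {"submit", "button", "image", "reset", "file"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in self.SKIPPED:
                return
            if kind in ("radio", "checkbox"):
                if "checked" not in a:
                    return  # unchecked boxes and radios are not submitted
                self.fields[a["name"]] = a.get("value") or "on"
            else:
                self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
        elif tag == "option" and self._select:
            # the first option is the default unless one is marked selected
            if self._select not in self.fields or "selected" in a:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def form_defaults(html):
    """Return a dict of field name -> default value found in `html`."""
    parser = _FormDefaults()
    parser.feed(html)
    return parser.fields
```

The resulting dictionary can serve as the starting point for a payload: copy it, then set `__EVENTTARGET` and `__EVENTARGUMENT` before posting. Submit buttons such as `ctl00$cmdLogin` are left out on purpose, since a browser only sends a button that was actually clicked.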
Here is a discussion on the Scraperwiki mailing list on a similar problem: https://groups.google.com/forum/#!topic/scraperwiki/W0Xi7AxfZp0