Problem description
I am trying to scrape this page: http://www.nitt.edu/prm/nitreg/ShowRes.aspx
The code is as follows:
import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})
print viewstate[0]['value']
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve results(instead of writing out the whole HTML file)
    # Besides, since the result is split into multipages,
    # we need send more HTTP requests
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
I keep getting a server error:
Source Error:
An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Stack Trace:
[FormatException: Invalid character in a Base-64 string.]
System.Convert.FromBase64String(String s) +0
System.Web.UI.LosFormatter.Deserialize(String input) +25
System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +101
[HttpException (0x80004005): Invalid_Viewstate
Client IP: 10.0.0.166
Port: 51915
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
Http-Referer:
Path: /prm/nitreg/ShowRes.aspx.]
System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +447
System.Web.UI.Page.LoadPageViewState() +18
System.Web.UI.Page.ProcessRequestMain() +447
Invalid character in a Base-64 string. What is the problem?
Recommended answer
You are using the ViewState input object, not the value. The error report shows that the whole tag was sent:
ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
Your formData should be:
formData = (
    ('__EVENTVALIDATION', eventvalidation[0]['value']),
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
Note that your eventvalidation value has the same issue; I fixed it too.
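To make the failure mode concrete, here is a minimal sketch of what happens when a list of tags is URL-encoded instead of the tag's value. It is written for Python 3 (the question's code is Python 2) and uses only the standard library. FakeTag is a hypothetical stand-in for a parsed input element, not a BeautifulSoup class, and the Base-64 string is a shortened sample rather than the page's real view state.

```python
import base64
import binascii
from urllib.parse import urlencode, parse_qs


class FakeTag:
    """Hypothetical stand-in for a parsed <input> element (not a bs4 class)."""

    def __init__(self, value):
        self.attrs = {"value": value}

    def __getitem__(self, key):
        return self.attrs[key]

    def __repr__(self):
        return '<input name="__VIEWSTATE" type="hidden" value="%s"/>' % self.attrs["value"]


def is_valid_base64(text):
    """Return True if text decodes as strict Base-64."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# A one-element list, like the result of findAll in the question.
viewstate = [FakeTag("dDwtMTM3NzI1MDM3Ow==")]  # shortened sample value

# Wrong: urlencode calls str() on the list, so the markup itself is sent.
wrong = parse_qs(urlencode((("__VIEWSTATE", viewstate),)))["__VIEWSTATE"][0]

# Right: index into the list and read the tag's value attribute.
right = parse_qs(urlencode((("__VIEWSTATE", viewstate[0]["value"]),)))["__VIEWSTATE"][0]

print(wrong)
print(is_valid_base64(wrong), is_valid_base64(right))
```

The first printed line has the same shape as the ViewState entry in the server's error report: brackets and tag markup wrapped around the real value, which is why the Base-64 decoder on the server rejects it.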
The __EVENTVALIDATION field does not exist on that page. You can just remove __EVENTVALIDATION from formData.
formData = (
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
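One way to avoid hard-coding hidden fields that may or may not be on the page is to collect whatever hidden inputs the form actually contains and add the visible fields on top. The sketch below is an assumption-laden illustration, not the answer's method: it is Python 3, it swaps in the standard library's html.parser for BeautifulSoup so it has no third-party dependency, the names HiddenInputCollector and build_form_data are hypothetical, and sample_html is a made-up page rather than the real response from the site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            # Store the attribute value (a string), never the tag itself.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_data(html, extra_fields):
    """Merge the page's hidden fields with the fields the user fills in."""
    collector = HiddenInputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    form.update(extra_fields)
    return form


# Made-up sample page: it has __VIEWSTATE but no __EVENTVALIDATION.
sample_html = '''
<form method="post" action="ShowRes.aspx">
<input type="hidden" name="__VIEWSTATE" value="dDwtMTM3NzI1MDM3Ow==" />
<input name="TextBox1" type="text" />
<input type="submit" name="Button1" value="Show" />
</form>
'''

form_data = build_form_data(sample_html, {"TextBox1": "106110006", "Button1": "Show"})
encoded = urlencode(form_data)
print(encoded)
```

Because only fields present in the markup are copied, a missing __EVENTVALIDATION is simply never sent, and every value is already a plain string by the time it reaches urlencode.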