Scraping an aspx web page in Python with BeautifulSoup

Question

I am trying to scrape this page: http://www.nitt.edu/prm/nitreg/ShowRes.aspx

Here is my code:

import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})

print viewstate[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)

encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)

try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve results(instead of writing out the whole HTML file)
    # Besides, since the result is split into multipages,
    # we need send more HTTP requests
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()

I keep getting a server error:

Source Error:

An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

Stack Trace:

[FormatException: Invalid character in a Base-64 string.]
   System.Convert.FromBase64String(String s) +0
   System.Web.UI.LosFormatter.Deserialize(String input) +25
   System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +101

[HttpException (0x80004005): Invalid_Viewstate
    Client IP: 10.0.0.166
    Port: 51915
    User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
    ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
    Http-Referer:
    Path: /prm/nitreg/ShowRes.aspx.]
   System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +447
   System.Web.UI.Page.LoadPageViewState() +18
   System.Web.UI.Page.ProcessRequestMain() +447

Invalid character in a Base-64 string. What is the problem?

Recommended answer

You are sending the ViewState input object (the whole list of matching input elements returned by findAll), not its value.

ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
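To see why the server reports an invalid Base-64 character, it may help to compare a bare value with the same value wrapped in element markup. The following is a minimal Python 3 sketch using only the standard library. The helper name is_valid_base64 and the sample strings are made up for illustration and are not taken from the page.

```python
import base64
import binascii


def is_valid_base64(text):
    """Return True if text decodes under strict Base-64 rules."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Hypothetical stand-in for a real __VIEWSTATE value.
sample_value = base64.b64encode(b"example view state").decode("ascii")

# Roughly what gets posted when the whole element list is sent instead:
# brackets, angle brackets, spaces and quotes are not Base-64 characters.
sample_element = '[<input name="__VIEWSTATE" type="hidden" value="%s"/>]' % sample_value

print(is_valid_base64(sample_value))    # True
print(is_valid_base64(sample_element))  # False
```

The second string fails for the same kind of reason the stack trace gives: the markup around the value contains characters outside the Base-64 alphabet.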

Your formData should be:

formData = (
    ('__EVENTVALIDATION', eventvalidation[0]['value']),
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)

Note that your eventvalidation value has the same issue; I fixed it too.
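The difference between the element and its value can be checked on a small inline snippet. This is a sketch for Python 3 with the bs4 package installed. The HTML string is a shortened, made-up stand-in for the real page, and "html.parser" is chosen here only so that no extra parser library is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's form markup.
html = '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTM3"/></form>'
soup = BeautifulSoup(html, "html.parser")

# find_all is the current name for findAll; it returns a list-like
# collection of Tag objects, not strings.
matches = soup.find_all("input", {"type": "hidden", "name": "__VIEWSTATE"})
tag = matches[0]

print(str(tag))      # the full <input .../> markup
print(tag["value"])  # only the attribute value, which is what the form needs
```

Passing matches or tag into the form data sends the markup, while tag["value"] sends just the Base-64 string.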

The __EVENTVALIDATION field does not exist on that page, so findAll returns an empty list for it and eventvalidation[0] would raise an IndexError. You can simply remove __EVENTVALIDATION from formData:

formData = (
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED',''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
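For readers on Python 3, where urllib.urlencode has moved to urllib.parse, the corrected form data could be encoded roughly as follows. This is only a sketch: build_post_request is a hypothetical helper, the ViewState string is a placeholder, and the request is built but deliberately not sent.

```python
from urllib.parse import urlencode, parse_qsl
from urllib.request import Request

URL = "http://www.nitt.edu/prm/nitreg/ShowRes.aspx"


def build_post_request(viewstate_value, textbox_value):
    """Build (but do not send) the form POST with the corrected fields."""
    form_data = (
        ("__VIEWSTATE", viewstate_value),
        ("__VIEWSTATEENCRYPTED", ""),
        ("TextBox1", textbox_value),
        ("Button1", "Show"),
    )
    # urlencode percent-escapes characters such as '+', '/' and '=' so the
    # Base-64 value survives transport; Request expects the body as bytes.
    body = urlencode(form_data).encode("ascii")
    return Request(URL, data=body)


# Placeholder ViewState value, not the real one from the page.
request = build_post_request("dDwtMTM3+/=", "106110006")
print(request.get_method())
print(request.data.decode("ascii"))
```

Sending it would then be a matter of passing request to urllib.request.urlopen, after first fetching the page to read the current __VIEWSTATE value.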
