Though I'm not particularly advanced at any of this, I've had some past success using urllib2, requests and scrapy, but this has me stumped. So after much searching and banging my head against the keyboard, I'll just go ahead and ask.
I'd like to get the HTML source of a site, but after sending my username and password I keep getting a page back that says my username and password are wrong. They work fine in the browser, and once logged in the source is readily available (via the browser). But I can't seem to achieve the same result via Python/terminal. I'll include some of my attempts (gleaned from these helpful pages) below:
using urllib2:
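import base64
from urllib2 import Request, urlopen
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
# HTTP Basic auth: base64 of "username:password", sent in the Authorization header
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
req.add_header("Authorization", "Basic %s" % base64string)
readweb = urlopen(req).read()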
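For comparison, here is a small sketch (Python 3, standard library only) of the two encodings that tend to get mixed up. HTTP Basic auth base64-encodes `username:password` into a header, whereas an HTML login form sends URL-encoded fields in the POST body, and only the encoding library needs to worry about punctuation. The credentials and the field name `password_field` are made up for illustration.

```python
import base64
from urllib.parse import urlencode, parse_qs


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def form_body(fields):
    """URL-encode form fields the way a browser does for a POST body."""
    return urlencode(fields)


# Made-up credentials containing punctuation such as !@$.
header = basic_auth_header("user", "p!@$s")
body = form_body({"password_field": "p!@$s"})  # hypothetical field name

print(header)
print(body)
# Decoding the body gives back the original password unchanged.
print(parse_qs(body))
```

If the site uses a login form rather than a Basic auth challenge, the header variant is probably ignored by the server, which would be consistent with the "wrong password" page.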
another version:
import urllib2
# Let urllib2 answer an HTTP Basic auth challenge for theurl automatically
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl, username, password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
pagehandle = opener.open(theurl)
return pagehandle.read()
and an attempt using requests:
import requests
r = requests.session()
# 'username' and 'password' here are placeholder field names and values
r.post(theurl, data={'username': 'username', 'password': 'password', 'remember': '1'})
print('Sorry, Unable to...')
result = r.get(theurl)
return result.text
I've also tried scrapy, but whichever library I use, it comes back with the HTML of a page that says my password/details are wrong. I'm guessing it has something to do with the headers/authorisation(?) I'm sending, but I'm not sure. Any help is much appreciated; please let me know what other details I can add (I've been up half the night with this, so if this post doesn't make sense, please forgive me!)
Here's the traceback response to Prashant's answer below (minus the passwords etc.):
Traceback (most recent call last):
Ok, I'm now using mechanize (recommended below), and here's what I'm getting back (not sure if this is another instance of my root problem or my inability with mechanize!):
Traceback (most recent call last):
Still struggling with this, so here's a last-ditch effort before time runs out on this project and I have to go in and get all the HTML manually! Fingers crossed.
Ok, so on the advice of barny, I'm back to using requests, and I'm attempting to provide the request with cookie information that I've gleaned from a successful browser login. I'm not certain I'm doing this correctly, but I'm using:
cookies = {'PHPSESSID':'5udcifi6p43ma3h1fnpfqghiu0'}
result = sess.get(the_url, cookies=cookies)
Now, at the moment, I'm getting an Internal Server Error response. After some research, ASP.NET forms seem to be the problem:
I just want to check that I'm not doing something wrong with requests first, then perhaps I'll explore BeautifulSoup/robobrowser as recommended by Martijn Pieters in the SO link above.
Here's what the form section of the html is asking:
<form name="aspnetForm" method="post" action="" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATEFIELDCOUNT" id="__VIEWSTATEFIELDCOUNT" value="2" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTkwNzg1NTQ3OA9kFgJmD2QWAmYPZBYGAgetc." />
<input type="hidden" name="__VIEWSTATE1" id="__VIEWSTATE1" value="ZyBBIEhvbWUVIE5lZ290aWF0ZSBBZ3JlZW1lbnRzEiBSZetc." />
<script type="text/javascript">
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
<script src="/WebResource.axd?d=t2SAOwDGkbrEfkmUaMOR9sPLXqgxfeenNayRja3DNK2R8JEcH-StTTuiaqXpzp--PAISn3vzVbWQ7biREwPkibCmbAE1&t=635586505120000000" type="text/javascript"></script>
<script src="/ScriptResource.axd?d=EL6tXtJfNfGSoQwhYtVnYEqw4oKvuwBBI4etc." type="text/javascript"></script>
<script type="text/javascript">
if (typeof(Sys) === 'undefined') throw new Error('ASP.NET Ajax client-side framework failed to load.');
<script src="/ScriptResource.axd?d=qCmNMcECQa0tfmMcZdwJeeOdcyetc." type="text/javascript"></script>
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="FC5C7135" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdABB2xJRvPLCcg6GsBqRFCtw6Xg91QEu10etc." />
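As a rough illustration of what a form like this expects, the sketch below (Python 3, standard library only) collects every hidden input from the login page's HTML and merges in the credentials, so the hidden values such as `__VIEWSTATE` and `__EVENTVALIDATION` are posted back along with them. The sample HTML is shortened and its values are invented; the two credential field names are the ones found in the page source, and a real page may need further fields (for example the submit button's name), so treat this as an assumption-laden sketch rather than a working login.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_payload(login_html, username, password):
    """Return the hidden fields plus the credentials as one POST payload."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    payload["ctl00$cphMain$tbUsername"] = username
    payload["ctl00$cphMain$tbPassword"] = password
    return payload


# Shortened stand-in for the real login page; the values are invented.
SAMPLE = """
<form name="aspnetForm" method="post" action="" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="def456" />
<input type="text" name="ctl00$cphMain$tbUsername" />
</form>
"""

payload = build_login_payload(SAMPLE, "username", "password")
for name in sorted(payload):
    print(name, "=", payload[name])
```

The hidden values change per page load, so they would have to be read from a fresh GET of the login page (in the same session) just before the POST.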
So. Some small questions.
Does my 'user/pass' terminology have to match the source code, i.e. username = username or user? I've lost where I found this in the HTML now, but I found 'ctl00$cphMain$tbUsername' and 'ctl00$cphMain$tbPassword'…
Do I need to send the password and/or username as a base64.encodestring? (I don't know if this is a problem, but the password contains chars such as !@$ etc.)
Do I need to add ALL of the cookie fields I've found from the browser or just the PHPSESSID? Here are the fields I've got in the cookies:
ASP.NET_SessionId, CFID, CFTOKEN, __atuvc, __utma, __utmb, __utmc, __utmt, __utmz, BRO_CALLME, BRO_ID, BRO_LOGIN, BRO_MEMBER, BROAUTH, ISFULLMEMBER, phpMBLink, __CT_Data, WRUID
There is the website (www.website.com), the login page (www.website.com/login), and then the content (www.website.com/content). Am I correct in thinking I use the cookie from the (successfully logged-in) login page and 'send' it to the content page? Should I do this manually (entering the field details from the browser's cookie information) or within the code (so, in the code below, I would use: cookies = r_login.cookies)?
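On the cookie question, a cookie jar attached to the client normally carries the login cookie to later requests without any copying by hand, and a `requests.Session` keeps cookies between requests in the same way. The sketch below (Python 3, standard library only) shows the idea against a toy local server that stands in for the real site: `/login` sets a cookie and `/content` is only served when that cookie comes back. The cookie name `SESSION`, its value, and the handler are all invented for the demonstration.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the site: POST sets a cookie, GET requires it."""

    def _send(self, status, body, extra_headers=()):
        self.send_response(status)
        for name, value in extra_headers:
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Read (and ignore) the submitted form body, then "log in".
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self._send(200, b"logged in", [("Set-Cookie", "SESSION=abc123; Path=/")])

    def do_GET(self):
        if "SESSION=abc123" in self.headers.get("Cookie", ""):
            self._send(200, b"secret content")
        else:
            self._send(403, b"not logged in")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_content_after_login():
    """Log in, then fetch the content page with the same cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # talk to localhost directly
            urllib.request.HTTPCookieProcessor(CookieJar()),
        )
        opener.open(base + "/login", data=b"user=u&pass=p", timeout=5).read()
        # The jar now holds the cookie and sends it along automatically.
        return opener.open(base + "/content", timeout=5).read()
    finally:
        server.shutdown()
        server.server_close()


print(fetch_content_after_login())
```

Cookies copied out of a browser belong to the browser's session, so pasting them into the script mixes two sessions and may well be rejected by the server.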
Finally, here's the code I'm currently using, which returns an Internal Server Error:
import requests
the_url = 'the_url'
login = the_url + '/login'
content = the_url + '/content'
username = 'username'
password = 'password'
sess = requests.Session()
sess.auth = (username, password)
# Field names as found in the login page's HTML
payload = {'ctl00$cphMain$tbUsername': username, 'ctl00$cphMain$tbPassword': password}
r_login = sess.post(login, data=payload)
# Cookie values copied by hand from a browser login
cookies = {'PHPSESSID':'5udcifi6p43ma3h1fnpfqghiu0', 'ASP.NET_SessionId':'aspnet', 'BRO_LOGIN':'bro_login'}
r_data = sess.get(content, cookies=cookies, data=payload)
print(r_data.text)
Apologies, this has gotten rather long now. If I need to split it up over several posts, please let me know. What I assumed was a simple question at the outset has mutated into something else!
Ok, with thanks to Prashant and barny for their responses, and a big thanks to Martijn Pieters via this post: Sending an ASP.net POST with Python's Requests
I've found my salvation to be robobrowser.
Here's the code:
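from robobrowser import RoboBrowser

def get_source():  # wrapper name is hypothetical; the snippet ends in 'return'
    the_url = 'the_url'
    login = the_url + '/login'
    content = the_url + '/content'
    username = 'username'
    password = 'password'
    browser = RoboBrowser(parser='lxml')
    # Assumed step (missing from the snippet as posted): load the login page
    browser.open(login)
    form = browser.get_forms()
    # You can use '.get_form()' for a specific form, but I'm finding it easier
    # to use '.get_forms()' to get all the forms, and then I'm just interested
    # in the first one:
    form = form[0]
    print(form)  # this will give you the information you need to
    # now enter your password details:
    form['the_user'].value = username
    form['the_pass'].value = password
    # Assumed step (missing from the snippet as posted): submit the login form
    browser.submit_form(form)
    # and then, because I'm after the html of certain content pages
    # (the open() call here is likewise assumed):
    browser.open(content)
    source = str(browser.parsed)
    return source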