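1. urllib, urllib2, urllib3, requests
urllib2 belongs to Python 2. Python 3 merged urllib and urllib2 into the single urllib package, so Python 3 code uses urllib directly.
urllib3 is a third-party library that provides connection pooling, client-side SSL/TLS verification, multipart-encoded file uploads, HTTP redirect handling, gzip and deflate content encoding, and HTTP and SOCKS proxy support.
requests is also a third-party library. It depends on urllib3 and wraps it in a friendlier API, which is why requests is the more common choice in practice.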
2. urlopen
from urllib import request, parse
# timeout is given in seconds
response = request.urlopen(r'http://www.baidu.com', timeout=3)
# <class 'http.client.HTTPResponse'>
print(type(response))
content = response.read()
# <class 'bytes'>
print(type(content))
print(content.decode('utf-8'))
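# Pass parameters: supplying data makes urlopen send a POST request
param = parse.urlencode({'id': '2'})
data = param.encode('utf-8')
response = request.urlopen(r'http://www.baidu.com', data=data)
The timeout argument of urlopen sets the timeout in seconds, and data carries the request body; when data is given, the request is sent as POST rather than GET.
urlencode encodes a dict of parameters into a URL query string: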
param = parse.urlencode({'id': '2', 'name': '中文'}, encoding='utf-8')
# id=2&name=%E4%B8%AD%E6%96%87
print(param)
# %E4%B8%AD%E6%96%87
print(parse.quote("中文"))
print(parse.unquote("%E4%B8%AD%E6%96%87"))
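As a small complement to the encoding calls above, the sketch below appends an encoded query string to a URL for a GET request and then parses it back with parse_qs. The endpoint is only an illustrative local address and nothing is sent over the network.

```python
from urllib import parse

# Illustrative endpoint (an assumption); no request is actually sent.
base_url = 'http://127.0.0.1:8080/test/user'

# Encode the parameters and append them as a query string (GET style).
query = parse.urlencode({'id': '2', 'name': '中文'})
full_url = base_url + '?' + query
print(full_url)

# Split the URL again and decode the query string back into a dict.
parts = parse.urlsplit(full_url)
decoded = parse.parse_qs(parts.query)
print(decoded)
```

Note that parse_qs returns a list for every key, because a query string may repeat the same parameter name.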
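3. Response
import urllib.request
# timeout is given in seconds
response = urllib.request.urlopen('http://www.baidu.com', timeout=3)
# Read the page content
print(response.read().decode('utf-8'))
# Get a single header
print(response.getheader('Content-Type'))
# Get all headers as a list of (name, value) tuples
print(response.getheaders())
# Get the HTTP protocol version (10 for HTTP/1.0, 11 for HTTP/1.1)
print(response.version)
# Get the status code
print(response.status)
# Get the debug level
print(response.debuglevel)
# Check whether the response object is closed (bool)
print(response.closed)
# Get the URL that was actually retrieved
print(response.geturl())
# Get the HTTP status code (same value as response.status)
print(response.getcode())
# Get msg
print(response.msg)
# Get the reason phrase, e.g. OK
print(response.reason)
# Get the response headers as a message object
print(response.info())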
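The attributes listed above can be tried without internet access. The following is a minimal sketch that starts a throwaway local server (HelloHandler is a hypothetical stand-in, not part of the original setup), reads the response inside a with block so that it is closed automatically, and takes the charset from the Content-Type header instead of hard-coding it. An opener with an empty ProxyHandler is used here only so that proxy environment variables do not interfere with the local request; it returns the same kind of response object as urlopen.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server that returns a short UTF-8 text body."""

    def do_GET(self):
        body = '你好, urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# An empty ProxyHandler bypasses any proxy configured in the environment.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=3) as response:
    status = response.status
    charset = response.headers.get_content_charset() or 'utf-8'
    text = response.read().decode(charset)

closed_after = response.closed  # the with block closed the response
server.shutdown()
server.server_close()
print(status, charset, text, closed_after)
```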
4. Request
from urllib import request, parse
url = 'http://127.0.0.1:8080/test/user'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
}
data = {'id': '1', 'name': 'tim'}
params = parse.urlencode(data)
byte_params = bytes(params, encoding='utf-8')
rst = request.Request(url=url, data=byte_params, headers=headers, method='POST')
rst.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
# Note: urllib does not decompress the body automatically if the server honors Accept-Encoding
rst.add_header('Accept-Encoding', 'gzip, deflate, br')
rst.add_header('Accept-Language', 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2')
response = request.urlopen(rst)
print(response.read().decode('utf-8'))
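A Request object can be inspected before it is sent, which helps when checking what urlopen would do with it. The sketch below builds a few requests against the same illustrative local URL and looks at the method and headers; the User-Agent value demo-client/1.0 is made up for the example, and nothing is sent over the network.

```python
from urllib import parse, request

url = 'http://127.0.0.1:8080/test/user'
body = parse.urlencode({'id': '1', 'name': 'tim'}).encode('utf-8')

# Without data the method defaults to GET, with data it defaults to POST.
get_req = request.Request(url)
post_req = request.Request(url, data=body, headers={'User-Agent': 'demo-client/1.0'})
# An explicit method overrides the default.
put_req = request.Request(url, data=body, method='PUT')

print(get_req.get_method(), post_req.get_method(), put_req.get_method())

# Header names are stored via str.capitalize(), so 'User-Agent' becomes 'User-agent'.
print(post_req.header_items())
print(post_req.get_header('User-agent'))
print(post_req.get_header('User-Agent'))
```

The last two lines show a detail that is easy to trip over: get_header and has_header expect the capitalized form of the name, so looking up 'User-Agent' returns None even though the header is set.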
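5. Exceptions
URLError lives in the error module of urllib and inherits from OSError. Errors raised by the request module while opening a URL can generally be handled by catching this class. URLError has a reason attribute that describes the cause of the error.
HTTPError is a subclass of URLError with three attributes: code is the HTTP status code, reason is the cause of the error, and headers holds the response headers. Because it is the more specific class, it has to be caught before URLError.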
from urllib import request,error
url = 'http://127.0.0.1:8080/test/user'
try:
    response = request.urlopen(url, timeout=1)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
    print("HTTPError:" + str(type(e)))
except error.URLError as e:
    print(e.reason)
    print("URLError:" + str(type(e)))
else:
    print('success')
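The class hierarchy described in this section can be checked without a server. In the sketch below the two exceptions are built by hand purely for illustration (in real code they come from urlopen), and describe is a hypothetical helper that mirrors the order of the except clauses above.

```python
import io
from urllib import error


def describe(exc):
    """Return a short label for an exception of the kind urlopen raises."""
    # HTTPError is tested first because every HTTPError is also a URLError.
    if isinstance(exc, error.HTTPError):
        return 'HTTPError %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URLError: %s' % exc.reason
    return 'other: %r' % (exc,)


# Hand-built exceptions for illustration only.
not_found = error.HTTPError(
    url='http://127.0.0.1:8080/test/user',
    code=404, msg='Not Found', hdrs={}, fp=io.BytesIO())
refused = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))
print(issubclass(error.URLError, OSError))
print(describe(not_found))
print(describe(refused))
```

If the two isinstance checks were swapped, the 404 would be reported as a plain URLError and its status code would be lost, which is the same reason the except clauses need this order.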
6. urllib handler processing flow
7. Cookies
7.1 Getting cookies
from http import cookiejar
from urllib import request
url = 'http://127.0.0.1:8080/test/cookie'
cookie = cookiejar.CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
for ck in cookie:
    print(ck.name + ":" + ck.value)
7.2 Saving and reusing cookies
from http import cookiejar
from urllib import request
url = 'http://127.0.0.1:8080/test/cookie'
filename = r'F:\tmp\cookies.txt'
# First request: store the cookies returned by the server in a file
# cookie = cookiejar.MozillaCookieJar(filename=filename)
cookie = cookiejar.LWPCookieJar(filename=filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
cookie.save(ignore_discard=True, ignore_expires=True)
# Second request: load the saved cookies and send them back to the server
# cookie = cookiejar.MozillaCookieJar()
cookie = cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
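The ignore_discard=True flag in the code above matters because a cookie without an expiry time, such as the pyck cookie set by the test server, is a session cookie and is skipped by save and load by default. The following sketch illustrates this offline: the cookies are built by hand, the token cookie and the make_cookie and saved_names helpers are hypothetical, and a temporary directory replaces the F:\tmp path.

```python
import os
import tempfile
import time
from http import cookiejar


def make_cookie(name, value, expires):
    """Build a cookie by hand; expires=None marks it as a session cookie."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='127.0.0.1', domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


def saved_names(jar, path, keep_session):
    """Save jar to path, load it into a fresh jar, return the cookie names."""
    jar.save(path, ignore_discard=keep_session)
    loaded = cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=keep_session)
    return sorted(ck.name for ck in loaded)


jar = cookiejar.LWPCookieJar()
jar.set_cookie(make_cookie('pyck', 'happy', expires=None))  # session cookie
jar.set_cookie(make_cookie('token', 'abc', expires=int(time.time()) + 3600))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cookies.txt')
    names_default = saved_names(jar, path, keep_session=False)
    names_all = saved_names(jar, path, keep_session=True)

print(names_default)  # only the cookie that has an expiry time
print(names_all)      # both cookies
```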
7.3 Server-side code
@RequestMapping("/cookie")
public String cookie(HttpServletRequest request,
                     HttpServletResponse response,
                     @CookieValue(value = "pyck", required = false, defaultValue = "dfck") String pyck) {
    // Print every cookie sent by the client
    Cookie[] cookies = request.getCookies();
    if (cookies != null) {
        for (Cookie cookie : cookies) {
            System.out.println(cookie.getName() + " " + cookie.getValue());
        }
    }
    // Set the pyck cookie on the response
    Cookie cookie = new Cookie("pyck", "happy");
    response.addCookie(cookie);
    System.out.println("pyck:" + pyck);
    // Returns "dfck" on the first request and "happy" once the client sends the cookie back
    return pyck;
}
8. Proxy
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
proxy = ProxyHandler({
'http': 'http://127.0.0.1:7777',
'https': 'http://127.0.0.1:8888'
})
opener = build_opener(proxy)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
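9. Auth
Auth here means HTTP Basic authentication. It is usually provided by the server itself, with users, passwords, and roles set directly in the server configuration. It is not the login flow we normally see, because applications generally implement their own login.
It is still worth understanding, though: many monitoring components do not implement their own login and registration and simply rely on the Basic authentication that the server provides, for example the Tomcat manager.
The following shows how to use HTTP Basic authentication from Python. First download Tomcat, then open tomcat-users.xml in the conf directory under the Tomcat root and add the following inside the tomcat-users node: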
<role rolename="admin-gui"/>
<role rolename="manager-gui"/>
<role rolename="manager-jmx"/>
<role rolename="manager-script"/>
<role rolename="manager-status"/>
<user username="tim" password="123456" roles="admin-gui,manager-gui,manager-jmx,manager-script,manager-status"/>
Run the startup script in Tomcat's bin directory to start the server.
from urllib.request import HTTPPasswordMgrWithDefaultRealm
from urllib.request import HTTPBasicAuthHandler
from urllib.request import build_opener
from urllib import request, error
username = 'tim'
password = '123456'
url = 'http://localhost:8080/manager/status'
pwdMg = HTTPPasswordMgrWithDefaultRealm()
pwdMg.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(pwdMg)
opener = build_opener(auth_handler)
try:
    response = opener.open(url)
    html = response.read().decode('utf8')
    print(html)
except error.URLError as e:
    print(e.reason)
# Without auth the server responds with 401
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
else:
    print('success')
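To make clearer what Basic authentication puts on the wire, the sketch below builds the Authorization header by hand for the same user and URL. It is an illustration of the header format rather than a replacement for HTTPBasicAuthHandler, and no request is sent.

```python
import base64
from urllib import request

username = 'tim'
password = '123456'
url = 'http://localhost:8080/manager/status'

# Basic auth sends "username:password", base64-encoded, in the Authorization header.
credentials = '%s:%s' % (username, password)
token = base64.b64encode(credentials.encode('utf-8')).decode('ascii')

req = request.Request(url)
req.add_header('Authorization', 'Basic ' + token)
print(req.get_header('Authorization'))

# Decoding the token shows that the credentials are only encoded, not encrypted.
print(base64.b64decode(token).decode('utf-8'))
```

Since the token can be decoded by anyone who sees it, Basic authentication is normally only appropriate over HTTPS or on a trusted local network.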