After a bumpy month of learning Python, I've finally reached regular expressions and can write something fun. Notes below:

—————————————————————————————————————————————————

1. Scraping the Douban Top 250 movie list

Environment:

pycharm-professional-2018.2.4

3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)]

Result:

(screenshots of the scraped output omitted)

Code:

 from urllib.request import urlopen
 import re

 # import ssl  # uncomment these two lines if you hit SSL certificate errors
 # ssl._create_default_https_context = ssl._create_unverified_context

 # Regex for one movie entry; re.S lets '.' match newlines too
 obj = re.compile(r'<div class="item">.*?<span class="title">(?P<name>.*?)</span>.*?导演:(?P<daoyan>.*?)&nbsp;.*?'
                  r'主演:(?P<zhuyan>.*?)<br>\n (?P<shijian>.*?)&nbsp;/&nbsp;(?P<diqu>.*?)&nbsp;'
                  r'/&nbsp;(?P<leixing>.*?)\n.*?<span class="rating_num" property="v:average">(?P<fen>.*?)</span>.*?<span>'
                  r'(?P<renshu>.*?)评价</span>.*?<span class="inq">(?P<jianping>.*?)</span>', re.S)

 # Fetch a page and decode it to a string
 def getContent(url):
     content = urlopen(url).read().decode("utf-8")
     return content

 # Match the page content; returns a generator yielding one dict per movie
 def parseContent(content):
     iiter = obj.finditer(content)
     for el in iiter:
         yield {
             "name": el.group("name"),
             "daoyan": el.group("daoyan"),      # director
             "zhuyan": el.group("zhuyan"),      # cast
             "shijian": el.group("shijian"),    # year
             "diqu": el.group("diqu"),          # region
             "leixing": el.group("leixing"),    # genre
             "fen": el.group("fen"),            # rating
             "renshu": el.group("renshu"),      # number of raters
             "jianping": el.group("jianping")   # one-line review
         }

 for i in range(10):  # the Top 250 spans 10 pages of 25 entries each
     url = "https://movie.douban.com/top250?start=%s&filter=" % (i * 25)
     print(url)
     g = parseContent(getContent(url))  # generator of matched dicts for this page
     f = open("douban_movie.txt", mode="a", encoding="utf-8")
     for el in g:
         f.write(str(el) + "\n")  # write one movie per line; note the newline
     f.close()
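The core technique here is a compiled pattern with named groups plus re.S, iterated with finditer. A minimal sketch on a toy HTML fragment (made up for illustration, not real Douban markup):

```python
import re

# A toy fragment standing in for one Douban list item (hypothetical markup)
html = '''<div class="item">
<span class="title">Movie A</span>
<span class="rating_num" property="v:average">9.7</span>
</div>'''

# re.S makes '.' also match newlines, so '.*?' can span lines;
# (?P<name>...) captures into a group retrievable by name
pattern = re.compile(
    r'<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<span class="rating_num" property="v:average">(?P<fen>.*?)</span>',
    re.S)

for m in pattern.finditer(html):
    print(m.group("name"), m.group("fen"))  # → Movie A 9.7
```

Without re.S, the non-greedy `.*?` would stop at the first newline and the pattern would fail to match.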

2. Scraping a site's latest movies and their download links

Environment:

pycharm-professional-2018.2.4

3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)]

Result:

(screenshot of the scraped output omitted)

Code:

 from urllib.request import urlopen
 import json
 import re

 # Fetch the home page (the site serves GBK-encoded pages)
 url = "https://www.dytt8.net/"
 content = urlopen(url).read().decode("gbk")

 # Regex for the links in the "latest movie downloads" list on the home page
 obj = re.compile(r'.*?最新电影下载</a>]<a href=\'(?P<url1>.*?)\'>', re.S)
 # Regex for the title (片名) and download link on each detail page
 obj1 = re.compile(r'.*?<div id="Zoom">.*?<br />◎片  名(?P<name>.*?)<br />.*?bgcolor="#fdfddf"><a href="(?P<download>.*?)">', re.S)

 def get_content(content):
     res = obj.finditer(content)
     f = open('movie_dytt.json', mode='w', encoding='utf-8')
     for el in res:
         sub_url = url + el.group("url1")  # join the relative path onto the site root
         content1 = urlopen(sub_url).read().decode("gbk")  # fetch the detail page
         lst = obj1.findall(content1)  # findall returns a list of (name, download) tuples
         name = lst[0][0]
         download = lst[0][1]
         s = json.dumps({"name": name, "download": download}, ensure_ascii=False)
         f.write(s + "\n")
         f.flush()
     f.close()

 get_content(content)  # run
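This second script leans on two standard-library behaviors: with multiple capture groups, findall returns a list of tuples rather than strings, and ensure_ascii=False makes json.dumps emit Chinese characters verbatim instead of \uXXXX escapes. A small sketch on a made-up detail-page fragment:

```python
import re
import json

# Hypothetical fragment imitating a dytt detail page (not the real markup)
text = '◎片  名 Example Movie<br /><a href="ftp://example.com/movie.mkv">download</a>'

# Two groups -> findall yields (name, download) tuples
pat = re.compile(r'◎片  名(?P<name>.*?)<br />.*?<a href="(?P<download>.*?)">', re.S)
lst = pat.findall(text)
name, download = lst[0]

# ensure_ascii=False keeps non-ASCII text readable in the output file
line = json.dumps({"name": name.strip(), "download": download}, ensure_ascii=False)
print(line)
```

If the pattern had only one group, findall would return plain strings instead, so the tuple indexing (`lst[0][0]`, `lst[0][1]`) only works with two or more groups.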