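Preface
With some spare time on my hands, I remembered that the articles I had posted on Jianshu had not been moved over yet. I was about to migrate the one comparing the speed of lxml and re, only to find the code was gone, so I rewrote it from scratch.
Results first: re is faster than lxml; the timings are listed in the results sections further down.
There is not much to agonize over in this test. The expectation was that re would beat lxml, and that both would beat beautifulsoup.
I rarely use beautifulsoup, so it was not tested this time.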
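Notes
To keep network fluctuations out of the measurement, the network request time should not be counted; here the HTML to parse is passed in as a parameter.
In addition, how the parsing expressions are written affects the result to a very large degree, so in day-to-day work the main effort should go into writing good expressions.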
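As a rough illustration of both notes, the sketch below times only the parsing step on HTML that is passed in as an argument, and compares two ways of writing the same extraction. The sample markup, the two patterns, and the `bench` helper are hypothetical stand-ins, not part of the original test; which variant wins depends on the input, so no timing is claimed here.

```python
import re
import timeit

# Hypothetical miniature input; a real test would pass the full page text instead.
SAMPLE_ROW = "<td><a href=a.aspx?id=1>first.pdf</a></td><td><a href=b.aspx?id=2>second.doc</a></td>"
SAMPLE_HTML = SAMPLE_ROW * 200

# Two ways of writing the same extraction.
LAZY = re.compile(r"<a href=([\s\S]*?)>([\s\S]*?)</a>")
NEGATED = re.compile(r"<a href=([^>]*)>([^<]*)</a>")


def bench(pattern, html, repeat=200):
    """Time only the parsing step; html is passed in, so no network time is counted."""
    return timeit.timeit(lambda: pattern.findall(html), number=repeat)


if __name__ == "__main__":
    # Both expressions must extract the same data before their speed is compared.
    assert LAZY.findall(SAMPLE_HTML) == NEGATED.findall(SAMPLE_HTML)
    for name, pattern in [("lazy", LAZY), ("negated", NEGATED)]:
        print(name, bench(pattern, SAMPLE_HTML))
```

Checking that the two expressions return identical results first matters, because a faster expression that extracts different data is not a fair comparison.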
Tip
When exporting list data, just hand it to pandas; it saves time and effort.
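A minimal sketch of that tip, assuming the parsed data is a list of tuples. The rows and column names here are made up for illustration, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io

import pandas as pd

# Hypothetical rows in the shape a parser might return: (link, title, size).
rows = [
    ("a.aspx?id=1", "first.pdf", "871.49 K"),
    ("b.aspx?id=2", "second.doc", "139.66 K"),
]

frame = pd.DataFrame(rows, columns=["link", "title", "size"])

# Writing to an in-memory buffer keeps the sketch free of side effects;
# passing a file name such as "result.csv" would write to disk instead.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` leaves out the row-number column, which is usually not wanted in an exported list.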
Code
# -*- encoding: utf-8 -*-
'''
@File : test-re-lxml.py
@Time : 2021-12-18 22:33:10, Saturday
@Author : erma0
@Version : 1.0
@Link : https://erma0.cn
@Desc : Benchmark the speed of re vs. lxml
'''
import re
import time
import pandas as pd
import requests
from lxml import etree
from itertools import zip_longest
# ahtml = requests.get('http://test.cn/').text
ahtml = '''
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>教育网盘</title>
<style type="text/css">
<!--
td {
font-size: 12px;
}
a:link {
text-decoration: none;
}
body {
font-size: 12px;
}
a:visited {
text-decoration: none;
}
a:hover {
color: #FF0000;
text-decoration: underline;
}
-->
</style>
</head>
<body>
<table width="90%" border="0" align="center" class="list">
<tr bgcolor=#BFE6FD height="20">
<td width="42" align=center>图标</td>
<td width="381" align=center>文件名</td>
<td width="98" align=center>所属用户</td>
<td width="85" align=center>大小</td>
<td width="132" align=center>更新时间</td>
</tr>
<div class=main_content_2 id=content>
<!-- Copyright(C) 2005-2010 All Rights Reserved. -->
<tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1汽包安装施工技术交底.pdf</a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>871.49 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc>水冷壁安装施工技术交底.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7.doc' title='查看文件 水冷壁安装施工技术交底.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>139.66 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc>水压试验技术交底 - 副本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 水压试验技术交底 - 副本.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>173.44 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td 
width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc>烟道安装施工技术交底 - 副本.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%bc%bc%ca%f5%bd%bb%b5%d7%2f%2f%d1%cc%b5%c0%b0%b2%d7%b0%ca%a9%b9%a4%bc%bc%ca%f5%bd%bb%b5%d7+-+%b8%b1%b1%be.doc' title='查看文件 烟道安装施工技术交底 - 副本.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>130.84 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc>(最终)山西东义干熄焦工程施工组织设计0610.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f(%d7%ee%d6%d5%a3%a9%c9%bd%ce%f7%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%a4%b3%cc%ca%a9%b9%a4%d7%e9%d6%af%c9%e8%bc%c60610.doc' title='查看文件 (最终)山西东义干熄焦工程施工组织设计0610.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>2.4 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx>东义干熄焦锅炉水压试验方案(1)10.10.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%d1%b9%ca%d4%d1%e9%b7%bd%b0%b8(1)10.10.docx' title='查看文件 东义干熄焦锅炉水压试验方案(1)10.10.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>103.62 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc>山西干熄焦锅炉水冷壁安装方案0507.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b8%c9%cf%a8%bd%b9%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%d7%b0%b7%bd%b0%b80507.doc' title='查看文件 山西干熄焦锅炉水冷壁安装方案0507.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>711.05 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx>东义汽包吊装施工方案 0710.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5%c6%fb%b0%fc%b5%f5%d7%b0%ca%a9%b9%a4%b7%bd%b0%b8+0710.docx' title='查看文件 东义汽包吊装施工方案 0710.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>989.66 K</div></td><td 
width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc>PQR044(12Cr2MoG,273×13)SMAW+GT...</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2fPQR044(12Cr2MoG%a3%ac273%a1%c113)SMAW%2bGTAW.doc' title='查看文件 PQR044(12Cr2MoG,273×13)SMAW+GTAW.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>20.24 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc>锅炉焊接施工方案0525.doc</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b9%f8%c2%af%ba%b8%bd%d3%ca%a9%b9%a4%b7%bd%b0%b80525.doc' title='查看文件 锅炉焊接施工方案0525.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>711.5 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/doc.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc>山西东义锅炉钢架安装施工方案0520.doc</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%c9%bd%ce%f7%b6%ab%d2%e5%b9%f8%c2%af%b8%d6%bc%dc%b0%b2%d7%b0%ca%a9%b9%a4%b7%bd%b0%b80520.doc' title='查看文件 山西东义锅炉钢架安装施工方案0520.doc ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>1.08 M</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx>东义230T干熄焦余热锅炉安装方案.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ab%d2%e5230T%b8%c9%cf%a8%bd%b9%d3%e0%c8%c8%b9%f8%c2%af%b0%b2%d7%b0%b7%bd%b0%b8.docx' title='查看文件 东义230T干熄焦余热锅炉安装方案.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>418.04 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx>冬季施工方案 11.11.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%bc%bc%ca%f5%2f%ca%a9%b9%a4%b7%bd%b0%b8%2f%2f%b6%ac%bc%be%ca%a9%b9%a4%b7%bd%b0%b8+11.11.docx' title='查看文件 冬季施工方案 11.11.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>47.07 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr 
bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/pdf.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f1%c6%fb%b0%fc%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.pdf>1汽包安装安全技术交底.pdf</a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>908.46 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉入口烟道安装安全技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%c8%eb%bf%da%d1%cc%b5%c0%b0%b2%d7%b0%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉入口烟道安装安全技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.87 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉钢结构安全技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%b8%d6%bd%e1%b9%b9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉钢结构安全技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.86 
K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx>吊装指挥安全交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%b5%f5%d7%b0%d6%b8%bb%d3%b0%b2%c8%ab%bd%bb%b5%d7.docx' title='查看文件 吊装指挥安全交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>13.59 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>水压试验安全技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%cb%ae%d1%b9%ca%d4%d1%e9%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 水压试验安全技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>48.24 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx>移动脚手架安全技术交底 - 10.31.docx</a><a 
href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d2%c6%b6%af%bd%c5%ca%d6%bc%dc%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7+-+10.31.docx' title='查看文件 移动脚手架安全技术交底 - 10.31.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>45.41 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> <tr bgcolor=#E8F0FF height=20><td width=25 align=center><img src=http://pan.test.com/filetype/docx.gif></td><td><a href=homeapply.aspx?down=ok&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx>余热锅炉水冷壁安全技术交底.docx</a><a href='http://pan.test.com/show.aspx?type=4&filepath=HNAZ%2f%b0%b2%c8%ab%2f%b0%b2%c8%ab%bd%bb%b5%d7%2f%2f%d3%e0%c8%c8%b9%f8%c2%af%cb%ae%c0%e4%b1%da%b0%b2%c8%ab%bc%bc%ca%f5%bd%bb%b5%d7.docx' title='查看文件 余热锅炉水冷壁安全技术交底.docx ' target=_blank><img src=http://pan.test.com/pics/edit.gif border=0></a> <font color=#999999></font></td><td width=120 align=center>HNAZ</td><td width=120> <div lign=center>50.98 K</div></td><td width=120><div align=center>2021-11-23</div></td></tr> </div>
<tr>
<td colspan="5" align=center><div align="center">总共记录数:3276页码:<a href=http://test.com/newshare.aspx?page=10 class=page>...</a> <a href=http://test.com/newshare.aspx?page=11 class=page>11</a> <a href=http://test.com/newshare.aspx?page=12 class=page>12</a> <a href=http://test.com/newshare.aspx?page=13 class=page>13</a> <a href=http://test.com/newshare.aspx?page=14 class=page>14</a> <a href=http://test.com/newshare.aspx?page=15 class=page>15</a> <span class=10>16</span> <a href=http://test.com/newshare.aspx?page=17 class=page>17</a> <a href=http://test.com/newshare.aspx?page=18 class=page>18</a> <a href=http://test.com/newshare.aspx?page=19 class=page>19</a> <a href=http://test.com/newshare.aspx?page=20 class=page>20</a> <a href=http://test.com/newshare.aspx?page=21 class=page>...</a><a href=http://test.com/newshare.aspx?page=1>第一页</a><a href=http://test.com/newshare.aspx?page=15>上一页</a><a href=http://test.com/newshare.aspx?page=17>下一页</a><a href=http://test.com/newshare.aspx?page=164>最末页</a>第16页/共164页 </div></td>
</tr>
</table>
</body>
</html>
'''
def get_lxml(html):
    # Parse the page once, then pull each column out with its own XPath query.
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    mtime = d.xpath('//tr/td[5]/div/text()')  # not named `time`, to avoid shadowing the module
    # The columns are queried independently, so pad the shorter ones with ''.
    datas = list(zip_longest(link, title, passwd, user, size, mtime, fillvalue=''))
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-lxml.csv')
    # print(len(datas))
    return datas


def get_re(html):
    # `rep` is the module-level pattern defined under __main__ below.
    # datas = r.findall(html)
    datas = re.findall(rep, html, re.S)
    # print(len(datas))
    datas = pd.DataFrame(datas, columns=['link', 'title', 'passwd', 'user', 'size', 'time'])
    datas['link'] = datas['link'].str.strip()
    datas.to_csv('result-re.csv')
    return datas


if __name__ == '__main__':
    rep = r"<td><a href=([\s\S]*?)>([\s\S]*?)</a>[\s\S]*?<font color=#999999>([\s\S]*?)</font>[\s\S]*?center>([\s\S]*?)</td><td width=120> <div lign=center>([\s\S]*?)</div></td><td width=120><div align=center>([\s\S]*?)</div>"
    r = re.compile(rep, re.S)
    for name, function in [('lxml', get_lxml), ('re', get_re)]:
        start = time.time()
        # Run each parser 500 times on the same in-memory HTML.
        for i in range(500):
            function(ahtml)
        end = time.time()
        print(name, end - start)
Results
lxml 2.1219968795776367
re 1.341965675354004
re finished in nearly 40% less time than lxml (about 37%: 1.34 s versus 2.12 s).
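The percentage can be checked with a few lines of arithmetic. The helper names below are hypothetical; the two timings are the ones printed above.

```python
def time_saved_percent(slow, fast):
    """Share of the slower run's time that the faster run saves, in percent."""
    return (slow - fast) / slow * 100


def speedup_factor(slow, fast):
    """How many times as fast the faster run is."""
    return slow / fast


# Timings reported above for 500 runs, in seconds.
lxml_seconds = 2.1219968795776367
re_seconds = 1.341965675354004

print(round(time_saved_percent(lxml_seconds, re_seconds), 1))
print(round(speedup_factor(lxml_seconds, re_seconds), 2))
```

"Time saved" and "times as fast" are different measures of the same gap, so it helps to state which one a percentage refers to.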
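Testing pure parsing speed
The code above also counts the time pandas spends processing the data, so comment that part out and test again. The code is as follows: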
def get_lxml(html):
    d = etree.HTML(html)
    link = d.xpath('//tr/td[2]/a[1]/@href')
    title = d.xpath('//tr/td[2]/a[1]/text()')
    passwd = d.xpath('//tr/td[2]/font/text()')
    user = d.xpath('//tr/td[3][@width="120"]/text()')
    size = d.xpath('//tr/td[4]/div/text()')
    mtime = d.xpath('//tr/td[5]/div/text()')  # not named `time`, to avoid shadowing the module


def get_re(html):
    # datas = r.findall(html)
    datas = re.findall(rep, html, re.S)
Results 2
lxml 0.6280360221862793
re 0.15498137474060059
For parsing and extraction alone, re is about 4 times as fast as lxml, i.e. roughly 300% faster!
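Putting the two sets of timings side by side gives a rough idea of how much of the first test was pandas work rather than parsing. This is only an estimate: it assumes the difference between the two runs is entirely the DataFrame and CSV step, and the runs were separate, so some noise is included.

```python
# Timings reported in this post for 500 runs, in seconds.
full = {"lxml": 2.1219968795776367, "re": 1.341965675354004}
parse_only = {"lxml": 0.6280360221862793, "re": 0.15498137474060059}

for name in full:
    # Assumption: everything beyond the parse-only time is pandas/CSV overhead.
    overhead = full[name] - parse_only[name]
    share = overhead / full[name] * 100
    print(f"{name}: estimated pandas/CSV overhead {overhead:.2f} s ({share:.0f}% of the full run)")

ratio = parse_only["lxml"] / parse_only["re"]
print(f"parse-only speed ratio: {ratio:.2f}x")
```

Under that assumption, most of the time in the first test went to pandas for both parsers, which is why the gap between re and lxml looks much smaller there.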