Problem description
I have searched a lot about BeautifulSoup, and some people suggest lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table out of a whole list of tables on the webpage.
I am interested in the three columns; the number of rows varies depending on the page and the time it was checked. A BeautifulSoup and lxml solution would be much appreciated. That way I can ask the admin to install lxml on the dev machine.
Desired output:
Website Last Visited Last Loaded
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
...... more if present
The following is a code sample from a messy web page:
<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"</a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
<a href="http://stackoverflow.com"</a>
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>
Here's a version that uses HTMLParser. I tested it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.
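# Python 3; on Python 2 the import is: from HTMLParser import HTMLParser
from html.parser import HTMLParser


class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.line = ""          # text collected for the current row
        self.in_tr = False      # True while inside a row of the target table
        self.in_table = False   # True once the "Website" header cell is seen

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == "tr":
            self.line = ""
            self.in_tr = True
        # Only collect links that sit inside a row of the target table.
        if self.in_tr and tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.line += value + " "

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_tr = False
            if self.line:
                print(self.line)
                self.line = ""  # avoid re-printing the row at a later </tr>
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        # The "Website" header cell marks the table we are interested in.
        if data == "Website":
            self.in_table = True
        elif self.in_tr:
            data = data.strip()
            if data:
                self.line += data + " "


if __name__ == "__main__":
    myp = MyParser()
    with open("table.html") as f:
        myp.feed(f.read())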
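The parser above prints each row as one string, so an empty "Last Loaded" cell simply disappears. If the three columns need to stay separate, a variant along the following lines may help. It is only a sketch, written for Python 3 (`html.parser`); the names `TableRows`, `parse_rows` and `SAMPLE` are made up for this illustration, and it assumes, like the code above, that the table of interest is the one whose header cell reads "Website".

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cells of each data row in the table headed by "Website"."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, one tuple of cell strings each
        self.in_table = False   # True after the "Website" header cell is seen
        self.row = None         # cells of the row being read, or None
        self.cell = None        # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if not self.in_table:
            return
        if tag == "tr":
            self.row = []
        elif tag == "td" and self.row is not None:
            self.cell = []
        elif tag == "a" and self.cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.cell.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.row.append(" ".join(self.cell))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            if any(self.row):   # skip rows whose cells are all empty
                self.rows.append(tuple(self.row))
            self.row = None
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        text = data.strip()
        if text == "Website":
            self.in_table = True
        elif text and self.cell is not None:
            self.cell.append(text)


def parse_rows(html):
    """Return the data rows of the "Website" table as tuples of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# A cut-down copy of the markup from the question, malformed anchor included.
SAMPLE = """<table border="2" width="100%">
<tbody><tr>
<td class="BoldTD">Website</td>
<td class="BoldTD">Last Visited</td>
<td class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"</a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

Each row comes back as a tuple with one entry per `<td>`, with an empty string for an empty cell, so the caller decides how to print or store it.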
Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
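The desired output in the question also has a header line. If the rows are available as tuples of three strings, something like the following could print them in aligned columns. This is a minimal sketch: `format_rows` is a hypothetical helper, the default header names are taken from the question, and every row is assumed to have exactly as many cells as there are headers.

```python
def format_rows(rows, headers=("Website", "Last Visited", "Last Loaded")):
    """Return the rows as left-aligned, space-padded columns under a header line."""
    table = [tuple(headers)] + [tuple(row) for row in rows]
    # Each column is as wide as its longest entry, header included.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = []
    for row in table:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Trailing padding is dropped so empty last cells leave no stray spaces.
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        ("http://google.com", "01/14/2011", ""),
        ("http://stackoverflow.com", "01/10/2011", ""),
    ]
    print(format_rows(example))
```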