Problem description
I want to extract certain information from an HTML document. For example, it contains a table (among other tables with other contents) like this:
<table class="details">
<tr>
<th>Advisory:</th>
<td>RHBA-2013:0947-1</td>
</tr>
<tr>
<th>Type:</th>
<td>Bug Fix Advisory</td>
</tr>
<tr>
<th>Severity:</th>
<td>N/A</td>
</tr>
<tr>
<th>Issued on:</th>
<td>2013-06-13</td>
</tr>
<tr>
<th>Last updated on:</th>
<td>2013-06-13</td>
</tr>
<tr>
<th valign="top">Affected Products:</th>
<td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
</tr>
</table>
I want to extract information such as the date under "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:
from bs4 import BeautifulSoup
soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
table_tag = soup.table
if table_tag['class'] == ['details']:
    print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
    a = table_tag.next_sibling
    print unicode(a)
    print table_tag.contents
This gets me the contents of the first table row, and also a listing of the contents. But the next-sibling approach is not working right; I guess I am just using it wrong. Of course I could parse the contents list myself, but it seems to me that Beautiful Soup was designed to save us from doing exactly that (if I start parsing myself, I might as well parse the whole document ...).

If someone could enlighten me on how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.
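A likely reason the next_sibling attempt misbehaves: table_tag.next_sibling is the node that follows the whole table, not the next row inside it, and in most documents that node is just whitespace. The rows are children of the table, so stepping from one row to the next has to start from a tr. The following is a minimal sketch of that idea, assuming Python 3 and the built-in "html.parser" backend; the shortened table and the trailing paragraph are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the table from the question.
html = """
<table class="details">
<tr>
<th>Advisory:</th>
<td>RHBA-2013:0947-1</td>
</tr>
<tr>
<th>Issued on:</th>
<td>2013-06-13</td>
</tr>
</table>
<p>after the table</p>
"""

soup = BeautifulSoup(html, "html.parser")
table_tag = soup.find("table", class_="details")

# The sibling of the table is whatever follows </table>,
# here the newline text node before <p>, not the second row.
after_table = table_tag.next_sibling

# Rows are children of the table. find_next_sibling("tr") skips the
# whitespace text nodes that sit between the <tr> elements.
first_row = table_tag.tr
second_row = first_row.find_next_sibling("tr")
second_label = second_row.th.get_text()
second_value = second_row.td.get_text()

print(repr(after_table), second_label, second_value)
```

Plain next_sibling on a row would hit the same whitespace problem, which is why the sketch uses find_next_sibling with a tag name.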
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
>>> table = soup.find('table', {'class': 'details'})
>>> th = table.find('th', text='Issued on:')
>>> th
<th>Issued on:</th>
>>> td = th.findNext('td')
>>> td
<td>2013-06-13</td>
>>> td.text
u'2013-06-13'
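The session above is Python 2 (note the u'' prefix), and findNext and text= are older spellings; recent bs4 releases prefer find_next and string=. The same lookup can be sketched for Python 3 as follows, again assuming the built-in "html.parser" backend. The helper name details_as_dict is hypothetical, not part of bs4; it collects every label/value pair of the table so that any field can be read by name.

```python
from bs4 import BeautifulSoup

html = """
<table class="details">
<tr><th>Advisory:</th><td>RHBA-2013:0947-1</td></tr>
<tr><th>Type:</th><td>Bug Fix Advisory</td></tr>
<tr><th>Severity:</th><td>N/A</td></tr>
<tr><th>Issued on:</th><td>2013-06-13</td></tr>
<tr><th>Last updated on:</th><td>2013-06-13</td></tr>
<tr><th valign="top">Affected Products:</th>
<td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td></tr>
</table>
"""


def details_as_dict(html_doc):
    """Hypothetical helper: map each <th> label to its <td> text."""
    soup = BeautifulSoup(html_doc, "html.parser")
    table = soup.find("table", class_="details")
    if table is None:
        return {}
    details = {}
    for row in table.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        if th is not None and td is not None:
            label = th.get_text(strip=True).rstrip(":")
            details[label] = td.get_text(strip=True)
    return details


# Direct lookup, mirroring the session above with the newer spellings.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="details")
issued = table.find("th", string="Issued on:").find_next("td").get_text()

details = details_as_dict(html)
print(issued, details["Issued on"])
```

The direct lookup is enough for a single field; the dictionary is more convenient when several fields from the same table are needed.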