Question
I'd like to detect the header of an HTML table when that table does not have a <thead> element. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with Python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.
Currently, parse_thead does the following when the <thead> tag is present:
- In BeautifulSoup, I get table objects with doc.find_all('table'), and I can use table.find_all('thead').
- In lxml, I get table objects with doc.xpath() on an xpath_expr of //table, and I can use table.xpath('.//thead').
parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.
I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead, all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".
I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either
- Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
- Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)
I don't know how to do either of those things, and I'd appreciate advice on both which alternative is more sensible and how I might go about it.
(There are examples with no header rows and with multiple header rows, so I can't assume a table has only one header row.)
<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>
Recommended answer
We can use <th> tags to detect headers in case the table doesn't contain <thead> tags. If all cells of a row are <th> tags, then we can assume that the row is a header. Based on that, I created a function that identifies the header and body rows.
Code for BeautifulSoup:
    def parse_table(table):
        head_body = {'head': [], 'body': []}
        for tr in table.select('tr'):
            # A row counts as a header row when every direct child is a <th>.
            if all(t.name == 'th' for t in tr.find_all(recursive=False)):
                head_body['head'] += [tr]
            else:
                head_body['body'] += [tr]
        return head_body
Code for lxml:
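    def parse_table(table):
        head_body = {'head': [], 'body': []}
        # cssselect() needs the separate cssselect package to be installed.
        for tr in table.cssselect('tr'):
            # Iterating an element yields its children (getchildren() is deprecated).
            if all(t.tag == 'th' for t in tr):
                head_body['head'] += [tr]
            else:
                head_body['body'] += [tr]
        return head_body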
The table parameter is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary that contains two lists of <tr> tags: the header rows and the body rows.
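Note that parse_table puts every <th>-only row into 'head', even one that appears after body rows, and that all() is also true for a row with no cells. If the header should be only the leading run of <th>-only rows, as the question proposes, something like the following sketch might work. The name split_leading_header is hypothetical. To keep the example runnable without third-party packages, it is demonstrated with the standard library's xml.etree.ElementTree on a small well-formed snippet; lxml elements expose the same .tag, iteration, and iter() interface, so the function should accept them too, but that is an assumption to verify against real pages.

```python
import xml.etree.ElementTree as ET


def split_leading_header(table):
    """Return (head, body); head is the leading run of th-only rows."""
    rows = list(table.iter('tr'))
    head = []
    for tr in rows:
        cells = list(tr)  # direct child elements of the row
        # An empty row is not treated as a header row.
        if cells and all(cell.tag == 'th' for cell in cells):
            head.append(tr)
        else:
            break  # first non-header row ends the header
    return head, rows[len(head):]


# Hypothetical sample: two header rows, one body row, one th-only totals row.
markup = (
    '<table>'
    '<tr><th>Rank</th><th>Score</th></tr>'
    '<tr><th>#</th><th>runs</th></tr>'
    '<tr><td>1</td><td>437</td></tr>'
    '<tr><th>Total</th><th>437</th></tr>'
    '</table>'
)
table = ET.fromstring(markup)
head, body = split_leading_header(table)
print(len(head), len(body))  # 2 2
```

The trailing totals row stays in the body because the scan stops at the first row that is not made up entirely of <th> cells. A table with no header rows simply yields an empty head list.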
Usage example (BeautifulSoup):
    from bs4 import BeautifulSoup

    html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    table_rows = parse_table(table)
    print(table_rows)
    # {'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
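The first alternative from the question, inserting thead and tbody elements so that existing <thead>-based code keeps working, could be sketched roughly as below. This is an illustration rather than a tested solution for real Wikipedia pages: wrap_sections is a hypothetical name, the rows are assumed to be direct children of the table, and the demo uses the standard library's xml.etree.ElementTree on a well-formed snippet. With lxml the new elements would be created with lxml.etree.Element instead of ET.Element.

```python
import xml.etree.ElementTree as ET


def wrap_sections(table):
    """Move the table's direct <tr> children into new <thead>/<tbody> elements."""
    rows = [child for child in table if child.tag == 'tr']
    thead, tbody = ET.Element('thead'), ET.Element('tbody')
    in_head = True
    for tr in rows:
        cells = list(tr)
        # Stay in the header only while rows are non-empty and th-only.
        in_head = in_head and bool(cells) and all(c.tag == 'th' for c in cells)
        table.remove(tr)
        (thead if in_head else tbody).append(tr)
    if len(thead):  # add <thead> only when at least one header row was found
        table.append(thead)
    table.append(tbody)
    return table


table = ET.fromstring(
    '<table class="wikitable">'
    '<tr><th>Rank</th><th>Score</th></tr>'
    '<tr><td>1</td><td>437</td></tr>'
    '</table>'
)
wrap_sections(table)
print([child.tag for child in table])  # ['thead', 'tbody']
```

After this step, a lookup such as table.find('thead') returns the wrapped header rows, so code written for tables that already have a <thead> would not need to change.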