HTML table to pandas table: info inside HTML tags

Question

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?

Recommended answer

Since this parsing job requires the extraction of both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.
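As a side note, pandas 1.5 and later accept an extract_links argument to pd.read_html, which gets closer to an out-of-the-box route: each selected cell comes back as a (text, href) tuple. The unique ID still has to be cut out of the href by hand. A minimal sketch, assuming pandas 1.5 or newer with lxml installed, and reusing the sample table from the question:

```python
from io import StringIO

import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells that contain no link.
df = pd.read_html(StringIO(content), extract_links="body")[0]

# Pull the href out of the tuples in the name column (column label 1).
hrefs = df[1].map(lambda cell: cell[1])
df['refname'] = hrefs.str.extract(r'([^./]+)[.]', expand=False)

# Unpack the remaining tuples so the original columns hold plain text again.
for col in (0, 1, 2):
    df[col] = df[col].map(lambda cell: cell[0])

print(df)
```

On older pandas versions this argument is not available, so the approach that follows still applies.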

Using lxml, you could extract the attribute values with XPath:

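from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# Recent pandas versions expect a file-like object rather than a literal HTML string
for df in pd.read_html(StringIO(content)):
    # one href per row; this assumes every row contains exactly one link
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep the part of the file name before the extension, e.g. jones03
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)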

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01

The above might be useful since it requires only a few extra lines of code to add the refname column.
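The pattern ([^./]+)[.] captures the first run of characters containing neither "." nor "/" that is immediately followed by a dot. That suits relative links like the ones in this table, but it would pick up "www" from an absolute URL. As a rough sketch of a stricter alternative, the helper below takes the last path component and drops its extension using only the standard library. The name uid_from_href and the example.com URL are made up for illustration; neither comes from pandas, lxml, or the original page.

```python
import posixpath
import re
from urllib.parse import urlparse


def uid_from_href(href):
    """Return the file name of the link target without its extension."""
    path = urlparse(href).path            # drop scheme, host, query string
    filename = posixpath.basename(path)   # e.g. 'jones03.shtml'
    return posixpath.splitext(filename)[0]


hrefs = ["/j/jones03.shtml", "/s/smith01.shtml"]
pattern = re.compile(r"([^./]+)[.]")

# On the relative links from the table, both approaches agree.
regex_ids = [pattern.search(h).group(1) for h in hrefs]
helper_ids = [uid_from_href(h) for h in hrefs]

# On a (hypothetical) absolute URL they differ.
absolute = "http://www.example.com/j/jones03.shtml"
regex_on_absolute = pattern.search(absolute).group(1)
helper_on_absolute = uid_from_href(absolute)

print(regex_ids, helper_ids, regex_on_absolute, helper_on_absolute)
```

If the links on the real page turn out to be absolute, the helper could be applied with df['refname'].map(uid_from_href) in place of the str.extract call.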

But both LH.fromstring and pd.read_html parse the HTML. So its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
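Both versions assign the result of a single table-wide XPath query to the new column, which only lines up if every row contains exactly one link. The question also mentions that an output like "jones03 Jones" would be even more helpful. A possible per-row variant covering both points is sketched below. The third row (267, Brown, 41) is invented here to show a row without a link, and the link_text and href column names are likewise illustrative.

```python
import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
<tr>
<td>267</td>
<td>Brown</td>
<td>41</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
rows = []
for tr in table.xpath('//tr'):
    # visible text of every cell in this row, tags stripped
    record = [td.text_content().strip() for td in tr.xpath('td')]
    # first link in this row, if there is one
    links = tr.xpath('td/a')
    if links:
        link_text = links[0].text_content().strip()
        href = links[0].get('href')
    else:
        link_text, href = None, None
    rows.append(record + [link_text, href])

df = pd.DataFrame(rows, columns=['id', 'name', 'val', 'link_text', 'href'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# rows without a link end up with a missing value in refname
df['refname'] = df['href'].str.extract(r'([^./]+)[.]', expand=False)
print(df)
```

Because the link is looked up inside each row, a row without a link simply gets a missing refname instead of shifting the IDs of the rows after it.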
