This article covers how to use read_html from Python's pandas to read an HTML table that has multiple tbody elements.
Problem description
This is my HTML:
import pandas as pd
html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''
If I run df = pd.read_html(html_table) and then print(df[0]), I get:
Col1 Col2
0 1a 2a
The second row (1b, 2b) disappears. Why? How can I prevent it?
Recommended answer
The multiple tbody elements are what confuse the pandas parser logic here: only the rows of the first tbody are picked up. (The HTML specification does allow several tbody elements in one table, so the markup is not strictly invalid.) If you cannot fix the input HTML itself, you have to pre-parse it and "unwrap" all the tbody elements:
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

html_table = '''
<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''

# fix HTML: drop every <tbody> wrapper but keep the rows inside it
soup = BeautifulSoup(html_table, "html.parser")
for body in soup("tbody"):
    body.unwrap()

# wrap the markup in StringIO, since newer pandas versions deprecate literal HTML strings
df = pd.read_html(StringIO(str(soup)), flavor="bs4")
print(df[0])
This prints:
Col1 Col2
0 1a 2a
1 1b 2b
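To check that both rows really are in the markup, whichever parser pandas uses, the rows can also be collected with the standard library alone. This is only a sketch for simple tables like the one above: TableRowCollector is a hypothetical helper name, not part of pandas or BeautifulSoup, and it assumes one header row and plain-text cells.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect header and body cells of a simple table, ignoring tbody grouping.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.header = []          # text of every <th> cell
        self.rows = []            # one list of <td> texts per <tr>
        self._cell_tag = None     # "th" or "td" while inside a cell
        self._current_row = None  # cells of the <tr> being read

    def _flush_row(self):
        # Store the row being read, if it holds any <td> text.
        if self._current_row:
            self.rows.append(self._current_row)
        self._current_row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()     # tolerate a previous <tr> left unclosed
            self._current_row = []
        elif tag in ("th", "td"):
            self._cell_tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._cell_tag is None:
            return                # whitespace or text outside a cell
        if self._cell_tag == "th":
            self.header.append(text)
        elif self._current_row is not None:
            self._current_row.append(text)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell_tag = None
        elif tag == "tr":
            self._flush_row()


html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''

collector = TableRowCollector()
collector.feed(html_table)
collector.close()

print(collector.header)  # ['Col1', 'Col2']
print(collector.rows)    # [['1a', '2a'], ['1b', '2b']]
```

If pandas is installed, pd.DataFrame(collector.rows, columns=collector.header) should give the same two-row frame. The BeautifulSoup approach above remains the better fit for real pages, since read_html also handles things like colspan and type inference.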