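I'm working in Python 3. I convert an HTML table into a JSON object, but the code doesn't iterate over the whole table; it only outputs the first row.
Here is my code: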
html_source= """<div><table cellspacing="0" cellpadding="4"
rules="all" border="2" id="ctl00_ContentPlaceHolder1_GridView1"
style="background-color:White;border-color:#3366CC;border-
width:2px;border-style:Solid;font-size:Medium;font-weight:bold;border-
collapse:collapse;">
<tr style="color:#CCCCFF;background-color:#003399;font-weight:bold;">
<th scope="col">AC NO</th><th scope="col">PART NO</th><th
scope="col">SR NO</th><th scope="col">Voter Name</th><th
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th
scope="col">AGE</th><th scope="col"> </th><th scope="col">
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>294</td><td>name 1</td><td>UVP7645302</td>
<td>M</td><td>28</td><td><input type="button" value="Polling Station
Address"onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','View Details$0')" style="width:150px;" /></td><td><input type="button" value="Family" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1&
#39;,'Family$0')" /></td>
</tr><th scope="col">AC NO</th><th scope="col">PART NO</th><th
scope="col">SR NO</th><th scope="col">Voter Name</th><th
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th
scope="col">AGE</th><th scope="col"> </th><th scope="col">
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>295</td><td>name 2</td><td>UVP7645302</td>
<td>M</td><td>28</td><td><input type="button" value="Polling Station>"""
from bs4 import BeautifulSoup
import json

soup = BeautifulSoup(html_source, 'html.parser')
for table in soup.find_all('table'):
    # Every header cell and every data cell of the table, flattened into two lists
    keys = [th.get_text(strip=True) for th in table.find_all('th')]
    values = [td.get_text(strip=True) for td in table.find_all('td')]
    d = dict(zip(keys, values))
    #print(d)
    mydict = (json.dumps(d))
    # Drop the entries whose value is empty (the button columns)
    empty = {k: v for k, v in d.items() if not v}
    for k in empty:
        del d[k]
    print(json.dumps(d, ensure_ascii=False))
My expected output:
{"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 1", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"},
{"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 2", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"}
Actual output:
{"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 1", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"}
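For context, the single-row result can be reproduced without any HTML. The sketch below assumes the header cells are collected once while the data cells of every row end up in one flat list; the column subset and values are illustrative, not taken from the real page. Because zip stops at the shorter sequence and a dict holds one value per key, a single dict(zip(keys, values)) can never hold more than one record. Slicing the flat list into chunks of header width is one possible way to keep each row.

```python
# Flattened lists, roughly what find_all('th') and find_all('td') return
# when the header appears once (illustrative subset of the columns).
keys = ["AC NO", "SR NO", "Voter Name"]
values = ["211", "294", "name 1", "211", "295", "name 2"]

# zip stops after len(keys) pairs, so only the first row survives.
flat = dict(zip(keys, values))

# Slice the flat value list into chunks of header width: one dict per row.
width = len(keys)
rows = [dict(zip(keys, values[i:i + width])) for i in range(0, len(values), width)]

print(flat)
print(rows)
```

If the header cells were repeated in the flat key list, later rows would overwrite earlier ones instead, but the result would still be a single record.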
Best answer
Use the pandas library:
from bs4 import BeautifulSoup
import pandas as pd
html_source= """<div><table cellspacing="0" cellpadding="4"
rules="all" border="2" id="ctl00_ContentPlaceHolder1_GridView1"
style="background-color:White;border-color:#3366CC;border-
width:2px;border-style:Solid;font-size:Medium;font-weight:bold;border-
collapse:collapse;">
<tr style="color:#CCCCFF;background-color:#003399;font-weight:bold;">
<th scope="col">AC NO</th><th scope="col">PART NO</th><th
scope="col">SR NO</th><th scope="col">Voter Name</th><th
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th
scope="col">AGE</th><th scope="col"> </th><th scope="col">
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>294</td><td>name 1</td><td>UVP7645302</td>
<td>M</td><td>28</td><td><input type="button" value="Polling Station
Address"onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','View Details$0')" style="width:150px;" /></td><td><input type="button" value="Family" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1&
#39;,'Family$0')" /></td>
</tr><th scope="col">AC NO</th><th scope="col">PART NO</th><th
scope="col">SR NO</th><th scope="col">Voter Name</th><th
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th
scope="col">AGE</th><th scope="col"> </th><th scope="col">
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>295</td><td>name 2</td><td>UVP7645302</td>
<td>M</td><td>28</td><td><input type="button" value="Polling Station>"""
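# Newer pandas releases deprecate passing a literal HTML string to read_html,
# so wrap it in a file-like object. read_html returns one DataFrame per table.
from io import StringIO
table = pd.read_html(StringIO(html_source))[0]
# One dict per table row, keyed by column header
print(table.to_dict('records'))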
Output:
[{'AC NO': 211, 'PART NO': 396, 'SR NO': 294, 'Voter Name': 'name 1', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28, 'Unnamed: 7': nan, 'Unnamed: 8': nan}, {'AC NO': 211, 'PART NO': 396, 'SR NO': 295, 'Voter Name': 'name 2', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28, 'Unnamed: 7': nan, 'Unnamed: 8': nan}]
To drop the Unnamed columns from the dictionaries, add this line before the print(table.to_dict('records')) statement:
table = table.loc[:, ~table.columns.str.startswith('Unnamed')]
Output:
[{'AC NO': 211, 'PART NO': 396, 'SR NO': 294, 'Voter Name': 'name 1', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28}, {'AC NO': 211, 'PART NO': 396, 'SR NO': 295, 'Voter Name': 'name 2', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28}]
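If adding pandas is not an option, a similar result can be sketched with BeautifulSoup alone by building one dict per tr element instead of one dict per table. The sample table below is a trimmed, hypothetical stand-in for the one in the question (fewer columns, well-formed markup), and the names headers and records are only illustrative. Note that the values stay strings here, as in the expected output of the question, whereas pandas converts numeric columns to integers.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the table from the question.
html = """
<table>
  <tr><th>AC NO</th><th>SR NO</th><th>Voter Name</th><th> </th></tr>
  <tr><td>211</td><td>294</td><td>name 1</td><td><input type="button" value="Family" /></td></tr>
  <tr><td>211</td><td>295</td><td>name 2</td><td><input type="button" value="Family" /></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Header texts, collected once for the whole table.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if not cells:
        continue  # header row: it has th cells but no td cells
    # Pair headers with this row's cells; skip empty headers and empty values.
    records.append({k: v for k, v in zip(headers, cells) if k and v})

print(json.dumps(records, ensure_ascii=False))
```

Skipping empty keys and values plays the same role as the Unnamed filter above: the button columns have neither a header text nor a cell text, so they drop out of each record.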