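I am using BeautifulSoup to scrape a table from a web page, but for some reason it only scrapes half of the table. The half I get is the part that does not contain the input fields. Here is the HTML: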
<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>
Here is my code:
url = driver.page_source
soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
for item in list_of_rows:
    print(' '.join(item))
What am I doing wrong? Why is only the left side of the table printed? Any suggestions on what to change would be greatly appreciated.
Results:
Portfolio Allocation (%)
AdvisorGuided (Capital Portfolio)
100 100
AdvisorGuided 2 (Capital Portfolio)
0 100
Client Directed (Capital Portfolio)
0 100
Holding MMKT (Capital Portfolio)
0 100
Total
100 100
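Best answer
You have to go further into the child and sibling nodes and pull out the attributes. The numbers live in the value and maxvalue attributes of the <input> tags, not in the text content of the cells, so cell.text never sees them.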
import pandas as pd
import bs4
html = '''<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>'''
soup = bs4.BeautifulSoup(html, "lxml")
table = soup.find('table', id="portAllocTable")
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th","td"]):
        text = cell.text
        try:
            # The hidden <input> comes first; the visible text box is its next sibling.
            val = cell.find('input')['value']
            max_val = cell.find('input').next_sibling['maxvalue']
            list_of_cells.append(val)
            list_of_cells.append(max_val)
        except (TypeError, KeyError):
            # No <input> in this cell, or the attribute is missing: keep only the text.
            pass
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
for item in list_of_rows:
    print(' '.join(item))
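As a variation on the loop above, here is a minimal sketch that checks for the visible text box explicitly instead of relying on try/except, and strips the surrounding whitespace from the labels. The trimmed SAMPLE_HTML, the extract_rows helper, and the choice of the built-in "html.parser" are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the table from the question.
SAMPLE_HTML = """
<table id="portAllocTable">
  <tr><th colspan="2"><span> Portfolio Allocation (%) </span></th></tr>
  <tr>
    <td><span>AdvisorGuided (Capital Portfolio)</span></td>
    <td><span>
      <input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100">
      <input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" value="100" maxvalue="100">
    </span></td>
  </tr>
  <tr>
    <td><span>Total</span></td>
    <td><span>
      <input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100">
      <input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" value="100" maxvalue="100">
    </span></td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list per <tr>: the cell text, or the text box's value attribute."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="portAllocTable")
    rows = []
    for tr in table.find_all("tr"):
        cells = []
        for cell in tr.find_all(["th", "td"]):
            # Look for the visible text box rather than the hidden input.
            text_input = cell.find("input", attrs={"type": "text"})
            if text_input is not None:
                # The number is stored in an attribute, so .text would miss it.
                cells.append(text_input.get("value", ""))
            else:
                cells.append(cell.get_text(strip=True))
        rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(" | ".join(row))
```

Checking for None keeps real errors visible, whereas a broad except would silently hide them.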
To build a table, you can do something like the following. You will need to do some cleanup, but it should get you started:
# Collect one record per cell and build the DataFrame once at the end.
# (DataFrame.append was removed in pandas 2.0.)
records = []
for row in table.find_all('tr'):
    for cell in row.find_all(["th","td"]):
        text = cell.text
        try:
            val = cell.find('input')['value']
            max_val = cell.find('input').next_sibling['maxvalue']
        except (TypeError, KeyError):
            # No <input> in this cell: leave the value columns empty.
            val = ''
            max_val = ''
        records.append([text, val, max_val])
results = pd.DataFrame(records, columns=['text','value','maxvalue'])
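The cleanup mentioned above could look roughly like the following sketch, which produces one row per portfolio instead of one row per cell. The trimmed SAMPLE_HTML, the column name portfolio, and the conversion to integers are assumptions for illustration; a page where the text box can be blank would need extra handling before calling int().

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the table from the question.
SAMPLE_HTML = """
<table id="portAllocTable">
  <tr><th colspan="2"><span> Portfolio Allocation (%) </span></th></tr>
  <tr><td colspan="2"><img src="spacer.gif" height="1"></td></tr>
  <tr>
    <td><span>AdvisorGuided (Capital Portfolio)</span></td>
    <td><span><input type="hidden" value="100"><input type="text" value="100" maxvalue="100"></span></td>
  </tr>
  <tr>
    <td><span>AdvisorGuided 2 (Capital Portfolio)</span></td>
    <td><span><input type="hidden" value="0"><input type="text" value="0" maxvalue="100"></span></td>
  </tr>
  <tr>
    <td><span>Total</span></td>
    <td><span><input type="hidden" value="100"><input type="text" value="100" maxvalue="100"></span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="portAllocTable")

records = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) != 2:
        continue  # skip the header row and the spacer row
    text_input = cells[1].find("input", attrs={"type": "text"})
    if text_input is None:
        continue  # no text box in the second cell, nothing to record
    records.append({
        "portfolio": cells[0].get_text(strip=True),
        "value": int(text_input["value"]),
        "maxvalue": int(text_input["maxvalue"]),
    })

clean = pd.DataFrame(records, columns=["portfolio", "value", "maxvalue"])
print(clean)
```

Pairing the label cell with the input cell in the same <tr> avoids the blank text rows that the per-cell version produces.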
This question, "python - BeautifulSoup only scraping half of my table?", comes from Stack Overflow: https://stackoverflow.com/questions/56807154/