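I'm using BeautifulSoup to scrape a table from a web page, but for some reason it only scrapes half of the table. The half I get is the part that does not contain the input fields. Here is the HTML: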

<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
    <tbody>
        <tr>
            <th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
        </tr>
        <tr>
            <td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided 2 (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Client Directed (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Holding MMKT (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Total</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
    </tbody>
</table>


Here is my code:


url = driver.page_source

soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))



What am I doing wrong? Why does it only print the left side of the table? Any suggestions on what to change would be greatly appreciated.

Results:

 Portfolio Allocation (%)


AdvisorGuided (Capital Portfolio)
 100 100




AdvisorGuided 2 (Capital Portfolio)
 0 100




Client Directed (Capital Portfolio)
 0 100




Holding MMKT (Capital Portfolio)
 0 100




Total
 100 100

Best answer

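You have to go further down into the child and sibling nodes and pull out the attributes, because those values are stored in the value and maxvalue attributes of the <input> tags rather than as actual text/content. cell.text only returns text nodes, so the right-hand cells come back as whitespace.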
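To make the difference concrete, here is a minimal sketch. It assumes a trimmed-down, hypothetical one-cell fragment of the table, and it uses the built-in "html.parser" instead of "lxml" only so that it runs without an extra dependency. It compares what the cell's text contains with what the input's attributes contain:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down fragment of one right-hand cell from the table.
snippet = '<td><span><input type="text" value="100" maxvalue="100"></span></td>'

cell = BeautifulSoup(snippet, "html.parser").find("td")

cell_text = cell.get_text(strip=True)   # text nodes only; an <input> has none
input_tag = cell.find("input")          # first <input> inside the cell
input_value = input_tag["value"]        # attributes are read like dict keys
input_max = input_tag.get("maxvalue")   # .get() returns None if the attribute is absent

print(repr(cell_text), input_value, input_max)
```

The text of the cell is an empty string, while the numbers are available through the tag's attributes.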

import pandas as pd
import bs4


html = '''<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
    <tbody>
        <tr>
            <th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
        </tr>
        <tr>
            <td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided 2 (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Client Directed (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Holding MMKT (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Total</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
    </tbody>
</table>'''


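soup = bs4.BeautifulSoup(html, "lxml")
table = soup.find('table', id="portAllocTable")

list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        try:
            # The numbers live in attributes of the two adjacent <input> tags,
            # not in the cell's text: read value from the first (hidden) input
            # and maxvalue from its next sibling, the visible text input.
            hidden_input = cell.find('input')
            val = hidden_input['value']
            max_val = hidden_input.next_sibling['maxvalue']
            list_of_cells.append(val)
            list_of_cells.append(max_val)
        except (TypeError, KeyError):
            # TypeError: the cell has no <input> (find() returned None) or the
            # sibling is not a tag; KeyError: the attribute is missing.
            pass
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))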
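An alternative to walking siblings is to select the visible text input directly with a CSS selector. The following is only a sketch under stated assumptions: the helper name extract_rows and the shortened sample_html are hypothetical, the built-in "html.parser" is used instead of "lxml", and it assumes each data row has its label in the first <td> and exactly one input of type "text".

```python
from bs4 import BeautifulSoup

# Hypothetical two-row excerpt of the table, shortened for the example.
sample_html = """
<table id="portAllocTable">
  <tr><th colspan="2"><span> Portfolio Allocation (%) </span></th></tr>
  <tr>
    <td><span>AdvisorGuided (Capital Portfolio)</span></td>
    <td><span>
      <input type="hidden" value="100"><input type="text" value="100" maxvalue="100">
    </span></td>
  </tr>
  <tr>
    <td><span>AdvisorGuided 2 (Capital Portfolio)</span></td>
    <td><span>
      <input type="hidden" value="0"><input type="text" value="0" maxvalue="100">
    </span></td>
  </tr>
</table>
"""


def extract_rows(table):
    """Return (label, value, maxvalue) for every row that has a text input."""
    extracted = []
    for row in table.find_all("tr"):
        text_input = row.select_one('input[type="text"]')
        if text_input is None:
            continue  # header or spacer row: nothing to read
        label = row.find("td").get_text(strip=True)
        extracted.append((label, text_input.get("value"), text_input.get("maxvalue")))
    return extracted


sample_table = BeautifulSoup(sample_html, "html.parser").find("table", id="portAllocTable")
for label, value, maxvalue in extract_rows(sample_table):
    print(label, value, maxvalue)
```

Checking for None instead of catching exceptions keeps rows without inputs (the header and the spacer) out of the result, and each tuple already pairs a label with its numbers.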


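To build a table (a pandas DataFrame), you can do the following. You will need to do some cleanup afterwards, but it should get you started: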

records = []
for row in table.find_all('tr'):
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        try:
            hidden_input = cell.find('input')
            val = hidden_input['value']
            max_val = hidden_input.next_sibling['maxvalue']
        except (TypeError, KeyError):
            # No <input> pair in this cell: leave the value columns empty.
            val = ''
            max_val = ''
        records.append([text, val, max_val])

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the frame once at the end.
results = pd.DataFrame(records, columns=['text', 'value', 'maxvalue'])
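The cleanup mentioned above could look roughly like this. It is a sketch, not the answer author's method: the raw frame below is a hypothetical stand-in for a cell-per-row frame with columns text, value and maxvalue, and it assumes every cell that carries numbers comes immediately after the cell that carries its label.

```python
import pandas as pd

# Hypothetical stand-in for the cell-per-row frame: one record per <th>/<td>.
raw = pd.DataFrame(
    [
        [" Portfolio Allocation (%) ", "", ""],
        ["", "", ""],
        ["\nAdvisorGuided (Capital Portfolio)\n", "", ""],
        ["\n\n\n", "100", "100"],
        ["\nAdvisorGuided 2 (Capital Portfolio)\n", "", ""],
        ["\n\n\n", "0", "100"],
    ],
    columns=["text", "value", "maxvalue"],
)

# Remove the surrounding whitespace and newlines from the cell text.
raw["text"] = raw["text"].str.strip()

# Each value cell follows its label cell, so shift the labels down by one row.
raw["portfolio"] = raw["text"].shift(1)

# Keep only the cells that carried numbers, then convert them from strings.
clean = raw.loc[raw["value"] != "", ["portfolio", "value", "maxvalue"]].reset_index(drop=True)
clean[["value", "maxvalue"]] = clean[["value", "maxvalue"]].apply(pd.to_numeric)

print(clean)
```

The result has one row per portfolio with numeric value and maxvalue columns, which is usually easier to work with than one row per cell.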

Regarding "python - BeautifulSoup only scraping half of my table?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56807154/
