I'm trying to scrape a table from an AJAX page with Beautiful Soup and print it in tabular form using the texttable library.

import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

...

def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s

...

Although the URL of the HTML I'm trying to scrape appears in the code, here is a sample of it:
<table cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
                        <th>Artist - Title</th>
            <th>Album</th>
            <th>Album Type</th>
            <th>Series</th>
            <th>Duration</th>
            <th>Type of Play</th>
            <th>
                <span title="...">Time to play</span>
            </th>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 1</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 1</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                5:43
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:00:00
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song2</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 2</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                6:16
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:05:43
            </td>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 3</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 3</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                4:13
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:11:59
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song 4</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 4</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                5:34
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:16:12
            </td>
                    </tr>
                <tr>
                        <td class="row1"><a href="..." class="songinfo">Song 5</a>

            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 5</a>
            </td>
            <td class="row1">...</td>
            <td class="row1"></td>
            <td class="row1" style="text-align: center">
                4:23
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:21:46
            </td>
                    </tr>
                <tr>
            <td style="height: 5px;">
        </td></tr>
        <tr>
            <td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
        </tr>
    </tbody>
</table>

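Whenever I try to run this function, it terminates with TypeError: find() takes no keyword arguments on the line header_append = th.find(text=True). I'm a bit confused, because I seem to be doing what the code examples show, and it looks like it should work, but it doesn't.
In short, how do I fix the code so that the TypeError no longer occurs? What am I doing wrong?
Edit:
The articles and documentation I referred to while writing the script: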
http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/
http://oneau.wordpress.com/2010/05/30/simple-formatted-tables-in-python-with-texttable/

Best answer

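Basic problem
The parser is behaving correctly. You are simply applying the same expression, find(text=True), to different kinds of objects: it works on a tag, but not on the plain text strings that end up in your loops.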
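To make that concrete, here is a minimal sketch of how a string can end up in the header loop. In the question, header.append(header_append) adds each found string to the very list being iterated (header_text was presumably intended), so the loop eventually reaches a string, and a string's own find() method accepts no keyword arguments. FakeTag below is a hypothetical stand-in written for this illustration, not Beautiful Soup's real Tag class.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed <th> element."""

    def __init__(self, text):
        self._text = text

    def find(self, text=False):
        # Mimics tag.find(text=True): return the first string inside the tag.
        return self._text if text else None


# Reproduce the bug: append to the list that is being iterated.
header = [FakeTag("Album"), FakeTag("Series")]
error = None
try:
    for th in header:
        found = th.find(text=True)
        header.append(found)  # grows 'header', so the loop later reaches a str
except TypeError as exc:
    error = str(exc)  # raised by the string's find(), not by FakeTag.find()

# The fix: collect the strings in a separate list.
header = [FakeTag("Album"), FakeTag("Series")]
header_text = [th.find(text=True) for th in header]
```

The row loop in the question likely has a related problem: stable.find('tr') returns a single element rather than a list, so iterating over it walks that element's children, which include whitespace strings. Using findAll, as in the revised code, avoids both cases.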
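Revised code
Here is a snippet that focuses only on returning the scraped lists. Once you have the lists, formatting them as a text table is easy: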

import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')

    header = stable.findAll('th')
    headers = [ th.text for th in header ]

    cells = [ ]
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Body rows only: skip the header row and the two trailing rows
        row = []
        tds = tr.findAll('td')
        row.append( tds[0].find('a').text )   # song title link
        row.append( tds[1].find('a').text )   # album link
        row.extend( [ td.text for td in tds[2:] ] )
        cells.append( row )

    footer = rows[-1].find('td').text
    return headers, cells, footer
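The slice rows[1:-2] depends on the layout of the sample table: one header row at the top, then the song rows, then a spacer row and a summary row at the bottom. A small sketch with plain strings standing in for the tr elements (the labels are illustrative only) shows which rows each expression selects:

```python
# Plain strings stand in for the <tr> elements of the sample table.
rows = [
    "header",
    "song 1", "song 2", "song 3", "song 4", "song 5",
    "spacer",
    "summary",
]

body = rows[1:-2]   # drops the first row and the last two rows
footer = rows[-1]   # the summary row is always the last one
```

If the page ever changes the number of trailing rows, the slice would need to change with it, so it may be worth checking the row count before relying on it.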

Output
The headers, cells, and footer can now be fed into a texttable formatting function:
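import texttable

def show_table(headers, cells, footer):
    # Build the text table, then append the summary line below it
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    return table.draw() + '\n' + footer

# html is the page source as a string, e.g. f.read() in the question's code
headers, cells, footer = get_queue(html)
print show_table(headers, cells, footer)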

+----------+----------+----------+----------+----------+----------+----------+
| Artist - |  Album   |  Album   |  Series  | Duration | Type of  | Time to  |
|  Title   |          |   Type   |          |          |   Play   |   play   |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1   | Album 1  | ...      |          | 5:43     | S.A.M.   | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2    | Album 2  | ...      |          | 6:16     | S.A.M.   | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3   | Album 3  | ...      |          | 4:13     | S.A.M.   | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4   | Album 4  | ...      |          | 5:34     | S.A.M.   | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5   | Album 5  | ...      |          | 4:23     | S.A.M.   | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.
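
The code in this answer targets Python 2 and BeautifulSoup 3, matching the question. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the bs4 package is installed, which the original answer does not use. The sample markup is a shortened, made-up stand-in for the queue table, and get_text(strip=True) is applied to every cell instead of looking up the a element, which is a small simplification.

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed


def get_queue(data):
    """Return (headers, cells, footer) scraped from the first table in data."""
    soup = BeautifulSoup(data, "html.parser")
    stable = soup.find("table")
    headers = [th.get_text(strip=True) for th in stable.find_all("th")]
    rows = stable.find_all("tr")
    # Same layout assumption as the original code: header row first,
    # spacer and summary rows last.
    cells = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in rows[1:-2]
    ]
    footer = rows[-1].find("td").get_text(strip=True)
    return headers, cells, footer


# Shortened, illustrative markup; not the real page.
sample = """
<table>
  <tr><th>Artist - Title</th><th>Duration</th></tr>
  <tr><td><a href="#">Song 1</a></td><td> 5:43 </td></tr>
  <tr><td><a href="#">Song2</a></td><td> 6:16 </td></tr>
  <tr><td style="height: 5px;"></td></tr>
  <tr><td colspan="2">There are x songs in the queue.</td></tr>
</table>
"""
headers, cells, footer = get_queue(sample)
```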

08-08 03:14