I'm trying to scrape a table from an AJAX page with Beautiful Soup and then print it out in table form with the texttable library.
import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
...
def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s
...
Even though the URL of the HTML I'm trying to scrape is in the code above, here is a sample of it:
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<th>Artist - Title</th>
<th>Album</th>
<th>Album Type</th>
<th>Series</th>
<th>Duration</th>
<th>Type of Play</th>
<th>
<span title="...">Time to play</span>
</th>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 1</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 1</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
5:43
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:00:00
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song2</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 2</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
6:16
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:05:43
</td>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 3</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 3</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
4:13
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:11:59
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song 4</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 4</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
5:34
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:16:12
</td>
</tr>
<tr>
<td class="row1"><a href="..." class="songinfo">Song 5</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 5</a>
</td>
<td class="row1">...</td>
<td class="row1"></td>
<td class="row1" style="text-align: center">
4:23
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:21:46
</td>
</tr>
<tr>
<td style="height: 5px;">
</td></tr>
<tr>
<td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
</tr>
</tbody>
</table>
Whenever I try to run this script, the function dies with

TypeError: find() takes no keyword arguments

on the line header_append = th.find(text=True). I'm a little confused, since I seem to be doing exactly what the code examples show and it looks like it should work, but it doesn't. In short, how do I fix the code to avoid the TypeError? What am I doing wrong?
Edit:
The articles and documentation I referred to while writing the script:
http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/
http://oneau.wordpress.com/2010/05/30/simple-formatted-tables-in-python-with-texttable/
Best answer
The basic problem

The parser is behaving correctly; you are simply applying the same expression to different kinds of elements. In the header loop, header.append(header_append) adds the extracted strings to header, the very list you are iterating over, instead of to header_text, so a later pass of the loop hands you a plain string rather than a Tag, and a string's find() method takes no keyword arguments. (The row loop has a related mix-up: stable.find('tr') returns only the first <tr>, so you never see the other rows; findAll is what you want there.)
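As a side note, the TypeError itself can be reproduced in isolation. The following is only an illustrative sketch, not part of the original answer, using the same BeautifulSoup 3 / Python 2 setup as the question: .find(text=True) is fine on a Tag, but the value it returns is a NavigableString, which subclasses unicode, and the built-in string find() accepts only positional arguments.
import BeautifulSoup

# Minimal reproduction of the reported error (illustrative only).
soup = BeautifulSoup.BeautifulSoup('<table><tr><th>Album</th></tr></table>')
th = soup.find('th')

text = th.find(text=True)    # works on a Tag and returns a NavigableString (u'Album')

# NavigableString subclasses unicode, so the next call dispatches to unicode.find(),
# which takes no keyword arguments:
text.find(text=True)         # TypeError: find() takes no keyword arguments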
Revised code

Here is a snippet that focuses on just returning the scraped lists. Once you have the lists, it is easy to format them as a text table:
import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')

    header = stable.findAll('th')
    headers = [ th.text for th in header ]

    cells = [ ]
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Process the body of the table, skipping the header row
        # and the two trailing rows (the spacer and the summary).
        row = []
        tds = tr.findAll('td')
        row.append( tds[0].find('a').text )           # song title sits inside an <a>
        row.append( tds[1].find('a').text )           # album name sits inside an <a>
        row.extend( [ td.text for td in tds[2:] ] )   # remaining columns are plain text
        cells.append( row )

    footer = rows[-1].find('td').text                 # the summary line in the last row
    return headers, cells, footer
Output

The headers, cells, and footer returned above can now be fed into a texttable formatting function:
import texttable

def show_table(headers, cells, footer):
    retval = ''
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

print show_table(headers, cells, footer)
+----------+----------+----------+----------+----------+----------+----------+
| Artist - | Album | Album | Series | Duration | Type of | Time to |
| Title | | Type | | | Play | play |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1 | Album 1 | ... | | 5:43 | S.A.M. | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2 | Album 2 | ... | | 6:16 | S.A.M. | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3 | Album 3 | ... | | 4:13 | S.A.M. | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4 | Album 4 | ... | | 5:34 | S.A.M. | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5 | Album 5 | ... | | 4:23 | S.A.M. | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.
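For completeness, here is one way the two pieces could be wired back into the original script. This is only an illustrative sketch, not part of the accepted answer: it reuses the opener, URL, and POST values from the question, assumes get_queue() and show_table() above are defined, and simply reads the response into a string before scraping it.
import urllib
import urllib2
import cookielib

# Illustrative glue code: fetch the AJAX page the same way the question does,
# then scrape it with get_queue() and pretty-print it with show_table().
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

url = 'https://www.animenfo.com/radio/nowplaying.php'
values = {'ajax': 'true', 'mod': 'queue'}
data = urllib.urlencode(values)

f = opener.open(url, data)
html = f.read()                      # get_queue() expects the HTML as a string

headers, cells, footer = get_queue(html)
print show_table(headers, cells, footer)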