python - Selecting a specific set of cells across a group of tables using Python and BeautifulSoup

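Consider N web pages.
Each page has one or more tables. What the tables have in common is that they share the same class; call it "table_class".
We need the content under the same column of every table [the third column, whose header is Title].
"Content" here means the href links in the third column of every row.
Some rows may contain only plain text, while others may contain href links.
Each href link should be printed on its own line.
Filtering by attributes does not work, because some tags have different attributes. The position of the cell is the only hint available.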

How would you code this?

Consider the following two page links:

http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014
http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2013

Table to consider: wikitable

Required content: the href links in the Title column

The code I tried for a single page:

from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer


content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class":"wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)

for sp in soup.find_all('tr'):
    for bt in sp.find_all('td'):
        for link in bt.find_all('a'):
            print(link.get("href"))
    print()

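Best answer

The idea is to iterate over every table that has the wikitable class. For each table, find the links that sit directly inside an i tag, which is directly inside a td, which is directly inside a tr: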

import requests
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# iterate over the sortable wikitables
for table in soup.select('table.wikitable.sortable'):
    # get the table header/description, skip the table if not found
    h3 = table.find_previous_sibling('h3')
    if h3 is None:
        continue
    print(h3.text)

    # get the links: <a> directly inside <i>, inside <td>, inside <tr>
    for link in table.select('tr > td > i > a'):
        print(link.text, "|", link.get('href', ''))

    print("------")

This prints (the table names are printed as well, for clarity):

January 2014–june 2014[edit]
Celebrity | /wiki/Celebrity
Kshatriya | /wiki/Kshatriya
1: Nenokkadine | /wiki/1:_Nenokkadine
...
Oohalu Gusagusalade | /wiki/Oohalu_Gusagusalade
Autonagar Surya | /wiki/Autonagar_Surya
------
July 2014 – December 2014[edit]
...
O Manishi Katha | /wiki/O_Manishi_Katha
Mukunda | /wiki/Mukunda
Chinnadana Nee Kosam | /wiki/Chinnadana_Nee_Kosam
------
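
The question asks for selection by cell position rather than by tag structure, so here is a minimal sketch of that variant. The function name `column_links`, its parameters, and the `SAMPLE` markup are made up for illustration and are not part of the original answer. It assumes every data row lays out its cells the same way; a table that uses rowspan or colspan shifts the cell positions, and the index would then point at the wrong column.

```python
from bs4 import BeautifulSoup


def column_links(html, table_class="wikitable", column=2):
    """Return the href of every link in one column (0-based index)
    of every table that carries the given class."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for table in soup.find_all("table", class_=table_class):
        for row in table.find_all("tr"):
            # Direct <td> children only; header rows made of <th> yield no cells.
            cells = row.find_all("td", recursive=False)
            if len(cells) <= column:
                continue
            # A cell with plain text simply contributes no links.
            for link in cells[column].find_all("a", href=True):
                hrefs.append(link["href"])
    return hrefs


# Hypothetical markup standing in for a downloaded page.
SAMPLE = """
<table class="wikitable sortable">
  <tr><th>Date</th><th>Day</th><th>Title</th><th>Director</th></tr>
  <tr><td>1</td><td>Fri</td><td><i><a href="/wiki/Film_A">Film A</a></i></td><td><a href="/wiki/Dir_A">Dir A</a></td></tr>
  <tr><td>2</td><td>Sat</td><td>Plain text title</td><td>Dir B</td></tr>
</table>
<table class="other"><tr><td>x</td><td>y</td><td><a href="/wiki/Skip">skip</a></td></tr></table>
"""

# One href per line, as the question requires.
for href in column_links(SAMPLE):
    print(href)
```

To cover several pages, the same function could be called once per URL with the downloaded page body, for example `column_links(requests.get(url).content)`.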

A similar question about "python - Selecting a specific set of cells across a group of tables using Python and BeautifulSoup" can be found on Stack Overflow: https://stackoverflow.com/questions/29525613/