python - Scraping with BeautifulSoup: how to scrape entire columns, including the column headers and row titles

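Using Python, I am trying to save the data under the columns whose code is "SVENYXX", where "XX" is the number that follows (e.g. 01, 02, etc.), from the site requested in the code below.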

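With the code below, I can get the first row of all the "column" data I want. However, is there a way to include the column headers and the row titles as well?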

I know I already have the headers, but is there a way to include them in the output data?
Also, how would I go about including all of the rows?

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)

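Best answer

How about this?

I added th.getText() so that each entry in the desired_columns list holds the column name alongside its index, and then added row_name = row.findNext('th').getText() to get the title of each row.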

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid bs4's parser-guessing warning

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    # getText() avoids a TypeError when th.string is None (a <th> with nested tags)
    if 'SVENY' in th.getText():
        desired_columns.append([headers.index(th), th.getText()])

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
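
If the goal is to save the columns rather than print them, the same idea can be sketched so that the header row and the row titles end up in one list of records. This is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in for the real page (its dates and values are invented), it assumes the first cell of every row is the row title, and extract_columns and to_csv are hypothetical helper names, not part of BeautifulSoup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up stand-in for the real table: a header row of <th> cells,
# then data rows whose first cell is a <th> holding the row title.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>BETA0</th><th>SVENY01</th><th>SVENY02</th></tr>
  <tr><th>2015-06-01</th><td>3.9</td><td>0.27</td><td>0.68</td></tr>
  <tr><th>2015-06-02</th><td>4.0</td><td>0.28</td><td>0.71</td></tr>
</table>
"""


def extract_columns(html, prefix):
    """Return a header record followed by one record per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")

    # Index the header row over both <th> and <td>, so the positions
    # line up with the data rows, which are indexed the same way below.
    header_cells = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    wanted = [i for i, name in enumerate(header_cells) if name.startswith(prefix)]

    records = [[header_cells[0]] + [header_cells[i] for i in wanted]]
    for row in rows[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) < len(header_cells):
            continue  # skip short or empty rows instead of raising IndexError
        records.append([cells[0]] + [cells[i] for i in wanted])
    return records


def to_csv(records):
    """Serialise the records as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


records = extract_columns(SAMPLE_HTML, "SVENY")
print(to_csv(records), end="")
```

Counting th and td cells together in every row keeps the header index and the data index in step, which matters when the row title is itself a th. To run this against the live page, pass the downloaded HTML instead of SAMPLE_HTML, and pick the right table first if the page contains more than one.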

A similar question about python - Scraping with BeautifulSoup: how to scrape entire columns, including the column headers and row titles, can be found on Stack Overflow: https://stackoverflow.com/questions/30741576/