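I have been trying to write a script that gets data from an HTML page and saves it to a .csv file, but I have run into three small problems.
First, when saving to .csv I get some unwanted line breaks, which mess up the output file.
Second, the players' names (the data is about NBA players) appear twice.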
from bs4 import BeautifulSoup
import requests
import time
teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = []
a=2018
while (a>2016):
    seasons.append(str(a))
    a-=1
print(seasons)
for season in seasons:
    for team in teams:
        my_url = ' https://www.spotrac.com/nba/'+team+'/cap/'+ season +'/'
        headers = {"User-Agent" : "Mozilla/5.0"}
        response = requests.get(my_url)
        response.content
        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_ = 'datatable')
        my_table = stat_table[0]
        plik = team + season + '.csv'
        with open (plik, 'w') as r:
            for row in my_table.find_all('tr'):
                for cell in row.find_all('th'):
                    r.write(cell.text)
                    r.write(";")
            for row in my_table.find_all('tr'):
                for cell in row.find_all('td'):
                    r.write(cell.text)
                    r.write(";")
Third, some numbers separated by "." are automatically converted to dates.
Is there any way to solve these problems?
Screenshot of output file
Best answer
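Richard has provided a complete answer that works on Python 3.6+.
However, it calls file.write() for every cell, which is unnecessary. Here is an alternative using str.format() that also works on Python versions before 3.6 and writes once per row: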
from bs4 import BeautifulSoup
import requests

teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = [2018, 2017]

for season in seasons:
    for team in teams:
        my_url = 'https://www.spotrac.com/nba/{}/cap/{}/'.format(team, season)
        headers = {"User-Agent": "Mozilla/5.0"}
        # Pass the headers so the User-Agent is actually sent with the request
        response = requests.get(my_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_='datatable')
        my_table = stat_table[0]
        csv_file = '{}-{}.csv'.format(team, season)
        with open(csv_file, 'w') as r:
            for row in my_table.find_all('tr'):
                # Build the whole row first, then write it in a single call
                row_string = ''
                for cell in row.find_all('th'):
                    row_string = '{}{};'.format(row_string, cell.text.strip())
                for i, cell in enumerate(row.find_all('td')):
                    # The first cell contains the name twice; its link text has it once.
                    # Fall back to the stripped cell text if the cell has no link.
                    cell_string = cell.a.text if i == 0 and cell.a else cell.text.strip()
                    row_string = '{}{};'.format(row_string, cell_string)
                r.write("{}\n".format(row_string))
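The same "one write per row" idea can also be sketched with the standard library's csv module, which may be easier to maintain than building the row string by hand. This is only a sketch under assumptions: `clean_cell` and `rows_to_csv` are hypothetical helper names, the sample rows are made-up stand-ins for what `cell.text` might return, and an in-memory buffer is used instead of a real file so that it runs without network access. It only removes stray whitespace and line breaks; it does not by itself remove a duplicated name, which the `cell.a.text` approach above handles.

```python
import csv
import io


def clean_cell(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def rows_to_csv(rows, delimiter=";"):
    """Return CSV text for an iterable of rows, writing each row in one call."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return buffer.getvalue()


# Hypothetical cell texts, shaped like what cell.text might return.
sample_rows = [
    ["Player Name", "Age", "Cap Hit"],
    ["\nSmith\nJohn Smith\n", " 27 ", "$1,000,000"],
]
print(rows_to_csv(sample_rows))
```

In the scraping loop, the rows could be built as lists of cell texts and passed to a `csv.writer` opened on the output file with `newline=''`, which is what the csv module documentation recommends for files it writes to.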
Regarding "python - HTML Scraping with Beautiful Soup - Unwanted line breaks", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54848064/