Problem description
I am trying to web scrape, using Python 3, a chart from this website into a .csv file: 2016 NBA National TV Schedule
The chart starts out like:
Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN
I am using these packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
The output I want in a .csv file looks like this:
These are the first six rows of the website's chart as they should appear in the .csv file. Notice how each date appears more than once. How do I implement the scraper to get this output?
Solution
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

# The whole schedule sits in one <p> tag with <br> separators, so join the
# text with a sentinel character and split it back into individual lines:
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

# Group consecutive lines by whether they contain a weekday name:
# a weekday line starts a new date; the lines after it are that date's games.
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        # parse "8:00 PM Knicks/Cavaliers TNT" -> (time, team 1, team 2, network)
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

# Flatten {date: [(time, team1, team2, network), ...]} into parallel columns:
all_data = {'Date': [], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
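The grouping step is the tricky part of this answer. Here is a minimal, self-contained sketch (using a few hard-coded sample lines from the chart above, so no network access is needed) of how `itertools.groupby` plus the regex turn the flat list of lines into a date-to-games mapping:

```python
import re
from itertools import groupby

days = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

# Sample lines, as they come out of the <p> tag after splitting on the separator.
lines = [
    'Tuesday, October 25',
    '8:00 PM Knicks/Cavaliers TNT',
    '10:30 PM Spurs/Warriors TNT',
    'Wednesday, October 26',
    '8:00 PM Thunder/Sixers ESPN',
]

dates, last = {}, ''
# groupby yields runs of consecutive lines sharing the same key value:
# True for date headers (they contain a weekday name), False for game rows.
for is_day, group in groupby(lines, lambda line: any(d in line for d in days)):
    if is_day:
        last = list(group)[0]   # a date header opens a new bucket
        dates[last] = []
    else:
        # each game row parses into a (time, team 1, team 2, network) tuple
        dates[last].extend(
            [re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', line)[0] for line in group]
        )

print(dates)
```

Because `groupby` only merges *consecutive* equal keys, every date header starts a fresh group and its games land in the bucket opened just before them, which is exactly why repeated dates in the final table come out in order.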
Prints:
Date Time Team 1 Team 2 Network
0 Tuesday, October 25 8:00 PM Knicks Cavaliers TNT
1 Tuesday, October 25 10:30 PM Spurs Warriors TNT
2 Wednesday, October 26 8:00 PM Thunder Sixers ESPN
3 Wednesday, October 26 10:30 PM Rockets Lakers ESPN
4 Thursday, October 27 8:00 PM Celtics Bulls TNT
.. ... ... ... ... ...
159 Saturday, April 8 8:30 PM Clippers Spurs ABC
160 Monday, April 10 8:00 PM Wizards Pistons TNT
161 Monday, April 10 10:30 PM Rockets Clippers TNT
162 Wednesday, April 12 8:00 PM Hawks Pacers ESPN
163 Wednesday, April 12 10:30 PM Pelicans Blazers ESPN
[164 rows x 5 columns]
And saves data.csv (screenshot from LibreOffice):
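One small note on the saving step: by default `DataFrame.to_csv` also writes the row index as an unnamed first column, which is what the row numbers in the screenshot correspond to. If you want only the five named columns in the file, pass `index=False`. A self-contained sketch of the round trip (using two sample rows in place of the scraped data):

```python
import io
import pandas as pd

# A small stand-in for the scraped table (sample rows from the chart above).
df = pd.DataFrame({
    'Date': ['Tuesday, October 25', 'Tuesday, October 25'],
    'Time': ['8:00 PM', '10:30 PM'],
    'Team 1': ['Knicks', 'Spurs'],
    'Team 2': ['Cavaliers', 'Warriors'],
    'Network': ['TNT', 'TNT'],
})

buf = io.StringIO()
df.to_csv(buf, index=False)   # index=False drops the unnamed row-number column
print(buf.getvalue())
```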