This code gets data from www.oddsportal.com
How can I accommodate events for which no score is present in this code?
Currently, the code scrapes all data from the pages:
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import threading
from multiprocessing.pool import ThreadPool
import os
import re


class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment next line to suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def generate_matches(table):
    tr_tags = table.findAll('tr')
    for tr_tag in tr_tags:
        if 'class' in tr_tag.attrs and 'dark' in tr_tag['class']:
            th_tag = tr_tag.find('th', {'class': 'first2 tl'})
            a_tags = th_tag.findAll('a')
            country = a_tags[0].text
            league = a_tags[1].text
        else:
            td_tags = tr_tag.findAll('td')
            yield td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text, \
                td_tags[4].text, td_tags[5].text, country, league


def parse_data(url, return_urls=False):
    browser = create_driver()
    browser.get(url)
    soup = bs(browser.page_source, "lxml")
    div = soup.find('div', {'id': 'col-content'})
    table = div.find('table', {'class': 'table-main'})
    h1 = soup.find('h1').text
    print(h1)
    m = re.search(r'\d+ \w+ \d{4}$', h1)
    game_date = m[0]
    game_data = GameData()
    for row in generate_matches(table):
        game_data.date.append(game_date)
        game_data.time.append(row[0])
        game_data.game.append(row[1])
        game_data.score.append(row[2])
        game_data.home_odds.append(row[3])
        game_data.draw_odds.append(row[4])
        game_data.away_odds.append(row[5])
        game_data.country.append(row[6])
        game_data.league.append(row[7])
    if return_urls:
        span = soup.find('span', {'class': 'next-games-date'})
        a_tags = span.findAll('a')
        urls = ['https://www.oddsportal.com' + a_tag['href'] for a_tag in a_tags]
        return game_data, urls
    return game_data


if __name__ == '__main__':
    results = None
    pool = ThreadPool(5)  # We will be getting, however, 7 URLs
    # Get today's data and the Urls for the other days:
    game_data_today, urls = pool.apply(parse_data, args=('https://www.oddsportal.com/matches/soccer', True))
    urls.pop(1)  # Remove url for today: We already have the data for that
    game_data_results = pool.imap(parse_data, urls)
    for i in range(8):
        game_data = game_data_today if i == 1 else next(game_data_results)
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)
    print(results)
    # print(results.head())
    # ensure all the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance
When scores are present, table-score is populated; when scores are not present, table-score is missing from the row. Right now, the column values for home_odds, away_odds and draw_odds shift whenever table-score is absent, so the odds end up recorded under the wrong columns.
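For illustration (these values are purely illustrative), a row yielded by generate_matches for a finished match versus one for an upcoming match looks roughly like this:

# score present: row[2] is the score, row[3:6] are the home/draw/away odds
['00:05', 'Panama - Mexico', '1:1', '+518', '+250', '-156', 'World', 'World Cup 2022']

# score absent: the odds slide into row[2:5] and row[5] holds whatever the sixth cell contains
['19:00', 'Olympiacos Piraeus - Antwerp', '-137', '+296', '+371', '...', 'Europe', 'Europa League']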
How can I change

game_data.score.append(row[2])
game_data.home_odds.append(row[3])
game_data.draw_odds.append(row[4])
game_data.away_odds.append(row[5])

so that, if table-score is not present, game_data.score.append(row[2]) appends NaN instead and the odds are taken from

game_data.home_odds.append(row[2])
game_data.draw_odds.append(row[3])
game_data.away_odds.append(row[4])

and otherwise the output stays as it currently is?
You need to first:
from numpy import nan
Then modify the code as follows:
...
    # Score present?
    if ':' not in row[2]:
        # No, shift a few columns right:
        row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
    game_data.score.append(row[2])
    game_data.home_odds.append(nan if row[3] == '-' else row[3])
    game_data.draw_odds.append(nan if row[4] == '-' else row[4])
    game_data.away_odds.append(nan if row[5] == '-' else row[5])
...
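If you want to convince yourself that the multiple assignment shifts the values correctly, here is a small standalone check (the row contents are invented for the example; '12' stands in for whatever the sixth cell holds when there is no score):

from numpy import nan

# a row as generate_matches would yield it for a match that has no score yet
row = ['00:00', 'Team A - Team B', '-104', '+265', '+237', '12', 'USA', 'USL Championship']
if ':' not in row[2]:
    # the right-hand side is evaluated before any assignment, so nothing is overwritten mid-shift
    row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
print(row[2:6])  # [nan, '-104', '+265', '+237']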
Note that generate_matches has to be modified to yield list instances rather than tuple instances, since the code above now requires that the yielded value, i.e. row, be modifiable.
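Concretely, that is the only change needed inside generate_matches; the relevant excerpt from the full listing below is:

# before: yields a tuple, whose elements cannot be reassigned
yield td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text, \
      td_tags[4].text, td_tags[5].text, country, league

# after: yields a list, so process_page can shift row[2:6] in place
yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text,
       td_tags[4].text, td_tags[5].text, country, league]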
Putting it all together:
import pandas as pd
from numpy import nan
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import threading
from multiprocessing.pool import ThreadPool, Pool
from functools import partial
import os
import re


class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment next line to suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def generate_matches(table):
    tr_tags = table.findAll('tr')
    for tr_tag in tr_tags:
        if 'class' in tr_tag.attrs and 'dark' in tr_tag['class']:
            th_tag = tr_tag.find('th', {'class': 'first2 tl'})
            a_tags = th_tag.findAll('a')
            country = a_tags[0].text
            league = a_tags[1].text
        else:
            td_tags = tr_tag.findAll('td')
            yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text,
                   td_tags[4].text, td_tags[5].text, country, league]


def parse_data(process_pool, url, return_urls=False):
    browser = create_driver()
    browser.get(url)
    # Wait for initial content to be dynamically updated with scores:
    browser.implicitly_wait(5)
    table = browser.find_element_by_xpath('//*[@id="table-matches"]/table')
    # If you do not pass a Pool instance to this function to use
    # multiprocessing for the more CPU-intensive work,
    # then just replace next statement with: return process_page(browser.page_source, return_urls)
    return process_pool.apply(process_page, args=(browser.page_source, return_urls))


def process_page(page_source, return_urls):
    soup = bs(page_source, "lxml")
    div = soup.find('div', {'id': 'table-matches'})
    table = div.find('table', {'class': 'table-main'})
    h1 = soup.find('h1').text
    print(h1)
    m = re.search(r'\d+ \w+ \d{4}$', h1)
    game_date = m[0]
    game_data = GameData()
    for row in generate_matches(table):
        game_data.date.append(game_date)
        game_data.time.append(row[0])
        game_data.game.append(row[1])
        # Score present?
        if ':' not in row[2]:
            # No, shift a few columns right:
            row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
        game_data.score.append(row[2])
        game_data.home_odds.append(nan if row[3] == '-' else row[3])
        game_data.draw_odds.append(nan if row[4] == '-' else row[4])
        game_data.away_odds.append(nan if row[5] == '-' else row[5])
        game_data.country.append(row[6])
        game_data.league.append(row[7])
    if return_urls:
        span = soup.find('span', {'class': 'next-games-date'})
        a_tags = span.findAll('a')
        urls = ['https://www.oddsportal.com' + a_tag['href'] for a_tag in a_tags]
        return game_data, urls
    return game_data


if __name__ == '__main__':
    results = None
    pool = ThreadPool(3)  # This seems to be optimal for this application
    # Create multiprocessing pool to do the CPU-intensive processing:
    process_pool = Pool(min(5, os.cpu_count()))  # 5 seems to be optimal for this application
    # Get today's data and the Urls for the other days:
    game_data_today, urls = pool.apply(parse_data, args=(process_pool, 'https://www.oddsportal.com/matches/soccer', True))
    urls.pop(1)  # Remove url for today: We already have the data for that
    game_data_results = pool.imap(partial(parse_data, process_pool), urls)
    for i in range(8):
        game_data = game_data_today if i == 1 else next(game_data_results)
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)
    print(results)
    # print(results.head())
    # ensure all the drivers are "quitted":
    del threadLocal
Prints:
Next Soccer Matches: Today, 10 Sep 2021
Next Soccer Matches: Tuesday, 14 Sep 2021
Next Soccer Matches: Wednesday, 15 Sep 2021
Next Soccer Matches: Thursday, 16 Sep 2021
Next Soccer Matches: Yesterday, 09 Sep 2021
Next Soccer Matches: Sunday, 12 Sep 2021
Next Soccer Matches: Monday, 13 Sep 2021
Next Soccer Matches: Tomorrow, 11 Sep 2021
date time game score home_odds draw_odds away_odds country league
0 09 Sep 2021 00:00 Cumbaya - Guayaquil SC 1:0 -169 +263 +462 Ecuador Serie B
1 09 Sep 2021 00:00 FC Tulsa - Indy Eleven 2:1 -104 +265 +237 USA USL Championship
2 09 Sep 2021 00:05 Pumas Tabasco - Atlante 0:2 +221 +186 +134 Mexico Liga de Expansion MX
3 09 Sep 2021 00:05 Panama - Mexico 1:1 +518 +250 -156 World World Cup 2022
4 09 Sep 2021 00:10 Defensa y Justicia - Tigre 0:1 pen. +138 +199 +214 Argentina Copa Argentina
... ... ... ... ... ... ... ... ... ...
1987 16 Sep 2021 19:00 Olympiacos Piraeus - Antwerp NaN -137 +296 +371 Europe Europa League
1988 16 Sep 2021 19:15 Academica - Estrela NaN -106 +231 +290 Portugal Liga Portugal 2
1989 16 Sep 2021 21:00 Barnechea - Rangers NaN +202 +202 +127 Chile Primera B
1990 16 Sep 2021 22:00 San Marcos de Arica - S. Morning NaN +212 +214 +122 Chile Primera B
1991 16 Sep 2021 23:30 U. De Concepcion - Coquimbo NaN +158 +198 +162 Chile Primera B
[1992 rows x 9 columns]