我有一个数据帧(game_df
)和一个Postgres表(team_stats_1970_2017
)。game_df
由几千行数据组成,数据如下。。。
season_yr home_team visitor_team home_team_runs visitor_team_runs
0 2017 ARI SFG 6 5
1 2017 ARI SFG 4 8
2 2017 ARI SFG 8 6
3 2017 ARI SFG 9 3
4 2017 ARI CLE 7 3
5 2017 ARI CLE 11 2
6 2017 ATL LAD 2 3
team_stats_1970_2017
会有相应的数据 team season_yr r_per_g pa ab b_r b_h b2 b3 b_hr
0 ARI 2017 5.01 6224.0 5525 812 1405 314 39 220
1 ATL 2017 4.52 6216.0 5584 732 1467 289 26 165
2 CLE 2017 5.05 6234.0 5511 818 1449 333 29 212
3 LAD 2017 4.75 6191.0 5408 770 1347 312 20 221
4 SFG 2017 3.94 6137.0 5551 639 1382 290 28 128
例如,对于
game_df
的第1行,代码从Postgres的team_stats_1970_2017
中选择“ARI”和“SFG”数据,并由此创建特性。然后对game_df
中的其余行重复此操作。我目前正在使用
df.iterrows
,但我注意到它相当慢,因为我只是测试我的一小部分数据,它仍然需要一段时间。有没有更好/更快的替代方案?features = []
results = []
for index,row in game_df.iterrows():
import psycopg2 as pg2
connect = pg2.connect(login)
cursor=connect.cursor()
year, t1, t2, p1, p2 = row
p1, p2 = map(int, [p1, p2])
feature1 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t1,year))
feature1 = list(cursor.fetchone()[2::])
feature2 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t2,year))
feature2 = list(cursor.fetchone()[2::])
feature = np.array(feature2) - np.array(feature1)
## Calculate result of game
if (p2 - p1) > 0:
result = 1
else:
result = 0
features.append(feature)
results.append(result)
最佳答案
这是另一种更容易理解的方法,但使用merge
作为@sacul的解决方案。我将分别为df_visitor
和df_home
列中的每一行和团队创建两个数据帧team_stats_1970_2017
和'visitor_team'
的值,例如:
df_visitor = (game_df[['season_yr','visitor_team']].rename(columns={'visitor_team':'team'})
.merge(team_stats_1970_2017, how='left'))
df_home = (game_df[['season_yr','home_team']].rename(columns={'home_team':'team'})
.merge(team_stats_1970_2017, how='left'))
例如,你会得到:
season_yr team r_per_g pa ab b_r b_h b2 b3 b_hr
0 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
1 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
2 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
3 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
4 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
5 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
6 2017 ATL 4.52 6216.0 5584 732 1467 289 26 165
对于每一行,它是与列
'home_team'
中同一行的df_home
中的团队关联的team_stats_1970_2017
的值。现在要在原始数据帧上添加差异,可以执行以下操作:
# first get the lists of columns you want to add
col_features = team_stats_1970_2017.columns[2:]
game_df[col_features] = df_visitor[col_features] - df_home[col_features]
最后,要添加列结果,可以使用
game_df
:import numpy as np
game_df['results'] = np.where(game_df['visitor_team_runs'] > game_df['home_team_runs'], 1, 0)
# meaning if p2 > p1 then 1 else 0 as in your code I think
关于python - df.iterrows()的替代方案,用于连接两个Postgres表和计算功能,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/51662445/