我有一个数据帧(game_df)和一个Postgres表(team_stats_1970_2017)。
game_df由几千行数据组成,数据如下。。。

      season_yr home_team visitor_team  home_team_runs  visitor_team_runs
0         2017       ARI          SFG               6                  5
1         2017       ARI          SFG               4                  8
2         2017       ARI          SFG               8                  6
3         2017       ARI          SFG               9                  3
4         2017       ARI          CLE               7                  3
5         2017       ARI          CLE              11                  2
6         2017       ATL          LAD               2                  3

team_stats_1970_2017会有相应的数据
   team season_yr  r_per_g pa      ab    b_r  b_h   b2   b3   b_hr
0  ARI      2017     5.01  6224.0  5525  812  1405  314  39   220
1  ATL      2017     4.52  6216.0  5584  732  1467  289  26   165
2  CLE      2017     5.05  6234.0  5511  818  1449  333  29   212
3  LAD      2017     4.75  6191.0  5408  770  1347  312  20   221
4  SFG      2017     3.94  6137.0  5551  639  1382  290  28   128

例如,对于game_df的第1行,代码从Postgres的team_stats_1970_2017中选择“ARI”和“SFG”数据,并由此创建特性。然后对game_df中的其余行重复此操作。
我目前正在使用df.iterrows,但我注意到它相当慢,因为我只是测试我的一小部分数据,它仍然需要一段时间。有没有更好/更快的替代方案?
features = []
results = []
for index,row in game_df.iterrows():
        import psycopg2 as pg2
        connect = pg2.connect(login)
        cursor=connect.cursor()

        year, t1, t2, p1, p2 = row
        p1, p2 = map(int, [p1, p2])
        feature1 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t1,year))
        feature1 = list(cursor.fetchone()[2::])
        feature2 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t2,year))
        feature2 = list(cursor.fetchone()[2::])
        feature = np.array(feature2) - np.array(feature1)

        ## Calculate result of game
        if (p2 - p1) > 0:
            result = 1
        else:
            result = 0

        features.append(feature)
        results.append(result)

最佳答案

这是另一种更容易理解的方法,但使用merge作为@sacul的解决方案。我将分别为df_visitordf_home列中的每一行和团队创建两个数据帧team_stats_1970_2017'visitor_team'的值,例如:

df_visitor = (game_df[['season_yr','visitor_team']].rename(columns={'visitor_team':'team'})
                                                   .merge(team_stats_1970_2017, how='left'))

df_home = (game_df[['season_yr','home_team']].rename(columns={'home_team':'team'})
                                             .merge(team_stats_1970_2017, how='left'))

例如,你会得到:
   season_yr team  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
1       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
2       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
3       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
4       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
5       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
6       2017  ATL     4.52  6216.0  5584  732  1467  289  26   165

对于每一行,它是与列'home_team'中同一行的df_home中的团队关联的team_stats_1970_2017的值。
现在要在原始数据帧上添加差异,可以执行以下操作:
# first get the lists of columns you want to add
col_features = team_stats_1970_2017.columns[2:]
game_df[col_features] = df_visitor[col_features] - df_home[col_features]

最后,要添加列结果,可以使用game_df
import numpy as np
game_df['results'] = np.where(game_df['visitor_team_runs'] > game_df['home_team_runs'], 1, 0)
# meaning if p2 > p1 then 1 else 0 as in your code I think

关于python - df.iterrows()的替代方案,用于连接两个Postgres表和计算功能,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/51662445/

10-14 19:35