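I am working with two DataFrames: one contains a list of players, and the other contains play-by-play data for the players from the first DataFrame. The relevant rows of each are shown below.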
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
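What I want to do is create a column in the first DataFrame that counts how many times the string (df['Name'] + ' scored') appears in the column of the other DataFrame. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", and so on. I know you can use str.contains for this kind of thing, but it only seems to work when you pass an explicit string. For example,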
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine, but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
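it returns the error "'Series' objects are mutable, thus they cannot be hashed". I have looked at various similar questions, but I cannot for the life of me find a solution to this problem. Any help would be greatly appreciated, thank you!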
Best answer
I think you need findall with a regex that joins all the values of Name, then create indicator columns with MultiLabelBinarizer and add all the missing columns with reindex:
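import re
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Build the search strings, e.g. "Matt Carpenter scored"
s = df1['Name'] + ' scored'
# One alternation pattern; the non-capturing group makes \b apply to every alternative
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, s)))

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)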
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Finally, if necessary, join the result to df2:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0
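The indicator table above has one row per play, while the question asks for one count per player in the first DataFrame. A minimal sketch of one way to get that count directly is below. It is an alternative to the MultiLabelBinarizer approach, not part of the original answer. The names df1, df2 and count_scored are assumptions for illustration (df1 stands in for batter_game_logs_df, df2 for play_by_play_SP_df), and the data is only the rows shown in the question.

```python
import re
import pandas as pd

# Sample data copied from the rows shown in the question.
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})


def count_scored(name, plays):
    """Count occurrences of '<name> scored' across all play descriptions.

    Hypothetical helper: str.count takes a single regex pattern, so it is
    called once per name rather than with a whole Series.
    """
    pattern = r'\b' + re.escape(name + ' scored') + r'\b'
    return int(plays.str.count(pattern).sum())


# One count per player, aligned with the rows of df1.
df1['R vs SP'] = [count_scored(name, df2['Play']) for name in df1['Name']]
print(df1)
```

This calls str.count once per player, which should be fine for a roster-sized list. If the indicator table is already built, summing its columns gives the number of plays in which each player scored, which is the same total unless a single play mentions the same player scoring more than once.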