Quick question.
I'm trying to create a column in a df that classifies the values of another column. See my code below.
df['maker_grp'] = np.nan
for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
    df['maker_grp'][key] = 'Class1'
for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
    df['maker_grp'][key] = 'Class2'
df['maker_grp'] = df.maker_grp.fillna('Class3')
It works perfectly, but I have a feeling there is a more Pythonic way that does less processing. Please help. Thanks.
Best answer
Use numpy.select:
m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
Sample:
df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
#print (df)
m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
print (df)
  maker_nm maker_grp
0    Sam 1    Class1
1    Joe 5    Class3
2   Paul 7    Class2
3   Mike 0    Class1
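One behavior worth checking before swapping the loop for np.select is what happens when a name matches both patterns. The sketch below uses a made-up row ('Sam Paul', not from the question) to show that np.select takes the choice of the first condition that is True. The original loop does the opposite for such rows, because its second pass overwrites the first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Sam Paul' matches both patterns.
df = pd.DataFrame({'maker_nm': ['Sam Paul', 'Paul 7', 'Joe 5']})

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

# np.select uses the first condition that is True for each row,
# so the row matching both masks is labelled 'Class1'.
df['maker_grp'] = np.select([m1, m2], ['Class1', 'Class2'], default='Class3')
print(df['maker_grp'].tolist())  # ['Class1', 'Class2', 'Class3']
```

If the overlapping rows should end up in Class2 instead, listing m2 before m1 is enough.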
With many conditions, a custom function with apply should be faster:
import re
def f(x):
    p1 = re.compile("Sam|Mike")
    p2 = re.compile("Andy|John|Paul|Jay")
    # search() scans the whole string, like str.contains
    if p1.search(x):
        return 'Class1'
    elif p2.search(x):
        return 'Class2'
    else:
        return 'Class3'
df['maker_grp'] = df['maker_nm'].apply(f)
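A detail that matters when moving a str.contains mask over to the re module: str.contains looks for the pattern anywhere in the string, while re.match only tries the beginning. A minimal sketch of the difference, using a made-up name that does not appear in the sample data:

```python
import re

pattern = re.compile("Sam|Mike")

name = "Big Sam 1"  # hypothetical value with the keyword not at the start

# re.match only attempts a match at the beginning of the string.
at_start = pattern.match(name) is not None
# re.search scans the whole string, which is what str.contains does.
anywhere = pattern.search(name) is not None

print(at_start, anywhere)  # False True
```

For the sample frame every keyword sits at the start of the name, so both calls would give the same labels there.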
Timings:
df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)
In [117]: %%timeit
...: df['maker_grp'] = np.nan
...: for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
...:     df['maker_grp'][key] = 'Class1'
...: for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
...:     df['maker_grp'][key] = 'Class2'
...: df['maker_grp'] = df.maker_grp.fillna('Class3')
...:
In [118]: %%timeit
...: m1 = df['maker_nm'].str.contains("Sam|Mike")
...: m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
...:
...: df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
...:
100 loops, best of 3: 5.98 ms per loop
In [119]: %%timeit
...: df['maker_grp'] = df['maker_nm'].apply(f)
...:
100 loops, best of 3: 7.38 ms per loop
Caveat:
Performance really depends on the data and on the number of conditions.
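Since the result depends on the data, it may be worth timing both variants on your own frame. The following is only a sketch of such a comparison outside IPython, using the standard timeit module. The helper names with_select, with_apply and classify are made up for this example, and the frame is a stand-in to be replaced with real data.

```python
import re
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for real data; replace with your own frame.
df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df = pd.concat([df] * 1000, ignore_index=True)

P1 = re.compile("Sam|Mike")
P2 = re.compile("Andy|John|Paul|Jay")

def with_select(frame):
    # Vectorized masks combined by np.select.
    m1 = frame['maker_nm'].str.contains("Sam|Mike")
    m2 = frame['maker_nm'].str.contains("Andy|John|Paul|Jay")
    return np.select([m1, m2], ['Class1', 'Class2'], default='Class3')

def classify(name):
    # Row-wise classification with precompiled patterns.
    if P1.search(name):
        return 'Class1'
    if P2.search(name):
        return 'Class2'
    return 'Class3'

def with_apply(frame):
    return frame['maker_nm'].apply(classify).to_numpy()

# Make sure both variants agree before comparing speed.
assert list(with_select(df)) == list(with_apply(df))

for fn in (with_select, with_apply):
    best = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

The printed numbers will vary by machine, pandas version and data, so no expected output is shown.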
EDIT: For many conditions, checking substrings is faster with apply:
m1 = df['maker_nm'].str.contains("Sam", regex=False)
m2 = df['maker_nm'].str.contains("Mike", regex=False)
m3 = df['maker_nm'].str.contains("Andy", regex=False)
m4 = df['maker_nm'].str.contains("John", regex=False)
m5 = df['maker_nm'].str.contains("Paul", regex=False)
m6 = df['maker_nm'].str.contains("Jay", regex=False)
df['maker_grp'] = np.select([m1,m2,m3,m4,m5,m6], ['Class1','Class1', 'Class2','Class2','Class2','Class2'], default='Class3')
print (df)
def f(x):
    if 'Sam' in x:
        return 'Class1'
    elif 'Mike' in x:
        return 'Class1'
    elif 'Andy' in x:
        return 'Class2'
    elif 'John' in x:
        return 'Class2'
    elif 'Paul' in x:
        return 'Class2'
    elif 'Jay' in x:
        return 'Class2'
    else:
        return 'Class3'
df['maker_grp'] = df['maker_nm'].apply(f)
print (df)
In [133]: %%timeit
...: m1 = df['maker_nm'].str.contains("Sam", regex=False)
...: m2 = df['maker_nm'].str.contains("Mike", regex=False)
...: m3 = df['maker_nm'].str.contains("Andy", regex=False)
...: m4 = df['maker_nm'].str.contains("John", regex=False)
...: m5 = df['maker_nm'].str.contains("Jay", regex=False)
...:
...: df['maker_grp'] = np.select([m1,m2,m3,m4,m5], ['Class1','Class1', 'Class2','Class2','Class2'], default='Class3')
...:
100 loops, best of 3: 5.79 ms per loop
In [134]: %%timeit
...: df['maker_grp'] = df['maker_nm'].apply(f)
...:
1000 loops, best of 3: 1.41 ms per loop
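As the list of names grows, the if/elif chain gets long. One possible way to keep it manageable, shown here only as an untimed sketch, is to drive the same substring check from a dictionary. KEYWORDS and classify are hypothetical names introduced for this example, not part of the answer above.

```python
import pandas as pd

# Hypothetical keyword-to-class table; extend as needed.
KEYWORDS = {
    'Sam': 'Class1',
    'Mike': 'Class1',
    'Andy': 'Class2',
    'John': 'Class2',
    'Paul': 'Class2',
    'Jay': 'Class2',
}

def classify(name, default='Class3'):
    # Return the class of the first keyword found as a substring.
    for keyword, label in KEYWORDS.items():
        if keyword in name:
            return label
    return default

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df['maker_grp'] = df['maker_nm'].apply(classify)
print(df)
```

Dictionaries keep insertion order, so keywords are checked in the order they are written, which mirrors the order of the elif branches.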