Quick question.
I am trying to create a column in a df that classifies the values in another column. See my code below.

df['maker_grp'] = np.nan
for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
    df['maker_grp'][key] = 'Class1'
for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
    df['maker_grp'][key] = 'Class2'
df['maker_grp'] = df.maker_grp.fillna('Class3')


It works perfectly, but I have a feeling there is a more Pythonic way that needs less processing. Please help. Thanks.

Best answer

Use numpy.select:

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')


Sample:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
#print (df)

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
print (df)
  maker_nm maker_grp
0    Sam 1    Class1
1    Joe 5    Class3
2   Paul 7    Class2
3   Mike 0    Class1
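
Two details of this approach are easy to miss: np.select takes the first condition that is True for each row, and str.contains returns NaN for missing values unless na=False is passed. The sketch below illustrates both on made-up data; the rows 'Sam and Paul' and None are assumptions added for illustration and are not part of the original sample.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Sam and Paul' matches both patterns, None is a missing name.
df = pd.DataFrame({'maker_nm': ['Sam 1', 'Sam and Paul', 'Paul 7', 'Joe 5', None]})

# na=False turns missing values into False so the masks stay purely boolean.
m1 = df['maker_nm'].str.contains("Sam|Mike", na=False)
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay", na=False)

# Conditions are checked in list order, so m1 wins where both masks are True.
df['maker_grp'] = np.select([m1, m2], ['Class1', 'Class2'], default='Class3')
print(df)
```

Here 'Sam and Paul' ends up in Class1 because m1 comes first in the condition list, and the missing name falls through to the default.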


If there are many conditions, apply with a custom function should be faster:

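import re

# Compile the patterns once, outside the function, instead of on every row.
p1 = re.compile("Sam|Mike")
p2 = re.compile("Andy|John|Paul|Jay")

def f(x):
    # search() looks anywhere in the string, like str.contains;
    # match() would only test the start of the string.
    if p1.search(x):
        return 'Class1'
    elif p2.search(x):
        return 'Class2'
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)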
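
When porting str.contains to the re module, keep in mind that str.contains behaves like re.search (a hit anywhere in the string), while re.match only tests the beginning. A minimal sketch of the difference, using a made-up name 'Mr. Sam 1' that is not in the original data:

```python
import re

import pandas as pd

pattern = re.compile("Sam|Mike")

# Hypothetical value where the name is not at the start of the string.
name = "Mr. Sam 1"

matched_at_start = bool(pattern.match(name))    # anchored at position 0
found_anywhere = bool(pattern.search(name))     # scans the whole string
contains_result = bool(pd.Series([name]).str.contains("Sam|Mike").iloc[0])

print(matched_at_start, found_anywhere, contains_result)
```

For the sample data in this answer every name starts with the keyword, so both methods agree there; they diverge as soon as the keyword appears later in the string.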


Timings:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})

df = pd.concat([df] * 1000, ignore_index=True)

#print (df)

In [117]: %%timeit
     ...: df['maker_grp'] = np.nan
     ...: for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
     ...:     df['maker_grp'][key] = 'Class1'
     ...: for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
     ...:     df['maker_grp'][key] = 'Class2'
     ...: df['maker_grp'] = df.maker_grp.fillna('Class3')
     ...:

In [118]: %%timeit
     ...: m1 = df['maker_nm'].str.contains("Sam|Mike")
     ...: m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
     ...:
     ...: df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
     ...:
100 loops, best of 3: 5.98 ms per loop

In [119]: %%timeit
     ...: df['maker_grp'] = df['maker_nm'].apply(f)
     ...:
100 loops, best of 3: 7.38 ms per loop


Caveat:

Performance really depends on the data and the number of conditions.
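
Because of that, it is worth measuring on your own data rather than relying on the numbers in this answer. Outside IPython, a rough harness with the standard timeit module could look like the sketch below; the helper names with_select, with_apply and classify are hypothetical, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df = pd.concat([df] * 1000, ignore_index=True)

def with_select(frame):
    # Vectorized masks combined with np.select.
    m1 = frame['maker_nm'].str.contains("Sam|Mike")
    m2 = frame['maker_nm'].str.contains("Andy|John|Paul|Jay")
    return np.select([m1, m2], ['Class1', 'Class2'], default='Class3')

def classify(x):
    # Plain substring checks, evaluated row by row.
    if 'Sam' in x or 'Mike' in x:
        return 'Class1'
    if any(k in x for k in ('Andy', 'John', 'Paul', 'Jay')):
        return 'Class2'
    return 'Class3'

def with_apply(frame):
    return frame['maker_nm'].apply(classify)

# Check that both approaches agree before comparing their speed.
select_result = list(with_select(df))
apply_result = list(with_apply(df))
print(select_result == apply_result)

runs = 20
for label, func in (('np.select', with_select), ('apply', with_apply)):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{label}: {seconds / runs * 1000:.2f} ms per call")
```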

Edit: with many conditions, checking substrings with apply is faster:

m1 = df['maker_nm'].str.contains("Sam", regex=False)
m2 = df['maker_nm'].str.contains("Mike", regex=False)
m3 = df['maker_nm'].str.contains("Andy", regex=False)
m4 = df['maker_nm'].str.contains("John", regex=False)
m5 = df['maker_nm'].str.contains("Paul", regex=False)
m6 = df['maker_nm'].str.contains("Jay", regex=False)

df['maker_grp'] = np.select([m1,m2,m3,m4,m5,m6], ['Class1','Class1','Class2','Class2','Class2','Class2'], default='Class3')
print (df)

def f(x):

    if 'Sam' in x:
        return 'Class1'
    elif 'Mike' in x:
        return 'Class1'
    elif 'Andy' in x:
        return 'Class2'
    elif 'John' in x:
        return 'Class2'
    elif 'Paul' in x:
        return 'Class2'
    elif 'Jay' in x:
        return 'Class2'
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)
print (df)
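
As the number of keywords grows, the if/elif chain can be replaced by a lookup table. This is only a sketch of the same substring technique: the names KEYWORD_TO_CLASS and classify are hypothetical, and the dict order determines which keyword wins when several match.

```python
import pandas as pd

# Hypothetical lookup table: substring -> class label.
# Dicts preserve insertion order, so earlier keywords take priority.
KEYWORD_TO_CLASS = {
    'Sam': 'Class1',
    'Mike': 'Class1',
    'Andy': 'Class2',
    'John': 'Class2',
    'Paul': 'Class2',
    'Jay': 'Class2',
}

def classify(name, default='Class3'):
    # Return the label of the first keyword found in the name.
    for keyword, label in KEYWORD_TO_CLASS.items():
        if keyword in name:
            return label
    return default

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df['maker_grp'] = df['maker_nm'].apply(classify)
print(df)
```

Adding a new keyword then only means adding one dict entry instead of another elif branch.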




In [133]: %%timeit
     ...: m1 = df['maker_nm'].str.contains("Sam", regex=False)
     ...: m2 = df['maker_nm'].str.contains("Mike", regex=False)
     ...: m3 = df['maker_nm'].str.contains("Andy", regex=False)
     ...: m4 = df['maker_nm'].str.contains("John", regex=False)
     ...: m5 = df['maker_nm'].str.contains("Jay", regex=False)
     ...:
     ...: df['maker_grp'] = np.select([m1,m2,m3,m4,m5], ['Class1','Class1', 'Class2','Class2','Class2'], default='Class3')
     ...:
100 loops, best of 3: 5.79 ms per loop

In [134]: %%timeit
     ...: df['maker_grp'] = df['maker_nm'].apply(f)
     ...:
1000 loops, best of 3: 1.41 ms per loop

07-27 13:35