我想将行值解析为列,并使用它们在pandas数据框中查找值
尝试过的尝试和.loc索引没有成功
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
建立玩具数据集
coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")
我想访问该对的列元素(例如,第一行NA = 10和NB = 60,并使用这些值创建一个新列(desired_result = 10 + 60 = 70)。
我希望函数在pandas中创建新列以与np.vectorize兼容,因为数据集很大
像这样:
df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])
非常感谢您提供的任何帮助!
最佳答案
a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])
df['new'] = a + b
print (df)
NA NB NC ND pair desired_result new
0 10 60 50 20 NANB 70 70
1 20 30 10 5 NANC 30 30
2 40 30 20 10 NCND 30 30
另外,如果没有可能的缺失值,请使用列表理解或应用:
#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)
In [263]: %%timeit
...: a = df.lookup(df.index, df['pair'].str[:2])
...: b = df.lookup(df.index, df['pair'].str[2:])
...:
...: df['new'] = a + b
...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %%timeit
...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
...:
...: df['new'] = a + b
...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %%timeit
...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
...:
...: df['new'] = a + b
...:
...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)