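I want to parse row values into column names and use them to look up values in a pandas DataFrame

I have tried several approaches, including .loc indexing, without success.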

import pandas as pd

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO


Build a toy dataset:

coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30

""")
df = pd.read_csv(coltable, sep=";")


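I want to access the columns named by each row's pair (for example, in the first row the pair is NANB, so NA = 10 and NB = 60) and use those values to create a new column (desired_result = 10 + 60 = 70).

I would like the function that creates the new column to be compatible with np.vectorize, because the dataset is large.

Something like this, where myfunc is the function I am looking for: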

df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])


Many thanks for any help!

Best answer

Use DataFrame.lookup (available in pandas versions before 2.0):

a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])

df['new'] = a + b
print (df)
   NA  NB  NC  ND  pair  desired_result  new
0  10  60  50  20  NANB              70   70
1  20  30  10   5  NANC              30   30
2  40  30  20  10  NCND              30   30
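
On pandas 2.0 and later, where DataFrame.lookup is no longer available, the same row-by-row lookup can be sketched with NumPy integer indexing. This is a minimal sketch, not part of the original answer: it assumes every pair is exactly two 2-character column names, and the names value_cols, first_idx and second_idx are illustrative.

```python
import numpy as np
import pandas as pd
from io import StringIO

coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")

# Columns that a pair code may refer to (assumed fixed for this sketch).
value_cols = ["NA", "NB", "NC", "ND"]
values = df[value_cols].to_numpy()

# Translate each half of the pair code into a column position.
# get_indexer returns -1 for names that are not in value_cols.
col_index = pd.Index(value_cols)
first_idx = col_index.get_indexer(df["pair"].str[:2])
second_idx = col_index.get_indexer(df["pair"].str[2:])

# Fail loudly instead of letting -1 silently select the last column.
if (first_idx < 0).any() or (second_idx < 0).any():
    raise KeyError("pair contains a name that is not in value_cols")

# Pick one value per row from each of the two looked-up columns.
rows = np.arange(len(df))
df["new"] = values[rows, first_idx] + values[rows, second_idx]
```

Because the selection is done with a single NumPy fancy-indexing step per half of the pair, no Python-level loop over rows is needed.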


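Alternatively, if there are no missing values, you can split the pair with a list comprehension or apply. Timings on the data repeated 10,000 times are similar for all three variants: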

#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)

In [263]: %%timeit
     ...: a = df.lookup(df.index, df['pair'].str[:2])
     ...: b = df.lookup(df.index, df['pair'].str[2:])
     ...:
     ...: df['new'] = a + b
     ...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [264]: %%timeit
     ...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
     ...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
     ...:
     ...: df['new'] = a + b
     ...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [265]: %%timeit
     ...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
     ...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
     ...:
     ...: df['new'] = a + b
     ...:
     ...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
