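I have a DataFrame with columns A, B and C. I want to compare columns B and C and create two new columns, A-1 (the current row's year minus 1) and A-2 (the current row's year minus 2), where A is the year column.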

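Example: for 1971, column B has apple and orange while column C has only apple; for 1970, B has banana and apple, and C has apple.
Now we try to generate the values of the A-1 (1970) and A-2 (1969) columns for each 1971 row. In A-1 (1970), apple was captured in both B and C, so we mark the first two 1971 rows as "yes", and we mark the third row as "no" because orange does not appear in 1970.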

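So, for each row, we take the year (e.g. 1971) and its B value, and check whether that particular B value was also captured in column C in year-1 (1970) and in year-2 (1969), and mark it accordingly. In this case there is no record for 1969 in the DataFrame, so that value will be nan.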

DataFrame:

 A          B          C       D
1971       apple    apple      yes
1971       apple    apple      yes
1971       orange   nan        no
1970       banana   nan        no
1970       apple    apple      yes
1972       mango    mango      yes
1972       banana   banana     yes
1972       orange   orange     yes
1972       apple    apple      yes
1973       banana    nan       no
1973       mango     mango     yes
1973       apple     nan       no
1974       orange    nan       no


Output:

 A          B          C       A-1    A-2
1971       apple    apple      yes    nan
1971       apple    apple      yes    nan
1971       orange   nan        no     nan
1970       banana   nan        nan    nan
1970       apple    apple      nan    nan
1972       mango    mango      no     no
1972       banana   banana     no     no
1972       orange   orange     no     no
1972       apple    apple      yes    yes
1973       banana    nan       yes    no
1973       mango     mango     yes    no
1973       apple     nan       yes    yes
1974       orange    nan       no     yes


I don't know how to do this; please help.

Best answer

import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'A': [1971, 1971, 1971, 1970, 1970, 1972, 1972, 1972, 1972, 1973, 1973, 1973, 1974], 'B': ['apple', 'apple', 'orange', 'banana', 'apple', 'mango', 'banana', 'orange', 'apple', 'banana', 'mango', 'apple', 'orange'], 'C': ['apple', 'apple', nan, nan, 'apple', 'mango', 'banana', 'orange', 'apple', nan, 'mango', nan, nan]})

# add an index column to the DataFrame
df = df.reset_index()
df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
A_min = df['A'].min()


for i in [1, 2]:
    col = 'A-{}'.format(i)
    col2 = 'Y+{}'.format(i)
    df[col2] = df['A']+i
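    # fill with NaNs; object dtype lets the 'yes'/'no' strings be assigned below
    df[col] = pd.Series(nan, index=df.index, dtype=object)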
    # place 'no' except where there is no data for the year A-i
    mask = df['A']-i >= A_min
    df.loc[mask, col] = 'no'

    # place 'yes' where 'A','B' columns match 'Y+i','BC' columns
    match = pd.merge(df[['A','B','index']], df[[col2, 'BC']],
                     left_on=['A','B'], right_on=[col2,'BC'])
    df.loc[match['index'], col] = 'yes'

df = df.drop(['index', 'BC', 'Y+1', 'Y+2'], axis=1)
print(df)


yields

       A       B       C  A-1  A-2
0   1971   apple   apple  yes  NaN
1   1971   apple   apple  yes  NaN
2   1971  orange     NaN   no  NaN
3   1970  banana     NaN  NaN  NaN
4   1970   apple   apple  NaN  NaN
5   1972   mango   mango   no   no
6   1972  banana  banana   no   no
7   1972  orange  orange   no   no
8   1972   apple   apple  yes  yes
9   1973  banana     NaN  yes   no
10  1973   mango   mango  yes   no
11  1973   apple     NaN  yes  yes
12  1974  orange     NaN   no  yes
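
One thing to keep in mind about the `mask` step in the code above: `df['A']-i >= A_min` only checks that year A-i is not earlier than the first year, so it implicitly assumes the years in A have no gaps. The sketch below uses made-up years (the `years` Series and both mask names are illustrative, not part of the answer) to compare that test with an `isin`-based one that asks whether year A-i is actually present:

```python
import pandas as pd

# Hypothetical years with a gap: 1972 is missing
years = pd.Series([1970, 1971, 1973, 1974], name='A')

i = 1
# The answer's test: any year A-i at or above the minimum counts as "has data"
mask_min = years - i >= years.min()
# Alternative test: is year A-i actually present in the column?
mask_isin = (years - i).isin(years)

print(pd.DataFrame({'A': years, 'min-based': mask_min, 'isin-based': mask_isin}))
```

For the data in the question the two tests agree, because every year from 1970 to 1974 is present. Whether the difference matters depends on your own data.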




How it works:

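First, let's add an index column to the DataFrame. The purpose will become clearer later. (Note that I am assuming here that the original index of your DataFrame has unique values. We will rely on that property later...)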

df = df.reset_index()
#     index     A       B       C
# 0       0  1971   apple   apple
# 1       1  1971   apple   apple
# 2       2  1971  orange     NaN
# 3       3  1970  banana     NaN
# 4       4  1970   apple   apple
# 5       5  1972   mango   mango
# 6       6  1972  banana  banana
# 7       7  1972  orange  orange
# 8       8  1972   apple   apple
# 9       9  1973  banana     NaN
# 10     10  1973   mango   mango
# 11     11  1973   apple     NaN
# 12     12  1974  orange     NaN
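
A detail worth checking on your own data: `reset_index()` names the new column `index` only when the original index is unnamed, and the later `df.loc[match['index'], ...]` lookup works here because, for a default 0..n-1 index, the values stored in that column coincide with the new row labels. A small sketch with a made-up frame (`demo`, `flat`, `clean` and `row_id` are illustrative names, not part of the answer):

```python
import pandas as pd

# Hypothetical frame with a non-default, unnamed index
demo = pd.DataFrame({'A': [1971, 1970]}, index=[10, 20])

# Unnamed index: the old labels are stored in a column called 'index',
# while the new row labels are 0, 1
flat = demo.reset_index()
print(flat.columns.tolist(), flat['index'].tolist(), flat.index.tolist())

# Named index: the new column takes the index name instead of 'index'
demo.index.name = 'row_id'
named_cols = demo.reset_index().columns.tolist()
print(named_cols)

# Dropping the old index first makes the 'index' column equal the row labels
clean = demo.reset_index(drop=True).reset_index()
print(clean['index'].tolist() == clean.index.tolist())
```

If your index is not the default one, calling `df.reset_index(drop=True).reset_index()` is one way to get the same layout as shown above; this is a suggestion, not something the original answer does.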


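Since we want to identify the particular rows that have the same value in columns B and C, let's create a BC column which equals B when B and C are equal, and NaN when they are not: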

In [123]: df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
In [124]: df
Out[124]:
    index     A       B       C      BC
0       0  1971   apple   apple   apple
1       1  1971   apple   apple   apple
2       2  1971  orange     NaN     NaN
3       3  1970  banana     NaN     NaN
4       4  1970   apple   apple   apple
5       5  1972   mango   mango   mango
6       6  1972  banana  banana  banana
7       7  1972  orange  orange  orange
8       8  1972   apple   apple   apple
9       9  1973  banana     NaN     NaN
10     10  1973   mango   mango   mango
11     11  1973   apple     NaN     NaN
12     12  1974  orange     NaN     NaN
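
As a side note on why this works: comparing a string with NaN using `==` gives False, so rows where C is missing fall through to the NaN branch of `np.where`. A minimal sketch with a made-up three-row frame (the names `demo` and `equal` are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame, not the data from the question
demo = pd.DataFrame({'B': ['apple', 'orange', 'banana'],
                     'C': ['apple', np.nan, 'kiwi']})

# Element-wise comparison: a missing value never compares equal
equal = demo['B'] == demo['C']

# Keep B where the two columns agree, NaN everywhere else
demo['BC'] = np.where(equal, demo['B'], np.nan)
print(demo)
```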


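Now we are going to match rows from different years, so let's add a column that pins down the year we are interested in comparing against. For example, we want to compare the rows where A is 1971 with the rows where Y+1 equals 1971: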

In [125]: df['Y+1'] = df['A']+1; df
Out[125]:
    index     A       B       C      BC   Y+1
0       0  1971   apple   apple   apple  1972
1       1  1971   apple   apple   apple  1972
2       2  1971  orange     NaN     NaN  1972
3       3  1970  banana     NaN     NaN  1971
4       4  1970   apple   apple   apple  1971
5       5  1972   mango   mango   mango  1973
6       6  1972  banana  banana  banana  1973
7       7  1972  orange  orange  orange  1973
8       8  1972   apple   apple   apple  1973
9       9  1973  banana     NaN     NaN  1974
10     10  1973   mango   mango   mango  1974
11     11  1973   apple     NaN     NaN  1974
12     12  1974  orange     NaN     NaN  1975


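With this setup, we can identify the rows that should be marked "yes" by merging df with itself,
matching columns A and B against columns Y+1 and BC: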

In [127]: pd.merge(df[['A','B','index']], df[['Y+1', 'BC']], left_on=['A','B'], right_on=['Y+1','BC'])
Out[127]:
      A       B  index   Y+1      BC
0  1971   apple      0  1971   apple
1  1971   apple      1  1971   apple
2  1972   apple      8  1972   apple
3  1972   apple      8  1972   apple
4  1973  banana      9  1973  banana
5  1973   mango     10  1973   mango
6  1973   apple     11  1973   apple


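Notice that the index column indicates the index of the rows which should contain yes in the A-1 column. This is the purpose of calling df = df.reset_index() above. Without it, we would lose the original index when merging.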
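
The merge result can list the same left-hand row more than once when the previous year contains several identical (year, fruit) pairs, which is why index 8 shows up twice above. Assigning "yes" to the same label twice is harmless, but if you prefer one match per row, a possible variation is to drop missing values and duplicates on the right-hand side first. The frame below is a made-up four-row example and `right` is an illustrative name:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical data: two identical apple/apple rows in 1971
df = pd.DataFrame({
    'A': [1971, 1971, 1970, 1972],
    'B': ['apple', 'apple', 'apple', 'apple'],
    'C': ['apple', 'apple', 'apple', nan],
}).reset_index()
df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
df['Y+1'] = df['A'] + 1

# One right-hand row per (year, fruit) pair is enough to decide "yes"
right = df[['Y+1', 'BC']].dropna().drop_duplicates()
match = pd.merge(df[['A', 'B', 'index']], right,
                 left_on=['A', 'B'], right_on=['Y+1', 'BC'])
print(match)
```

Here the 1972 row matches only once, even though 1971 has two apple/apple rows.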
