Suppose we have the following DataFrame:

import pandas as pd
df = pd.read_csv('subjects.csv')
Col A,              Interest, Col Start, Col Go, Col Learn,
Learn English Lit
Go Mathematics
Start Science
Learn Science
Go English
Start Math
Learn Math
Go Biology
Start English


I wrote some code to extract the interests from a similar dataset, as shown below:

#Map Interests
Mapper = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science']
#Join Mapper to Interest Column
pat = '|'.join(r"\b{}\b".format(x) for x in Mapper)
df['interest'] = df['col A'].str.extract('('+ pat + ')', expand=False)


#Align Interest Names by creating a dict and replacing values
d = {'English Lit' : 'English', 'Biology' : 'Science', 'Mathematics' : 'Maths'}
df['Interests'] = df['interest'].replace(d)

>>> Output:

Col A,              Interest, Col Start, Col Go, Col Learn,
    Learn English Lit   English
    Go Mathematics      Maths
    Start Science       Science
    Learn Science       Science
    Go English          English
    Start Math          Maths
    Learn Math          Maths
    Go Biology          Science
    Start English       English


Now I need to measure engagement in Col A by keyword and interest.

I have done it as follows, but I am sure there is a better way to do this.

df['Col Start'][df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science")] = 'Learn'


Also, what is the best way to append multiple values to a single column? For example, if I have:

Col A
Learn Science, Math, Biology.


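I want to map keyword + interest to a new column, with the values separated by commas. This is where my current script falls apart: it overwrites the previous value with the new one, whereas I am trying to capture all of the engagements (if that makes sense).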

Col A                         Col B
Learn Science, Math, Biology. Learn S, Learn M, Learn B


Any help would be appreciated. (Please be gentle, I have only been coding since February!)

Edit for clarity:

df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn S'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("English"), 'Col Start'] = 'Learn E'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Math"), 'Col Start'] = 'Learn M'


Col A                Col Learn
Learn Science, Math  Learn S, Learn M
Learn Math, English  Learn M, Learn E
Learn Science        Learn S.


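In my DataFrame, the "Col" values and the "Interest" values may overlap and recur in the output. What I want is to capture everything rather than overwrite it, appending each new entry with a comma.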

Best answer

I think you need findall to extract all of the values, a list comprehension to prepend the string to each one, and join to combine them.
EDIT:

# It is better to use loc to set the new column
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn'
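
To illustrate why loc is suggested here, the following is a minimal sketch on made-up rows (the sample data and the shortened pattern are assumptions, not the asker's file). It builds the boolean mask once and passes the row selector and the column name to a single loc call, so the assignment targets df itself. The chained form df['Col Start'][mask] = ... goes through an intermediate object and, depending on the pandas version, may emit a warning or leave df unchanged.

```python
import pandas as pd

# Hypothetical sample rows standing in for subjects.csv
df = pd.DataFrame({'col A': ['Learn Science', 'Go English', 'Learn Math', None]})
df['interest'] = df['col A'].str.extract(r'\b(Science|English|Math)\b', expand=False)

# Build the boolean mask once; na=False treats missing values as "no match"
mask = (df['col A'].str.contains('Learn', na=False)
        & df['interest'].str.contains('Science', na=False))

# One .loc call selects the rows and the column together, so the value is
# written into df directly; rows outside the mask are left as NaN
df.loc[mask, 'Col Start'] = 'Learn'
print(df)
```

Only the first row satisfies both conditions, so it is the only one that receives a value in 'Col Start'.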

df['new'] = df['col A'].str.findall('('+ pat + ')').apply(lambda x: ', '.join(['Learn ' + y for y in x]))
print (df)

                           col A     interest Interests Col Start  \
0              Learn English Lit      English   English       NaN
1                 Go Mathematics  Mathematics     Maths       NaN
2                  Start Science      Science   Science       NaN
3                  Learn Science      Science   Science     Learn
4                     Go English      English   English       NaN
5                     Start Math         Math      Math       NaN
6                     Learn Math         Math      Math       NaN
7                     Go Biology      Biology   Science       NaN
8  Learn Science, Math, Biology.      Science   Science     Learn

                                        new
0                             Learn English
1                         Learn Mathematics
2                             Learn Science
3                             Learn Science
4                             Learn English
5                                Learn Math
6                                Learn Math
7                             Learn Biology
8  Learn Science, Learn Math, Learn Biology
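
The line above hardcodes the 'Learn ' prefix for every row. If the keyword should instead come from each row, one possible variation is sketched below. The keywords list, the engagements helper, the 'Col B' column name, and the sample rows are all assumptions made for illustration; they are not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical sample rows modelled on the question's data
df = pd.DataFrame({'col A': ['Learn Science, Math, Biology.',
                             'Go Mathematics',
                             'Start English Lit']})

keywords = ['Learn', 'Go', 'Start']  # assumed engagement keywords
interests = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science']

# Same word-boundary pattern style as in the question
kw_pat = re.compile('|'.join(r'\b{}\b'.format(k) for k in keywords))
int_pat = re.compile('|'.join(r'\b{}\b'.format(i) for i in interests))

def engagements(text):
    """Return 'keyword interest' pairs found in text, comma-separated."""
    if not isinstance(text, str):
        return ''  # missing values produce an empty string
    kw = kw_pat.search(text)  # first keyword found in the row
    if kw is None:
        return ''
    # Prefix every matched interest with the row's keyword, then join
    return ', '.join('{} {}'.format(kw.group(), i) for i in int_pat.findall(text))

df['Col B'] = df['col A'].apply(engagements)
print(df)
```

Because every match is collected before joining, earlier values are never overwritten by later ones. Shortening the interests to "S", "M" and so on would need an extra mapping step, which is left out here.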

10-07 14:21