Suppose we have the following DataFrame:
    import pandas as pd
    df = pd.read_csv('subjects.csv')

    Col A, Interest, Col Start, Col Go, Col Learn
    Learn English Lit
    Go Mathematics
    Start Science
    Learn Science
    Go English
    Start Math
    Learn Math
    Go Biology
    Start English
I wrote some code to extract the interest from a similar dataset, as follows:
    # Map interests
    Mapper = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science']

    # Join Mapper into a regex pattern and extract the first match from col A
    pat = '|'.join(r"\b{}\b".format(x) for x in Mapper)
    df['interest'] = df['col A'].str.extract('(' + pat + ')', expand=False)

    # Align interest names by creating a dict and replacing values
    d = {'English Lit': 'English', 'Biology': 'Science', 'Mathematics': 'Maths'}
    # Read from the 'interest' column created above ('Interests' does not exist yet)
    df['Interests'] = df['interest'].replace(d)

Output:

    Col A, Interest, Col Start, Col Go, Col Learn
    Learn English Lit    English
    Go Mathematics       Maths
    Start Science        Science
    Learn Science        Science
    Go English           English
    Start Math           Maths
    Learn Math           Maths
    Go Biology           Science
    Start English        English
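Now I need to measure engagement in column A using the keyword and the interest.
I have done it as shown below, but I am sure there is a better way to do this.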
    df['Col Start'][df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science")] = 'Learn'
Also, what is the best way to append multiple values to a single column? For example, if I have:
    Col A
    Learn Science, Math, Biology.
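I want to map keyword + interest into a new column, with the values separated by commas. This is where my current script falls apart: each new value overwrites the previous one, whereas I am trying to capture all of the engagement (if that makes sense).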
    Col A                            Col B
    Learn Science, Math, Biology.    Learn S, Learn M, Learn B
Any help would be appreciated. (Please be gentle, I have only been coding since February!)
Edit for clarity:
    df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn S'
    df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("English"), 'Col Start'] = 'Learn E'
    df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Math"), 'Col Start'] = 'Learn M'

    Col A                  Col Learn
    Learn Science, Math    Learn S, Learn M
    Learn Math, English    Learn M, Learn E
    Learn Science          Learn S
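In my DataFrame, the values in "Col A" and "interest" may overlap and produce recurring outputs. What I want is to capture all of them: rather than overwriting, any new match should be appended with a comma.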
Best answer
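I think you need `findall` to extract all of the values, a list comprehension to prepend the string to each of them, and `join` to combine the results.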
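To illustrate why `findall` is suggested here, the following minimal sketch compares it with `str.extract` on a small made-up Series (the two sample strings are assumptions standing in for `col A`; the pattern is built the same way as in the question). `str.extract` keeps only the first match per row, while `str.findall` returns every match as a list that can then be joined.

```python
import pandas as pd

# Hypothetical two-row sample standing in for df['col A']
s = pd.Series(["Learn Science, Math, Biology.", "Go English"])

mapper = ["English", "Math", "Maths", "Mathematics", "Biology", "Science"]
pat = "|".join(r"\b{}\b".format(x) for x in mapper)

# extract returns only the first match in each row
first_only = s.str.extract("(" + pat + ")", expand=False)

# findall returns a list with every match in each row
all_matches = s.str.findall(pat)

# Prepend a keyword to each match and join the results with commas
joined = all_matches.apply(lambda xs: ", ".join("Learn " + x for x in xs))

print(first_only.tolist())
print(all_matches.tolist())
print(joined.tolist())
```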
Edit:
    # It is better to use loc when setting values in a column
    df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn'

    # Find every interest in col A, prepend 'Learn ' to each, and join with commas
    df['new'] = df['col A'].str.findall('(' + pat + ')').apply(lambda x: ', '.join(['Learn ' + y for y in x]))
    print (df)

                               col A     interest  Interests  Col Start  \
    0              Learn English Lit      English    English        NaN
    1                 Go Mathematics  Mathematics      Maths        NaN
    2                  Start Science      Science    Science        NaN
    3                  Learn Science      Science    Science      Learn
    4                     Go English      English    English        NaN
    5                     Start Math         Math       Math        NaN
    6                     Learn Math         Math       Math        NaN
    7                     Go Biology      Biology    Science        NaN
    8  Learn Science, Math, Biology.      Science    Science      Learn

                                            new
    0                             Learn English
    1                         Learn Mathematics
    2                             Learn Science
    3                             Learn Science
    4                             Learn English
    5                                Learn Math
    6                                Learn Math
    7                             Learn Biology
    8  Learn Science, Learn Math, Learn Biology
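The code above always prepends the fixed string 'Learn '. If each row should instead carry its own keyword together with an abbreviated interest (the 'Learn S, Learn M' shape from the question's edit), one possible variation is sketched below. It is not part of the original answer: the `keywords` list, the `engagement` helper, the `engagement` column name, and the sample rows are all hypothetical, and taking the first letter of each interest is only an assumption about how the abbreviations are formed.

```python
import re
import pandas as pd

# Hypothetical sample rows modelled on the question's examples
df = pd.DataFrame({"col A": ["Learn Science, Math", "Learn Math, English",
                             "Go Biology", "No match here"]})

keywords = ["Learn", "Go", "Start"]  # assumed list of engagement keywords
interests = ["English", "Math", "Maths", "Mathematics", "Biology", "Science"]

kw_pat = re.compile("|".join(r"\b{}\b".format(k) for k in keywords))
int_pat = re.compile("|".join(r"\b{}\b".format(i) for i in interests))

def engagement(text):
    """Pair the row's first keyword with the initial of every interest found."""
    if not isinstance(text, str):
        return ""  # NaN or other non-string cells produce an empty result
    kw = kw_pat.search(text)
    if kw is None:
        return ""  # no keyword, so nothing to report
    # m[0] is the first letter of each matched interest name
    return ", ".join("{} {}".format(kw.group(), m[0]) for m in int_pat.findall(text))

df["engagement"] = df["col A"].apply(engagement)
print(df)
```

Because every match is collected before the column is assigned, nothing is overwritten: each row gets a single comma-separated string in one pass.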