I am currently working with a dataframe like this:

 words:                               other:   category:
 hello, jim, you, you , jim            val1      movie
 it, seems, bye, limb, pat, paddy      val2      movie
 how, are, you, are , kim              val1      television
 ......
 ......


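I am trying to compute the 10 most frequent words and bigrams for each category in the "category" column. However, I want to compute the bigrams row by row before grouping them into their respective categories.

My problem is that if I group by category first and then compute the 10 most frequent bigrams, the last word of the first row gets paired with the first word of the second row.

The bigrams should look like this: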

 (hello, jim), (jim, you), (you, you), (you, jim)
 (it, seems), (seems,bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)


Whereas if I group before computing the bigrams, the bigrams would be:

 (hello, jim), (jim, you), (you, you), (you, jim), (jim, it), (it, seems), (seems,bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)


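What is the best way to do this using pandas?

Sorry if my question is unnecessarily complicated; I just wanted to include all the details. Please let me know if you have any questions.

Best answer

Example dataframe: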

                                   words other    category
0             hello, jim, you, you , jim  val1       movie
1  it, seems, bye, limb, pat, hello, jim  val2       movie
2               how, are, you, are , kim  val1  television


Here is one way to compute the bigrams for each row using pandas and .iterrows():

bigrams = []
for idx, row in df.iterrows():
    lst = row['words'].split(',')
    bigrams.append([(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)])

print(bigrams)


[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')],
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')],
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]


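Here is a more efficient way using pandas and .apply:

def bigram(row):
    # Split on the bare comma and strip each token, so "a,b" and "a , b" both work
    lst = row['words'].split(',')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

# Apply row by row so bigrams never span two rows
bigrams = df.apply(bigram, axis=1)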

print(bigrams.tolist())


[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')],
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')],
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]
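As a variation on the two approaches above, here is a minimal sketch that works on the `words` column alone and pairs each token with its successor using `zip`. The helper name `row_bigrams` is made up for illustration, and the dataframe is rebuilt from the example above so the snippet runs on its own.

```python
import pandas as pd

# Example dataframe, copied from the answer above
df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})

def row_bigrams(text):
    """Hypothetical helper: return the bigrams of one comma-separated string."""
    tokens = [t.strip() for t in text.split(",")]
    # zip pairs each token with the one after it; nothing crosses into other rows
    return list(zip(tokens, tokens[1:]))

bigrams = df["words"].apply(row_bigrams)
print(bigrams.tolist())
```

Because each row is processed independently, a pair such as ('jim', 'it') that would join the end of one row to the start of the next cannot appear.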


Then you can group the data by category and find the 10 most common bigrams. Here is an example of finding the most common bigrams by category:

df['bigrams'] = bigrams
df2 = df.groupby('category').agg({'bigrams': 'sum'})

# Compute the most frequent bigrams by category
from collections import Counter
df3 = df2.bigrams.apply(lambda row: Counter(row)).to_frame()


Dictionary of bigram frequencies by category:

print(df3)

                                                      bigrams
category
movie       {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television  {('how', 'are'): 1, ('are', 'you'): 1, ('you',...


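# Keep just the top 3 most frequent bigrams (or 10 if you have enough data).
# Counter.most_common sorts by count; slicing list(row) would only return the first keys seen.
df3.bigrams.apply(lambda row: [bg for bg, count in row.most_common(3)])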


category
movie         [(hello, jim), (jim, you), (you, you)]
television      [(how, are), (are, you), (you, are)]
Name: bigrams, dtype: object
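The question also asks for the most frequent single words. The following is a rough sketch, not part of the original answer, that counts both words and bigrams per category with `collections.Counter` and keeps the counts alongside each item. The helper names `tokens`, `bigrams_of`, and `top_n_by_category` are hypothetical, and `n=3` is used only because the example data is small.

```python
from collections import Counter

import pandas as pd

# Example dataframe, copied from the answer above
df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})

def tokens(text):
    """Split one comma-separated string into stripped words."""
    return [t.strip() for t in text.split(",")]

def bigrams_of(text):
    """Bigrams of a single row; pairs never span two rows."""
    toks = tokens(text)
    return list(zip(toks, toks[1:]))

def top_n_by_category(frame, extract, n=10):
    """Return {category: [(item, count), ...]} with the n most frequent items."""
    result = {}
    for category, group in frame.groupby("category"):
        counts = Counter()
        for text in group["words"]:
            # Items are extracted per row first, then counted across the category
            counts.update(extract(text))
        result[category] = counts.most_common(n)
    return result

top_words = top_n_by_category(df, tokens, n=3)
top_bigrams = top_n_by_category(df, bigrams_of, n=3)
print(top_words)
print(top_bigrams)
```

Updating one Counter per category avoids building a single long list with 'sum' before counting, which may matter on larger data.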

Regarding python - how to group by category in pandas and get the most common words and bigrams, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55348500/
