Problem description
I have the following two dataframes, badges and comments. From the badges dataframe I have created a list of 'gold users', i.e. users whose Class=1.
Here Name means the 'Name of Badge' and Class means the level of the badge (1=Gold, 2=Silver, 3=Bronze).
I have already done the text preprocessing on comments['Text'] and now want to find the counts of the top 10 words used by gold users in comments['Text'].
I tried the code below but am getting this error:
KeyError: "None of [Index(['1532', '290', '1946', '1459', '6094', '766', '10446', '3106', '1',\n '1587',\n ...\n '35760', '45979', '113061', '35306', '104330', '40739', '4181', '58888',\n '2833', '58158'],\n dtype='object', length=1708)] are in the [index]"
Please provide me a way to fix this.
Note: I had some answers from datascience.stackexchange but they did not work. Link to StackExchange Problem
Dataframe 1 (badges)
Id | UserId | Name | Date |Class | TagBased
2 | 23 | Autobiographer | 2016-01-12T18:44:49.267 | 3 | False
3 | 22 | Autobiographer | 2016-01-12T18:44:49.267 | 3 | False
4 | 21 | Autobiographer | 2016-01-12T18:44:49.267 | 3 | False
5 | 20 | Autobiographer | 2016-01-12T18:44:49.267 | 3 | False
6 | 19 | Autobiographer | 2016-01-12T18:44:49.267 | 3 | False
Dataframe 2 (comments)
Id| Text | UserId
6| [2006, course, allen, knutsons, 2001, course, ... | 3
8| [also, theo, johnsonfreyd, note, mark, haimans... | 1
Code
#Classifying Users
df_gold_users = badges[(badges['Class'] == '1')]
df_silver_users = badges[(badges['Class'] != '1') & (badges['Class'] == '2') ]
df_bronze_users = badges[(badges['Class'] != '1') & (badges['Class'] != '2') & (badges['Class'] == '3')]
gold_users = df_gold_users['UserId'].value_counts().index
silver_users = df_silver_users['UserId'].value_counts().index
bronze_users = df_bronze_users['UserId'].value_counts().index
#Text Cleaning (clean_text function tokenizes and lemmatizes)
comments['Text'] = comments['Text'].apply(lambda x: clean_text(x))
#Getting comments made by Gold Users
for index,rows in comments.iterrows():
    gold_comments = rows[comments.Text.loc[gold_users]]
Counter(gold_comments)
Expected output
#Top 10 Words that appear the most in the comments made by gold users with their count.
[['scholar',20],['school',18],['bus',15],['class',14],['teacher',14],['bell',13],['time',12],['books',11],['bag',9],['student',7]]
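Recommended answer
import itertools
import pandas as pd

# Keep only gold badges (Class is compared as a string, as in the question)
df_gold_users = badges[badges['Class'] == '1']
# Attach each gold user's comments by joining on UserId
df = pd.merge(df_gold_users, comments, on='UserId')
# Flatten the per-comment token lists into one flat list of words
gold_text = list(itertools.chain.from_iterable(df['Text'].to_list()))
# Count occurrences of each word and sort, most frequent first
gold_text = list(map(lambda x: [x, 1], gold_text))
gold_text_df = pd.DataFrame(gold_text, columns=['Text', 'xyz'])
gold_text_df = gold_text_df.groupby('Text')['xyz'].count().reset_index().sort_values(by=['xyz'], ascending=False)
# Top 10 words with their counts, as a list of [word, count] pairs
gold_text_df.head(10).values.tolist()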
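The KeyError in the question comes from comments.Text.loc[gold_users]: .loc looks the user ids up as index labels of comments, not as values of the UserId column, so none of them are found. As an alternative to the merge, here is a minimal sketch that filters on the UserId column with isin and counts words with collections.Counter. The two small dataframes are made-up toy data (not the asker's), with ids stored as strings as the error message suggests. Because badges has one row per badge, a user with several gold badges would likely have their comments repeated in a merge; filtering with isin counts each comment once.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data; ids and Class are strings, as in the question.
badges = pd.DataFrame({
    "UserId": ["1", "2", "3", "1"],
    "Class": ["1", "3", "1", "1"],  # user "1" holds two gold badges
})
comments = pd.DataFrame({
    "Text": [["school", "bus"], ["bell"], ["school", "teacher", "school"], ["bag"]],
    "UserId": ["1", "2", "3", "4"],
})

# Distinct ids of users who hold at least one gold badge.
gold_users = badges.loc[badges["Class"] == "1", "UserId"].unique()

# Select comment rows by the values in the UserId column, not by index label.
gold_comments = comments.loc[comments["UserId"].isin(gold_users), "Text"]

# Flatten the token lists and count each word.
word_counts = Counter(word for tokens in gold_comments for word in tokens)

# Up to 10 most frequent words as [word, count] pairs.
top_words = [[word, count] for word, count in word_counts.most_common(10)]
print(top_words)
```

With this toy data the most frequent word is 'school' with a count of 3. On the real data, replace the two toy dataframes with the existing badges and comments.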