How to get word counts from a DataFrame based on a condition

Problem description

I have the following two dataframes, badges and comments. From the badges dataframe I have created a list of 'gold users', that is, users with a badge whose Class=1.

Here Name is the name of the badge and Class is the level of the badge (1=Gold, 2=Silver, 3=Bronze).

I have already done the text preprocessing on comments['Text'] and now want to find the 10 most frequent words, with their counts, in the comments['Text'] entries written by gold users.

I tried the code below, but I get this error:
KeyError: "None of [Index(['1532', '290', '1946', '1459', '6094', '766', '10446', '3106', '1',\n '1587',\n ...\n '35760', '45979', '113061', '35306', '104330', '40739', '4181', '58888',\n '2833', '58158'],\n dtype='object', length=1708)] are in the [index]"
Please suggest a way to fix this.

Note: I received some answers on datascience.stackexchange, but they did not work. Link to StackExchange Problem

Dataframe 1 (badges)

   Id | UserId |  Name          |        Date              |Class | TagBased
   2  | 23     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   3  | 22     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   4  | 21     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   5  | 20     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   6  | 19     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False

Dataframe 2 (comments)

   Id|                    Text                             |    UserId
    6|  [2006, course, allen, knutsons, 2001, course, ...  |    3
    8|  [also, theo, johnsonfreyd, note, mark, haimans...  |    1

Code

#Classifying Users
df_gold_users = badges[(badges['Class'] == '1')]
df_silver_users = badges[(badges['Class'] != '1') & (badges['Class'] == '2') ]
df_bronze_users = badges[(badges['Class'] != '1') & (badges['Class'] != '2') & (badges['Class'] == '3')]

gold_users = df_gold_users['UserId'].value_counts().index
silver_users = df_silver_users['UserId'].value_counts().index
bronze_users = df_bronze_users['UserId'].value_counts().index

#Text Cleaning (clean_text function tokenizes and lemmatizes)
comments['Text'] = comments['Text'].apply(lambda x: clean_text(x))

#Getting comments made by Gold Users
for index,rows in comments.iterrows():
  gold_comments = rows[comments.Text.loc[gold_users]]
  Counter(gold_comments)

Expected output

#Top 10 Words that appear the most in the comments made by gold users with their count.
 [['scholar',20],['school',18],['bus',15],['class',14],['teacher',14],['bell',13],['time',12],['books',11],['bag',9],['student',7]]


Recommended answer
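One possible way to approach this, offered as a sketch and not as the original answer: the KeyError arises because comments.Text.loc[gold_users] treats the user ids as row labels of comments, while the ids actually live in the UserId column. Filtering on that column with isin and counting the flattened token lists avoids the lookup. The two small dataframes below are made-up stand-ins for the real data, and the names gold_comments, word_counts and top_words are illustrative only. The sketch also assumes Class and UserId are stored as strings, as the error message suggests.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Toy stand-ins for the real data (hypothetical values, not from the question).
badges = pd.DataFrame({
    "UserId": ["1", "3", "3", "7"],
    "Class": ["1", "1", "2", "3"],
})
comments = pd.DataFrame({
    "Id": [6, 8, 9, 10],
    "Text": [
        ["course", "school", "course"],
        ["school", "teacher"],
        ["bus", "school"],
        ["bell", "bus"],
    ],
    "UserId": ["3", "1", "7", "3"],
})

# Unique ids of users holding at least one gold badge
# (assumption: Class is stored as a string, as in the question's code).
gold_users = badges.loc[badges["Class"] == "1", "UserId"].unique()

# Select rows by the values in the UserId column, not by index label.
gold_comments = comments[comments["UserId"].isin(gold_users)]

# Flatten the per-comment token lists and count each word.
word_counts = Counter(chain.from_iterable(gold_comments["Text"]))

# The ten most frequent words with their counts, as (word, count) tuples.
top_words = word_counts.most_common(10)
print(top_words)
```

If the UserId columns of the two dataframes have different dtypes (for example strings in badges and integers in comments), isin matches nothing, so it may be necessary to cast both with astype(str) first.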