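I'm new to Python programming. I want to get a count of each word in this Wikipedia dataset (people_wiki.csv). I can get the count for every word, and it comes out as a dictionary, but I can't split the dictionary's key-value pairs into separate columns. I've tried several approaches (from_dict, from_records, to_frame, pivot_table, etc.). Is this possible in Python? Any help would be appreciated.

Sample dataset: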

URI                                           name             text

<http://dbpedia.org/resource/George_Clooney>  George Clooney   'george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his...'


I tried:

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])


I also tried:

clooney['word_count'].to_frame()


Here is my code:

import pandas as pd
from collections import Counter

people = pd.read_csv("people_wiki.csv")
clooney = people[people['name'] == 'George Clooney']

clooney['word_count'] = clooney['text'].apply(lambda x: Counter(x.split(' ')))

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])
clooney_word_count_table


Output:

       word_count
35817   {'george': 1, 'timothy': 1, 'clooney': 9, 'ii': ...


I'd like clooney_word_count_table to be a DataFrame with 2 columns, like this:

word      count
normalize  1
george     3
combat     1
producer   2

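Best answer

The problem is that clooney is a DataFrame (containing a single row with index 35817), so clooney['word_count'] is a Series holding a single value (your dictionary of counts) at index 35817.

DataFrame.from_dict then treats this Series as equivalent to {35817: {'george': 1, ...}}, which gives you the confusing result.

Adapting this to your example, and assuming you want to produce a combined word count across many entries: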
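To see this without the CSV file, here is a minimal sketch built on a made-up one-row DataFrame (the text and the index 35817 are stand-ins, not the real dataset). It shows that the column is a Series holding one Counter, and that pulling out that single value with .iloc[0] gives a plain mapping whose items can be laid out as two columns. It is one possible way to do it, not the only one.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the single George Clooney row in people_wiki.csv
clooney = pd.DataFrame(
    {"name": ["George Clooney"], "text": ["george clooney actor george"]},
    index=[35817],
)
clooney["word_count"] = clooney["text"].apply(lambda text: Counter(text.split()))

# The column is a Series with exactly one element: the whole Counter
word_count_series = clooney["word_count"]
print(type(word_count_series).__name__, len(word_count_series))

# Extract the single Counter, then build a two-column DataFrame from its items
single_counter = word_count_series.iloc[0]
word_count_table = pd.DataFrame(
    list(single_counter.items()), columns=["word", "count"]
)
print(word_count_table)
```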

from collections import Counter
import pandas as pd

# Load the wikipedia entries and select the ones we care about
people = pd.read_csv("people_wiki.csv")
people_to_process = people[people['name'] == 'George Clooney']

# Compute the counts for these entries
counts = Counter()
people_to_process['text'].apply(lambda text: counts.update(text.split(' ')))

# Transform the counter into a DataFrame
count_table = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
count_table
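With from_dict and orient='index', the words end up as the index rather than as a column. If you want word and count as two regular columns, as in the desired output in the question, one option is to name the index and reset it. The sketch below assumes a small inline DataFrame in place of the CSV (the names and texts are hypothetical). It also uses a plain loop instead of apply, and split() with no argument so that repeated whitespace does not produce empty "words".

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for rows loaded from people_wiki.csv
people_to_process = pd.DataFrame(
    {
        "name": ["Person A", "Person B"],
        "text": ["actor writer actor", "writer director"],
    }
)

# Accumulate counts across all selected rows
counts = Counter()
for text in people_to_process["text"]:
    counts.update(text.split())

# Words start out as the index; move them into a regular 'word' column
count_table = (
    pd.DataFrame.from_dict(counts, orient="index", columns=["count"])
    .rename_axis("word")
    .reset_index()
)
print(count_table)
```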
