This article describes how to count the frequency of words in a pandas DataFrame.
Problem description
I have a table like the one below:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
I want to count the frequency of all the words in the Firm_Name column, so that the output lists each word alongside how often it occurs.
I tried the following code:
import pandas as pd
import nltk

# Raw string so the backslash in the Windows path is not read as an escape.
data = pd.read_csv(r"X:\Firm_Data.csv")
top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
However, the code above does not produce a count of the unique words.
Recommended answer
IIUC, use value_counts():
In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society 3
Ltd 2
James's 1
R.X. 1
Yah 1
Associates 1
St 1
Kensington 1
MMV 1
Big 1
& 1
The 1
Co 1
Oil 1
Building 1
dtype: int64
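The one-liner above can be unpacked step by step. The following is a minimal sketch that rebuilds the sample table by hand (the question's CSV file is not available, so the DataFrame literal is an assumption) and names each intermediate result; the variable names `wide`, `words`, and `counts` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; the real data comes from a CSV file.
df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Split on whitespace into one column per word position.
# Names with fewer words are padded with missing values.
wide = df["Firm_Name"].str.split(expand=True)

# Reshape the word columns into a single long Series of words.
words = wide.stack()

# Count each distinct word; value_counts() ignores missing values by default,
# so any padding that survives the reshape does not affect the result.
counts = words.value_counts()

print(counts)
```

Because the split is on whitespace only, tokens such as "&" and "R.X." are counted as words in their own right, which matches the output shown above.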
Or, with NumPy imported as np:
pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
Or,
pd.Series(' '.join(df.Firm_Name).split()).value_counts()
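As a further variation that is not part of the original answer, newer pandas versions (0.25 and later) offer `Series.explode()`, which turns each list of words into separate rows. The sketch below compares it with the join-and-split approach on the sample data; the DataFrame literal and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Sample firm names copied from the question.
df = pd.DataFrame({
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Approach from the answer: join every name into one string, then split it.
joined = pd.Series(" ".join(df["Firm_Name"]).split()).value_counts()

# Alternative: split each name into a list, then give every word its own row.
exploded = df["Firm_Name"].str.split().explode().value_counts()

# Both approaches should agree on every word's count.
same = joined.to_dict() == exploded.to_dict()
print(same)
```

One practical difference: `' '.join(...)` raises a TypeError if the column contains missing values, whereas the `str.split()` route passes them through and `value_counts()` then skips them.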
For the top N, for example 3:
In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society 3
Ltd 2
James's 1
dtype: int64
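The question asked for a table with Word and Frequency columns. One possible way to get there from the counts is sketched below; `top_n` and `result` are illustrative names, and the sample DataFrame is rebuilt by hand as an assumption.

```python
import pandas as pd

# Sample firm names copied from the question.
df = pd.DataFrame({
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

top_n = 3  # how many of the most frequent words to keep

counts = pd.Series(" ".join(df["Firm_Name"]).split()).value_counts()

# value_counts() sorts by frequency in descending order, so head() keeps the top rows.
# rename_axis() names the index, and reset_index() turns it into a regular column.
result = (
    counts.head(top_n)
    .rename_axis("Word")
    .reset_index(name="Frequency")
)

print(result)
```

When several words share the same count, as the many words with a frequency of 1 do here, which of them fills the last slot may vary.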
Details
In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
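For context on why the attempt in the question fell short: it passes the column straight to `nltk.FreqDist`, which counts whatever items it iterates over, and iterating a Series yields whole cell values. The sketch below uses `collections.Counter` as a stand-in (FreqDist is a Counter subclass) so it runs without nltk; the sample DataFrame and the variable names are assumptions for illustration.

```python
from collections import Counter

import pandas as pd

# Sample firm names copied from the question.
df = pd.DataFrame({
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Counting the column directly treats each full firm name as a single item.
name_counts = Counter(df["Firm_Name"])

# Splitting each name on whitespace first yields individual words to count.
word_counts = Counter(
    word for name in df["Firm_Name"] for word in name.split()
)

print(name_counts.most_common(2))
print(word_counts.most_common(2))
```

The same idea should carry over to the original code by feeding FreqDist the split words rather than the raw column.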