Question
I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
print(cv.vocabulary_)
{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}
I was expecting it to return {u'bird': 2, u'cat': 3, u'dog': 2, u'fish': 2}.
Recommended answer
cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that you've found and the values are their column indices in the count matrix, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)
You need to work with the cv_fit object to get the counts:
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
print(cv.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0
print(cv_fit.toarray())
#['bird' 'cat' 'dog' 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]
Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count for that particular word in that document. If you sum each column you'll get the totals you were expecting:
print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
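To get the word-to-count dict the question was expecting, the column sums can be paired back up with the words. The following is a minimal sketch, not the only way to do it: it reuses cv.vocabulary_ (word to column index) for the lookup, and the names totals and word_counts are just illustrative. Summing the sparse matrix directly avoids calling toarray() on a large corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# Sum each column of the sparse matrix; the result is 2-D, so flatten it.
totals = np.asarray(cv_fit.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index, so use it to look up totals.
word_counts = {word: int(totals[idx]) for word, idx in cv.vocabulary_.items()}

print(sorted(word_counts.items()))
# [('bird', 2), ('cat', 3), ('dog', 2), ('fish', 2)]
```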
Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
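For comparison, here is a rough sketch of the collections.Counter route. It assumes plain whitespace tokenization is good enough, which differs from CountVectorizer's defaults (those also strip punctuation and drop single-character tokens), so the two approaches can disagree on messier text.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Lowercase and split on whitespace (an assumption; swap in a real
# tokenizer if the text contains punctuation).
word_counts = Counter(word for text in texts for word in text.lower().split())

print(sorted(word_counts.items()))
# [('bird', 2), ('cat', 3), ('dog', 2), ('fish', 2)]
```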