How to get word frequencies in a corpus using Scikit Learn CountVectorizer?

Question

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

print(cv.vocabulary_)
# Output (key order may differ):
# {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}

I was expecting it to return {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}.

Recommended answer

cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that you've found and the values are indices, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)
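To make the "values are indices" point concrete, here is a minimal sketch. The names words_by_index and shifted, and the extra "ant ant ant" document, are made up for illustration. It sorts the words by their vocabulary_ value and then shows that the value for a word changes when the vocabulary changes, even though the word's count does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv.fit(texts)

# Ordering the words by their vocabulary_ value gives the column order
# of the document-term matrix (alphabetical for a learned vocabulary).
words_by_index = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(words_by_index)  # ['bird', 'cat', 'dog', 'fish']

# Hypothetical extra document: "ant" sorts before "bird", so the index
# of "bird" moves from 0 to 1 although "bird" still occurs twice.
cv_extra = CountVectorizer().fit(texts + ["ant ant ant"])
shifted = int(cv_extra.vocabulary_["bird"])
print(shifted)  # 1
```

So a value in vocabulary_ says where a word's column is, not how often the word occurs.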

You need to work with the cv_fit object (the document-term matrix returned by fit_transform) to get the counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2;
# on releases older than 1.0, call cv.get_feature_names() instead.
print(cv.get_feature_names_out())
print(cv_fit.toarray())
# ['bird' 'cat' 'dog' 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]

Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. If you sum each column, you get the corpus-wide counts you were expecting:

print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
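To get from those column sums to the word-to-count dict the question was expecting, one option is to pair each word with its column total. This is a sketch rather than the only way to do it: the names totals and word_counts are illustrative, and it sums the sparse matrix directly instead of calling toarray(), which avoids building a dense copy for a large corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# Summing the sparse matrix over axis 0 gives a 1 x n_features result;
# np.asarray(...).ravel() flattens it to a plain 1-D array.
totals = np.asarray(cv_fit.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index, so use it to look up
# the matching total for every word.
word_counts = {word: int(totals[idx]) for word, idx in cv.vocabulary_.items()}
print(sorted(word_counts.items()))
# [('bird', 2), ('cat', 3), ('dog', 2), ('fish', 2)]
```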

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
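For comparison, a minimal collections.Counter version might look like the sketch below. It assumes plain whitespace tokenization with lowercasing, which is not identical to CountVectorizer's defaults (the default token pattern also drops single-character tokens and splits on punctuation), but it gives the same result for this corpus.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Lowercase each document, split it on whitespace, and count every token.
word_counts = Counter(word for text in texts for word in text.lower().split())
print(sorted(word_counts.items()))
# [('bird', 2), ('cat', 3), ('dog', 2), ('fish', 2)]
```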
