Problem description
I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure.
In terms of getting counts of occurrences:
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
gives:
[[0 0 0]]
What I'm realizing is that the CountVectorizer breaks the corpus into what I believe are unigrams:
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
gives:
[[0 0 1]]
Is there any way to tell the CountVectorizer exactly how you'd like to vectorize the corpus? Ideally I would like an outcome along the lines of the first example.
In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['I want to run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
[[0 0 1]]
I don't see much information in the documentation for the fit_transform method, which only takes one argument as it is. If anyone has any ideas I would be grateful. Thanks!
Answer
The parameter you want is called ngram_range. You pass in a tuple (1, 2) to the constructor to get unigrams and bigrams. However, the vocabulary you pass in needs to be a dict with ngrams as keys and integers as values.
print(CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2},
                      ngram_range=(1, 2)).fit_transform(['I want to run away!']).A)
[[0 0 1]]
Note that the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken up into tokens, follow @BrenBarn's comment.
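For full control over tokenisation, one option (not from the original answer, just a sketch) is to pass a callable as CountVectorizer's analyzer parameter, which replaces the default preprocessing and tokenisation entirely. The phrase_analyzer function below is a hypothetical helper that matches whole vocabulary phrases, punctuation included:

```python
# Sketch: match whole vocabulary phrases (including 'run away!') by
# supplying a custom analyzer, bypassing the default token pattern.
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']

def phrase_analyzer(doc):
    # Emit each vocabulary phrase that appears verbatim in the document.
    # (Counts each phrase at most once; extend with str.count for
    # multiple occurrences.)
    return [phrase for phrase in vocabulary if phrase in doc]

cv = CountVectorizer(vocabulary=vocabulary, analyzer=phrase_analyzer)
X = cv.fit_transform(['I want to run away!'])
print(X.toarray())  # -> [[0 0 1]]
```

This gives the outcome asked for in the question's last example, at the cost of simple substring matching rather than real tokenisation.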