CountVectorizer: "I" not showing up in vectorized text

Question

I'm new to scikit-learn, and currently studying Naïve Bayes (Multinomial). Right now, I'm vectorizing text with CountVectorizer from sklearn.feature_extraction.text, and for some reason, when I vectorize some text, the word "I" doesn't show up in the output array.

Code:

from sklearn.feature_extraction.text import CountVectorizer

x_train = ['I am a Nigerian hacker', 'I like puppies']

# convert x_train to vectorized text
vectorizer_train = CountVectorizer(min_df=0)
vectorizer_train.fit(x_train)
x_train_array = vectorizer_train.transform(x_train).toarray()

# print vectorized text, feature names
print(x_train_array)
print(vectorizer_train.get_feature_names())

Output:

[[1 1 0 1 0]
 [0 0 1 0 1]]
[u'am', u'hacker', u'like', u'nigerian', u'puppies']

Why doesn't "I" seem to show up in the feature names? When I change it to "Ia" or something else like that, it does show up.

Recommended answer

This is caused by the default token_pattern for CountVectorizer, which only matches tokens of two or more word characters, so single-character tokens such as "I" are dropped:

>>> vectorizer_train
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> import re
>>> pattern = re.compile(vectorizer_train.token_pattern, re.UNICODE)
>>> print(pattern.match("I"))
None

To retain "I", use a different pattern, e.g.

>>> vectorizer_train = CountVectorizer(min_df=0, token_pattern=r"\b\w+\b")
>>> vectorizer_train.fit(x_train)
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\b\\w+\\b', tokenizer=None,
        vocabulary=None)
>>> vectorizer_train.get_feature_names()
[u'a', u'am', u'hacker', u'i', u'like', u'nigerian', u'puppies']
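
The difference between the two patterns can be checked without scikit-learn at all. The following is a minimal sketch using only the standard re module; the tokenize helper is a hypothetical stand-in that lowercases the text and collects regex matches, which is roughly what the word analyzer does, not the library's actual implementation.

```python
import re

# Default token pattern, as shown in the CountVectorizer repr:
# two or more word characters between word boundaries.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Relaxed pattern from the answer: one or more word characters.
RELAXED_PATTERN = re.compile(r"(?u)\b\w+\b")


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, then return all matches."""
    return pattern.findall(text.lower())


sentence = "I am a Nigerian hacker"
default_tokens = tokenize(sentence, DEFAULT_PATTERN)
relaxed_tokens = tokenize(sentence, RELAXED_PATTERN)

print(default_tokens)  # single-character words are missing
print(relaxed_tokens)  # "i" and "a" are both kept
```

With the default pattern, both "i" and "a" fail to match because each is only one character long; the relaxed pattern keeps them.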

Note that the non-informative word "a" is now also retained.
