I am trying to use both counts and tfidf as features for a multinomial NB model. Here is my code:
text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
But I get an error involving FeatureUnion and TFIDF:
TypeError: no supported conversion for types: (dtype('S18413'),)
Any idea why this happens? Is it not possible to use both counts and tfidf as features?
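Best answer
The error does not come from FeatureUnion; it comes from TfidfTransformer. You should use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (a numpy array or sparse matrix) as input, not raw text, hence the TypeError.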
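To make the difference concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is an assumption for illustration, not data from the question). It shows that TfidfTransformer works once the text has been turned into counts, and that TfidfVectorizer should give the same matrix directly from raw text when both use default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only.
docs = [
    "free money offer now",
    "meeting schedule for monday",
    "free offer for the meeting",
]

# Two steps: raw text -> term counts -> tf-idf weights.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer(use_idf=True).fit_transform(counts)

# One step: TfidfVectorizer accepts raw text directly.
tfidf_one_step = TfidfVectorizer(use_idf=True).fit_transform(docs)

# Compare the two results element by element.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(tfidf_one_step.shape, same)
```

Inside a FeatureUnion every branch receives the same raw input, which is why the branch that produces tf-idf weights has to be a vectorizer that accepts text.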
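Also, your test sentences are too small for a tfidf test: with only two documents, no term can satisfy min_df=3, which requires a term to appear in at least three documents. Try a larger set of sentences. Here is an example: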
from nltk.corpus import brown
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
# Get more text from NLTK (requires the Brown corpus: run nltk.download("brown") once).
text = [" ".join(i) for i in brown.sents()[:100]]
# The labels are arbitrary and only serve to make the example run.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# Both vectorizers take raw text; FeatureUnion stacks their outputs column-wise.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
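If you also want to classify new text later, one possible variation (not part of the original answer) is to wrap the FeatureUnion and the classifier in a Pipeline, so the same fitted vectorizers are applied at prediction time. The sketch below uses a tiny invented corpus and invented labels so that it runs without NLTK; the names `train_text`, `train_labels`, and `model` are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy data, for illustration only.
train_text = [
    "win free money now",
    "free prize claim now",
    "cheap money offer free",
    "meeting agenda for monday",
    "project report due monday",
    "lunch meeting with the team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Each branch sees the raw text; the outputs are concatenated column-wise.
features = FeatureUnion([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfVectorizer(use_idf=True)),
])

# The pipeline fits the vectorizers and the classifier together.
model = Pipeline([("features", features), ("nb", MultinomialNB())])
model.fit(train_text, train_labels)

# Inspect the combined feature matrix and classify unseen text.
X = model.named_steps["features"].transform(train_text)
predictions = list(model.predict(["free money", "monday meeting"]))
print(X.shape, predictions)
```

With this layout, `model.predict` takes raw strings, so there is no need to call the vectorizers by hand on test data.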
For more on python - using counts and tfidf as features with scikit-learn, see the similar question on Stack Overflow: https://stackoverflow.com/questions/27260799/