Problem description
I have started using scikit-learn for text extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline and try to combine their output with new features (by concatenating matrices), I get a row dimension problem.
This is my pipeline:
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram_tfidf', Pipeline([('vect', CountVectorizer()),
                                  ('tfidf', TfidfTransformer())])),
        ('addned', AddNed())])),
    ('clf', SGDClassifier())])
This is my class AddNed, which adds 30 new features to each document (sample):
class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, **transform_params):
        do_something
        x_new_feat = np.array(list_feat)
        print(type(X))
        X_np = np.array(X)
        print(X_np.shape, x_new_feat.shape)
        return np.concatenate((X_np, x_new_feat), axis=1)

    def fit(self, X, y=None):
        return self
And the first part of my main program:
data = load_files('HO_without_tag')
grid_search = GridSearchCV(pipeline, parameters, n_jobs = 1, verbose = 20)
print(len(data.data), len(data.target))
grid_search.fit(X, Y).transform(X)
But I get this result:
486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV]feats__ngram_tfidf__vect__max_features=3000....
323
<class 'list'>
(323,) (486, 30)
And, of course, an IndexError exception:
return np.concatenate((X_np, x_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1)
Why is the X passed to the transform function (in the AddNed class) not a NumPy array of shape (486, 3000)? I only get shape (323,). I don't understand, because if I remove the FeatureUnion and AddNed() from the pipeline, CountVectorizer and TfidfTransformer work properly, with the right features and the right shape. Does anyone have an idea? Thanks a lot.
Recommended answer
You've probably solved it by now, but someone else may run into the same problem:
(323, 3000) # X shape Matrix
<class 'scipy.sparse.csr.csr_matrix'>
AddNed tries to concatenate a matrix with a sparse matrix; the sparse matrix should be converted to a dense matrix first. I ran into the same error when trying to use the result of CountVectorizer.
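The conversion described above might look roughly like the following. This is only a minimal sketch, not the asker's actual code: the small `X_tfidf` matrix and the `x_new_feat` array are made-up stand-ins for the TF-IDF output and the 30 extra features, and the `scipy.sparse.hstack` variant is an alternative not mentioned in the original answer.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the TF-IDF output: a sparse CSR matrix with
# 4 rows (documents) and 5 columns (terms). The values are made up.
X_tfidf = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]))

# Hypothetical stand-in for the extra features: a dense array with one row
# per document (3 columns here instead of the 30 from the question).
x_new_feat = np.arange(12, dtype=float).reshape(4, 3)

# Option 1: convert the sparse matrix to dense, then concatenate column-wise.
combined_dense = np.concatenate((X_tfidf.toarray(), x_new_feat), axis=1)

# Option 2 (assumption, not from the answer): stay sparse by stacking with
# scipy.sparse.hstack, which avoids materializing a large dense matrix.
combined_sparse = sparse.hstack([X_tfidf, sparse.csr_matrix(x_new_feat)]).tocsr()

print(combined_dense.shape)
print(combined_sparse.shape)
```

Both options require the two inputs to have the same number of rows. Note also that FeatureUnion applies each of its transformers to the same input and concatenates their outputs itself, so a transformer such as AddNed would normally return only its own feature columns, one row per element of the X it receives. The shapes printed in the question, (323,) for X versus (486, 30) for the new features, suggest that the extra features were built from the full dataset rather than from the cross-validation fold passed to transform.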