Problem description
I'm trying to do classification with Python. I'm using the Naive Bayes MultinomialNB classifier on web pages (I retrieve data from the web as text and later classify that text: web classification).
Now I'm trying to apply PCA to this data, but Python is giving me some errors.
My code for classification with Naive Bayes:
from sklearn.decomposition import PCA
# RandomizedPCA was removed from recent scikit-learn releases; PCA(svd_solver='randomized') replaces it
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)
x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)
After this Naive Bayes classification, x_train looks like this:
>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
with 6302 stored elements in Compressed Sparse Row format>
>>> print(x_train)
(0, 2966) 1
(0, 1974) 1
(0, 3296) 1
..
..
(42, 1629) 1
(42, 2833) 1
(42, 876) 1
Then I try to apply PCA to my data (temizdata):
>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA.fit_transform(v_temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)
But this results in an error.
I converted the matrix to a dense matrix or NumPy array, then tried to classify the new dense matrix, but I got an error.
My main aim is to test the effect of PCA on text classification.
Converting to a dense array:
v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)
Finally, I try to classify:
classifer.fit(pca_t,y_train)
This final classification step fails with an error.
On one side, my data (temizdata) goes into Naive Bayes only; on the other side, temizdata first goes through PCA (to reduce the inputs) and is then classified.
Recommended answer
Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)
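As a minimal sketch of the same call on vectorizer output: the toy corpus `docs` below is a hypothetical stand-in for temizdata, and `n_components=2` is chosen only because the toy vocabulary is tiny. It shows that the sparse count matrix goes straight into TruncatedSVD without any call to todense().

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for temizdata.
docs = [
    "python web scraping tutorial",
    "football match result today",
    "python machine learning library",
    "basketball match score today",
]

# CountVectorizer returns a scipy sparse matrix of term counts.
counts = CountVectorizer().fit_transform(docs)

# TruncatedSVD accepts the sparse matrix directly and returns a dense array
# with one row per document and n_components columns.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(counts)

print(issparse(counts), counts.shape)
print(type(reduced).__name__, reduced.shape)
```

The reduced array has as many rows as there are documents, so it lines up with the same label vector used for the unreduced data.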
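And, according to the TruncatedSVD documentation, it works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text (where it is known as latent semantic analysis), which is exactly your use case.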
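To compare the two setups from the question (Naive Bayes alone versus reduction followed by classification), a rough sketch could look like the following. The corpus, labels, and `new_docs` are made up for illustration. One assumption to flag: the reduced branch uses GaussianNB instead of MultinomialNB, because MultinomialNB rejects negative feature values and SVD components can be negative. That swap is my choice for the sketch, not something stated in the question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for temizdata and y_train.
docs = [
    "python web scraping tutorial",
    "python machine learning library",
    "python code library tutorial",
    "football match result today",
    "basketball match score today",
    "football score result today",
]
labels = ["tech", "tech", "tech", "sport", "sport", "sport"]

# Branch 1: counts -> MultinomialNB (no reduction), as in the question.
baseline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.01))
baseline.fit(docs, labels)

# Branch 2: counts -> TruncatedSVD -> GaussianNB.
# GaussianNB is an assumption here: it tolerates the negative values
# that the SVD projection can produce.
svd_model = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    GaussianNB(),
)
svd_model.fit(docs, labels)

new_docs = ["python library tutorial"]
print(baseline.predict(new_docs), svd_model.predict(new_docs))
```

Wrapping each branch in a pipeline keeps the vectorizer and the reduction fitted on the training data only, so both branches can be scored on the same held-out pages for a fair comparison.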