Is it possible to apply PCA to any text classification?

Question

I'm trying to do classification with Python. I'm using the Naive Bayes MultinomialNB classifier on web pages (I retrieve data from the web as text and later classify that text: web classification).

Now I'm trying to apply PCA to this data, but Python is giving some errors.

My code for classification with Naive Bayes:

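# PCA lives in sklearn.decomposition, not in the top-level sklearn package.
# RandomizedPCA was removed in scikit-learn 0.20; PCA(svd_solver="randomized") replaces it.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# temizdata holds the text retrieved from the web pages, y_train the class labels
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)

# Build the sparse term-count matrix and train the classifier on it
x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)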

The Naive Bayes classification step gives this output:

>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
    with 6302 stored elements in Compressed Sparse Row format>

>>> print(x_train)
(0, 2966)   1
(0, 1974)   1
(0, 3296)   1
..
..
(42, 1629)  1
(42, 2833)  1
(42, 876)   1
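For readers unfamiliar with this printout: each `(row, column)  value` line is one stored entry of the sparse term-count matrix, where the row is the document index and the column is the vocabulary index. A minimal sketch of how such entries map back to words, assuming a hypothetical three-document corpus in place of temizdata:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for temizdata (the real data is not shown in the question)
docs = ["red apple pie", "green apple", "red car"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(docs)  # SciPy sparse matrix of term counts

# vocabulary_ maps term -> column index; invert it to look terms up by column
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

# Walk the stored (non-zero) entries as (document index, term, count)
coo = x.tocoo()
entries = [
    (int(row), index_to_term[int(col)], int(value))
    for row, col, value in zip(coo.row, coo.col, coo.data)
]

print(x.shape, "with", x.nnz, "stored elements")
for entry in entries:
    print(entry)
```

Only the non-zero counts are stored, which is why a 43x4429 matrix can hold just 6302 elements.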

Then I try to apply PCA to my data (temizdata):

>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA.fit_transform(v_temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)

But this raises an error.

I converted the matrix to a dense matrix or NumPy array. Then I tried to classify the new dense matrix, but I get an error.

My main aim is to test the effect of PCA on text classification.

Converting to a dense array:

v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, I try to classify:

classifer.fit(pca_t,y_train)

This final classification step fails with an error.

On one side, my data (temizdata) is put into Naive Bayes only; on the other side, temizdata is first put through PCA (to reduce the inputs) and then classified.

Recommended answer

Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

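from sklearn.decomposition import TruncatedSVD

# data is the sparse matrix returned by the vectorizer
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)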
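As a self-contained illustration of the snippet above, here is a minimal sketch that runs TruncatedSVD directly on a sparse count matrix. The four-document corpus and `n_components=2` are assumptions chosen to keep the example tiny; they are not values from the question.

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for temizdata
docs = [
    "cheap flights and hotel deals",
    "book hotel rooms and flights online",
    "football match results and scores",
    "live football scores today",
]

counts = CountVectorizer().fit_transform(docs)  # stays sparse, no .todense() needed

# n_components must not exceed the vocabulary size
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(counts)  # dense array of shape (n_documents, 2)

print("input is sparse:", issparse(counts))
print(counts.shape, "->", reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_)
```

Unlike PCA, TruncatedSVD does not center the data before decomposing it, which is what lets it operate on the sparse matrix as-is.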

And, citing from the TruncatedSVD documentation:

Which is exactly your use case.
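One caveat when comparing the two setups from the question: MultinomialNB expects non-negative features, while the output of TruncatedSVD (like that of PCA) generally contains negative values, so fitting MultinomialNB on the reduced data raises a ValueError. This may be the cause of the final error in the question, although the error text is not shown there. The sketch below compares a baseline against a reduced pipeline; swapping in GaussianNB for the reduced features is an assumption made here, not part of the original answer, and the corpus and labels are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for temizdata and y_train
docs = [
    "cheap flights and hotel deals",
    "book hotel rooms and flights online",
    "holiday flights hotel offers",
    "football match results and scores",
    "live football scores today",
    "basketball match scores and results",
]
labels = ["travel", "travel", "travel", "sport", "sport", "sport"]

# Side 1: raw term counts straight into MultinomialNB
baseline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.01))
baseline.fit(docs, labels)

# Side 2: term counts -> TruncatedSVD -> a classifier that accepts negative features
reduced = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    GaussianNB(),
)
reduced.fit(docs, labels)

baseline_pred = baseline.predict(docs)
reduced_pred = reduced.predict(docs)

# Show why MultinomialNB cannot be used directly on the reduced features
counts = CountVectorizer().fit_transform(docs)
svd_features = TruncatedSVD(n_components=2, random_state=42).fit_transform(counts)
try:
    MultinomialNB().fit(svd_features, labels)
    nb_error = None
except ValueError as exc:
    nb_error = str(exc)

print("baseline predictions:", list(baseline_pred))
print("reduced predictions: ", list(reduced_pred))
print("MultinomialNB on SVD output:", nb_error)
```

To measure the effect of the reduction rather than just run both sides, the same two pipelines could be scored on held-out data, for example with `sklearn.model_selection.cross_val_score`.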

