Question
My understanding was that PCA can be performed only on continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across the following post:
When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?
It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is being applied to categorical features. This confuses me, so I would appreciate some advice.
Recommended answer
I disagree with the others.
While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good idea, or that it will work very well.
PCA is designed for continuous variables. It tries to minimize squared deviations (the squared reconstruction error, which is equivalent to maximizing the variance retained by the projection). The concept of squared deviations breaks down when you have binary variables.
So yes, you can use PCA. And yes, you get an output. It is even a least-squares output: it's not as if PCA would segfault on such data. It works, but it is much less meaningful than you'd want it to be, and probably less meaningful than, for example, frequent pattern mining.
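To make the point about squared deviations concrete, here is a minimal sketch in plain Python (no libraries) that one-hot encodes a single categorical feature and computes the covariance matrix that PCA would then diagonalize. The toy labels and the helper names `one_hot` and `covariance` are made up for illustration and are not from the linked post. For 0/1 columns, each variance is just p(1 - p) and each covariance between two columns of the same feature is -p_i * p_j, so the matrix only restates the category frequencies.

```python
def one_hot(labels):
    """One-hot encode a list of labels (hypothetical helper for this sketch)."""
    categories = sorted(set(labels))
    rows = [[1.0 if label == c else 0.0 for c in categories] for label in labels]
    return categories, rows


def covariance(rows):
    """Column means and population covariance matrix (divides by n)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]
    return means, cov


# Toy data (an assumption for illustration): one feature with three categories.
labels = ["red"] * 5 + ["green"] * 3 + ["blue"] * 2
categories, rows = one_hot(labels)
means, cov = covariance(rows)

for i, ci in enumerate(categories):
    p = means[i]  # the column mean is the category frequency
    print(f"{ci}: frequency={p:.2f}  variance={cov[i][i]:.4f}  p*(1-p)={p * (1 - p):.4f}")

for i in range(len(categories)):
    for j in range(i + 1, len(categories)):
        print(f"cov({categories[i]}, {categories[j]})={cov[i][j]:.4f}  "
              f"-p_i*p_j={-means[i] * means[j]:.4f}")

# Each row of the covariance matrix sums to zero because the one-hot columns
# always add up to 1, so one direction carries no variance at all.
row_sums = [sum(row) for row in cov]
print("row sums:", [round(s, 12) for s in row_sums])
```

In other words, the principal components of such data are determined by how often each category occurs, not by any notion of distance between categories. In practice you would use a library such as NumPy or scikit-learn rather than hand-written loops; the loops here only keep the sketch dependency-free.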