Problem description
I have a regression model in which the dependent variable is continuous, but 90% of the independent variables are categorical (both ordered and unordered) and around 30% of the records have missing values. To make matters worse, the values are missing at random without any pattern, so that more than 45% of the data have at least one missing value. There is no a priori theory for choosing the model specification, so one of the key tasks is dimension reduction before running the regression. While I am aware of several dimension reduction methods for continuous variables, I am not aware of a similar statistical literature for categorical data (except, perhaps, as part of correspondence analysis, which is basically a variation of principal component analysis on a frequency table). Let me also add that the dataset is of moderate size: 500,000 observations with 200 variables. I have two questions.
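- Is there a good statistical reference for dimension reduction of categorical data together with robust imputation? (I think the first issue is imputation, and dimension reduction comes after.)
- This one concerns the implementation of the above. I have used R extensively and tend to rely heavily on the transcan and impute functions for continuous variables, and on a variation of the tree method to impute categorical values. I have a working knowledge of Python, so if there is something good out there for this purpose I will use it. Any implementation pointers in Python or R would be of great help. Thank you.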
Recommended answer
Regarding imputation of categorical data, I would suggest checking out the mice package. Also take a look at this presentation, which explains how it imputes multivariate categorical data. Another package for multiple imputation of incomplete multivariate data is Amelia, which has some limited capacity to deal with ordinal and nominal variables.
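Since the question also asks about Python, here is a rough sketch of the chained (column-by-column) idea behind this kind of imputation. To be clear, this is not the mice algorithm: mice fits a regression model per variable, whereas this sketch swaps in a much simpler nearest-neighbour hot-deck step using Hamming distance. The function name `chained_hot_deck`, the `MISSING` sentinel and the toy data are all made up for illustration, and it assumes integer-coded categories with at least one observed value per column.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing category code


def chained_hot_deck(X, n_iter=5, k=5, seed=0):
    """Impute integer-coded categorical data one column at a time."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    mask = X == MISSING
    filled = X.copy()

    # Initialise each missing cell with a random observed value of its column.
    for j in range(X.shape[1]):
        observed = X[~mask[:, j], j]
        filled[mask[:, j], j] = rng.choice(observed, size=mask[:, j].sum())

    for _ in range(n_iter):
        for j in np.flatnonzero(mask.any(axis=0)):
            others = np.delete(filled, j, axis=1)
            donors = np.flatnonzero(~mask[:, j])  # rows where column j is observed
            for i in np.flatnonzero(mask[:, j]):
                # Hamming distance from row i to every donor on the other columns.
                dist = (others[donors] != others[i]).sum(axis=1)
                nearest = donors[np.argsort(dist, kind="stable")[:k]]
                # Copy the observed value of a randomly chosen near donor.
                filled[i, j] = X[rng.choice(nearest), j]
    return filled


X = np.array([
    [0, 1, 2],
    [0, 1, MISSING],
    [1, 0, 0],
    [1, MISSING, 0],
    [0, 1, 2],
    [1, 0, 0],
])
completed = chained_hot_deck(X, k=2)
print(completed)
```

The donor search here is quadratic in the number of rows, so it is only meant to show the mechanics on a toy table, not to be run on 500,000 observations. Running it several times with different seeds would give several completed datasets, which is the basic idea of multiple imputation.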
As for dimensionality reduction for categorical data (i.e. a way to arrange variables into homogeneous clusters), I would suggest Multiple Correspondence Analysis (MCA), which will give you the latent variables that maximize the homogeneity of the clusters. As in Principal Component Analysis (PCA) and Factor Analysis, the MCA solution can also be rotated to increase the simplicity of the components. The idea behind a rotation is to find subsets of variables that coincide more clearly with the rotated components, so maximizing component simplicity helps with both factor interpretation and variable clustering. In R, MCA methods are included in the packages ade4, MASS, FactoMineR and ca (at least). As for FactoMineR, you can use it through a graphical interface by installing RcmdrPlugin.FactoMineR, which adds it as an extra menu to the ones already provided by the Rcmdr package.
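For intuition about what MCA computes, here is a minimal NumPy sketch, assuming complete (already imputed) integer-coded data. It runs a plain correspondence analysis on the indicator (one-hot) matrix via SVD, which is one standard formulation of MCA. It does not do the rotation mentioned above or the inertia corrections that the R packages offer, and the function name `mca` and the toy data are my own.

```python
import numpy as np


def mca(codes, n_components=2):
    """Correspondence analysis of the indicator (one-hot) matrix."""
    codes = np.asarray(codes)
    # One-hot encode every column and stack the blocks side by side.
    blocks = [(codes[:, [j]] == np.unique(codes[:, j])).astype(float)
              for j in range(codes.shape[1])]
    Z = np.hstack(blocks)

    P = Z / Z.sum()        # correspondence matrix
    r = P.sum(axis=1)      # row masses
    c = P.sum(axis=0)      # column masses
    # Standardised residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates of the rows (observations).
    row_coords = U[:, :n_components] * sing[:n_components] / np.sqrt(r)[:, None]
    inertia = sing ** 2    # principal inertias, one per dimension
    return row_coords, inertia


data = np.array([
    [0, 1, 2],
    [0, 1, 2],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 2],
    [1, 0, 0],
])
row_coords, inertia = mca(data)
print(row_coords)
print(inertia / inertia.sum())  # share of inertia per dimension
```

The leading columns of `row_coords` can then be used as continuous predictors in the regression in place of the original categorical variables. For a dataset with 200 variables the indicator matrix gets wide, so in practice a sparse or truncated SVD would be the more sensible choice.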