r - R 中的主成分分析

使用 prcomp 函数，如何使用 无监督主成分 派生自同一数据集上的数据集，分为测试和训练？

train <- sample(1:nrow(auto), 60000)
x <- as.matrix(auto[,-1])  ##Covariates
y <- auto[,1]                   ##Response
pc <- prcomp(x)             ##Find Principal Components

data <- data.frame(y=y, (x %*% pc$rotation[,1:9]))
fit <- glm(y ~ ., data=data[train,], family="binomial")   ##Train It

prediction <- predict(fit, newdata=data) > 0  ##Prediction on Entire Data Set

error <- mean(y[-train]] != prediction[-train])  ##Mean out of Sample error

最佳答案

这是一个可重现的示例:

set.seed(1)
want <- sample(50, 40)
Iris <- iris[c(51:100, 101:150), ] ## only keep versicolor and virginica
## take our training and test sets
train <- droplevels(Iris[c((1:50)[want], (51:100)[want]), , drop = FALSE])
test <- droplevels(Iris[c((1:50)[-want], (51:100)[-want]), , drop = FALSE])

## fit the PCA
pc <- prcomp(train[, 1:4])

现在请注意 pc$x 是旋转后的数据。您使用了 X %*% pc$rotation(其中 X 是训练数据矩阵)但没有先将数据居中，但它们是等效的。回归中的中心预测变量可能有用。

## create data frame for logistic regression
mydata <- data.frame(Species = train[, "Species"], pc$x)
## ...and fit the model
mod <- glm(Species ~ PC1, data = mydata, family = binomial)

预测测试集数据在PC1上的分数；也就是说，使用与用于形成训练数据的 PC 相同的旋转来旋转测试集。为此，我们可以使用 predict() 类的 "prcomp" 方法

test.p <- predict(pc, newdata = test[, 1:4])

现在用它来预测类(class)

pred <- predict(mod, newdata = data.frame(test.p), type = "response")
pred

> pred
         56          66          67          71          72
0.080427399 0.393133104 0.092661480 0.395813527 0.048277608
         74          76          82          87          95
0.226191156 0.333553423 0.003860679 0.617977807 0.029469167
        106         116         117         121         122
0.999648054 0.922145431 0.924464339 0.989271655 0.318477762
        124         126         132         137         145
0.581235903 0.995224501 0.999770995 0.964825109 0.988121496
> 1 - pred
          56           66           67           71           72
0.9195726006 0.6068668957 0.9073385196 0.6041864731 0.9517223918
          74           76           82           87           95
0.7738088439 0.6664465767 0.9961393215 0.3820221934 0.9705308332
         106          116          117          121          122
0.0003519463 0.0778545688 0.0755356606 0.0107283449 0.6815222382
         124          126          132          137          145
0.4187640970 0.0047754987 0.0002290047 0.0351748912 0.0118785036

pred 包含测试观察是 Iris virginica 的概率。请注意，在 glm() 中，当响应是一个因素(如本例中)时，该因素的第一个级别(此处为 versicolor )被视为失败或 0 ，第二个和后续级别指示成功或 1 。因为在这个例子中只有两个类，模型是根据 versicolor 参数化的； 1 - pred 将给出 virginica 的预测概率。

我不遵循您在问题中包含的错误计算，因此将其留给您解决。但是，可以通过以下方式生成模型成功的交叉分类表:

> predSpecies <- factor(ifelse(pred >= 0.5, "virginica", "versicolor"))
> table(test$Species, predSpecies)
            predSpecies
             versicolor virginica
  versicolor          9         1
  virginica           1         9

表明我们的模型有两个测试集观察错误。

关于r - R 中的主成分分析，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/10876040/