Problem description
I am working with cross-validation data (10-fold, repeated 5 times) from an SVM-RFE model generated with the caret package. I know that caret relies on the pROC package when computing metrics, but I need to use the ROCR package to obtain the average ROC curve. However, I noticed that the average AUC values differ between the two packages, so I am not sure whether I can use them interchangeably.
The code I used to demonstrate this is:
library(pROC)
library(ROCR)

optSize <- svmRFE_NG3$optsize

# Hold-out predictions, split by subset size and then by resample (fold/repeat)
resamples <- split(svmRFE_NG3$pred, svmRFE_NG3$pred$Variables)
# Index by name: the list is named by subset size, which need not match its position
resamplesFOLD <- split(resamples[[as.character(optSize)]],
                       resamples[[as.character(optSize)]]$Resample)

n_folds <- length(resamplesFOLD)  # 10 folds x 5 repeats = 50
predictions_NG3 <- vector("list", n_folds)
labels_NG3 <- vector("list", n_folds)
auc_pROC <- numeric(n_folds)
auc_ROCR <- numeric(n_folds)

for (i in seq_len(n_folds)) {
  predictions_NG3[[i]] <- resamplesFOLD[[i]]$LUNG
  labels_NG3[[i]] <- resamplesFOLD[[i]]$obs

  # With pROC (LUNG is the positive class)
  rocCurve <- roc(response = labels_NG3[[i]],
                  predictor = predictions_NG3[[i]],
                  levels = c("BREAST", "LUNG"))
  auc_pROC[i] <- auc(rocCurve)

  # With ROCR (LUNG is the positive class)
  pred_ROCR <- prediction(predictions_NG3[[i]], labels_NG3[[i]],
                          label.ordering = c("BREAST", "LUNG"))
  auc_ROCR[i] <- performance(pred_ROCR, "auc")@y.values[[1]]
}

# Summary statistics over all resamples
auc_mean_pROC <- mean(auc_pROC)
auc_sd_pROC <- sd(auc_pROC)
auc_mean_ROCR <- mean(auc_ROCR)
auc_sd_ROCR <- sd(auc_ROCR)
The results are slightly different:
  auc_mean_pROC auc_sd_pROC auc_mean_ROCR auc_sd_ROCR
1     0.8755556   0.1524801     0.8488889   0.2072751
Looking at the individual resamples, the two packages return different AUC values in several cases, such as elements [5], [22] and [25]:
> auc_pROC
[1] 0.8333333 0.8333333 1.0000000 1.0000000 0.6666667 0.8333333 0.3333333 0.8333333 1.0000000 1.0000000 1.0000000 1.0000000
[13] 0.8333333 0.5000000 0.8888889 1.0000000 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.6666667 0.6666667 0.8888889
[25] 0.8333333 0.6666667 1.0000000 0.6666667 1.0000000 0.6666667 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 1.0000000
[37] 0.8333333 1.0000000 0.8333333 1.0000000 0.8333333 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
[49] 1.0000000 1.0000000
> auc_ROCR
[1] 0.8333333 0.8333333 1.0000000 1.0000000 0.3333333 0.8333333 0.3333333 0.8333333 1.0000000 1.0000000 1.0000000 1.0000000
[13] 0.8333333 0.5000000 0.8888889 1.0000000 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.3333333 0.6666667 0.8888889
[25] 0.1666667 0.6666667 1.0000000 0.6666667 1.0000000 0.6666667 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 1.0000000
[37] 0.8333333 1.0000000 0.8333333 1.0000000 0.8333333 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
[49] 1.0000000 1.0000000
I have tried other SVM-RFE models, but the problem remains. Why is this happening? Am I doing something wrong?
Recommended answer
By default, the roc function in pROC tries to detect two things: which response level corresponds to the controls and which to the cases (you overrode this default by setting the levels argument), and whether the controls should have higher or lower values than the cases. You have not set the direction argument, so the latter is still detected automatically.
When you resample your data, this auto-detection runs separately for every resample. If your sample size is small, or your AUC is close to 0.5, some ROC curves will inevitably be built in the opposite direction, which biases your average towards higher values. ROCR always uses the fixed ordering you supply, which is why the two packages disagree on those resamples.
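The bias described above can be reproduced without either package. The following is a minimal base-R sketch, not pROC's actual code: `auc_fixed` and `auc_auto` are hypothetical helpers, and `auc_auto` assumes the automatic direction is chosen by comparing the group medians, as the pROC documentation describes for `direction = "auto"`. The toy scores are made up for illustration and are not the asker's data.

```r
# AUC with a fixed direction: proportion of (case, control) pairs in which
# the case scores higher than the control (ties count as one half).
auc_fixed <- function(scores, labels, positive) {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

# Imitation of automatic direction detection (assumption: median comparison).
# If the controls have the higher median, the curve is built the other way
# round, which turns an AUC of a into 1 - a.
auc_auto <- function(scores, labels, positive) {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  a <- auc_fixed(scores, labels, positive)
  if (median(neg) <= median(pos)) a else 1 - a
}

# One made-up fold in which the classifier ranks most cases below the controls
toy_scores <- c(0.1, 0.2, 0.5, 0.4, 0.9)
toy_labels <- c("LUNG", "LUNG", "LUNG", "BREAST", "BREAST")
toy_fixed <- auc_fixed(toy_scores, toy_labels, "LUNG")  # 1/6
toy_auto  <- auc_auto(toy_scores, toy_labels, "LUNG")   # 5/6

# Many small folds with uninformative scores: compare the two averages
set.seed(1)
sim <- replicate(200, {
  labels <- rep(c("BREAST", "LUNG"), each = 3)
  scores <- runif(6)
  c(fixed = auc_fixed(scores, labels, "LUNG"),
    auto  = auc_auto(scores, labels, "LUNG"))
})
print(rowMeans(sim))
```

The toy fold shows the same 0.1666667 versus 0.8333333 pattern as element [25] in the question: the fixed-direction value is reported as is, while the auto-detected one is flipped. In the simulation the averaged "auto" value should come out above the "fixed" one even though the scores carry no information.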
Therefore, you should always set the direction argument explicitly when you resample ROC curves or do anything similar, for instance:
# direction = "<" means controls (BREAST) are expected to score lower than cases (LUNG)
rocCurve <- roc(response = labels_NG3[[i]],
                predictor = predictions_NG3[[i]],
                direction = "<",
                levels = c("BREAST", "LUNG"))
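To check that the two packages agree once the direction is fixed, a small sketch on synthetic folds may help. It assumes the pROC and ROCR packages are installed; `fold_auc` is a hypothetical helper introduced here, and the random scores stand in for real hold-out predictions.

```r
# Hypothetical helper: AUC of one fold from both packages, with LUNG as the
# positive class and the direction fixed explicitly for pROC.
fold_auc <- function(scores, labels) {
  roc_obj <- pROC::roc(response = labels, predictor = scores,
                       levels = c("BREAST", "LUNG"), direction = "<")
  pred <- ROCR::prediction(scores, labels,
                           label.ordering = c("BREAST", "LUNG"))
  c(pROC = as.numeric(pROC::auc(roc_obj)),
    ROCR = ROCR::performance(pred, "auc")@y.values[[1]])
}

# Run the comparison only if both packages are available
if (requireNamespace("pROC", quietly = TRUE) &&
    requireNamespace("ROCR", quietly = TRUE)) {
  set.seed(42)
  # 50 synthetic folds of 3 BREAST and 3 LUNG samples with random scores
  folds <- replicate(50, {
    labels <- factor(rep(c("BREAST", "LUNG"), each = 3),
                     levels = c("BREAST", "LUNG"))
    fold_auc(runif(6), labels)
  })
  print(rowMeans(folds))                                 # mean AUC per package
  print(all.equal(folds["pROC", ], folds["ROCR", ]))     # per-fold agreement
}
```

With the direction fixed, both rows of `folds` should match fold by fold, so the averaged AUC values are directly comparable.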