Question
I'm working on a very unbalanced classification problem, and I'm using AUPRC as the metric in caret. For the test set, I'm getting very different AUPRC results from caret and from the package PRROC.
To keep things simple, the reproducible example uses the PimaIndiansDiabetes dataset from the mlbench package:
rm(list = ls())
library(caret)
library(mlbench)
library(PRROC)
# load data, renaming it to 'datos'
data(PimaIndiansDiabetes)
datos <- PimaIndiansDiabetes[, 1:9]
# training and test split
set.seed(998)
inTraining <- createDataPartition(datos[, 9], p = .8, list = FALSE)
training <- datos[inTraining, ]
testing <- datos[-inTraining, ]
# training
control <- trainControl(method = "cv", summaryFunction = prSummary,
                        classProbs = TRUE)
set.seed(998)
rf.tune <- train(training[, 1:8], training[, 9], method = "rf",
                 trControl = control, metric = "AUC")
# evaluating AUPRC on the test set
matriz <- cbind(testing[, 9], predict(rf.tune, testing[, 1:8], type = "prob"),
                predict(rf.tune, testing[, 1:8]))
names(matriz) <- c("obs", levels(testing[, 9]), "pred")
prSummary(matriz, levels(testing[, 9]))
# calculating AUPRC through pr.curve
# checking the positive class
confusionMatrix(predict(rf.tune, testing[, 1:8]), testing[, 9],
                mode = "prec_recall") # 'Positive' Class : neg
# preparing data for pr.curve
indice_POS <- which(testing[, 9] == "neg")
indice_NEG <- which(testing[, 9] == "pos")
# the classification scores of only the data points belonging to the
# positive class
clas_score_POS <- predict(rf.tune, testing[, 1:8], type = "prob")[indice_POS, 1]
# the classification scores of only the data points belonging to the
# negative class
clas_score_NEG <- predict(rf.tune, testing[, 1:8], type = "prob")[indice_NEG, 2]
pr.curve(clas_score_POS, clas_score_NEG)
The value from PRROC is 0.9053432 and the one from caret's prSummary is 0.8714607. In my unbalanced case the differences are broader (AUPRC = 0.1688446 with SMOTE resampling, via control$sampling <- "smote", versus 0.01429 with PRROC).
Is this because of the different methods used to calculate AUPRC in those packages, or am I doing something wrong?
UPDATE: I can't find any bug in my code. After missuse's answer, I'd like to make some remarks:
When you run prSummary(matriz, levels(testing[,9])) you get
AUC Precision Recall F
0.8714607 0.7894737 0.9000000 0.8411215
whereas with
confusionMatrix(predict(rf.tune,testing[,1:8]),testing[,9],mode = "prec_recall")
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 90 23
pos 10 30
Accuracy : 0.7843
95% CI : (0.7106, 0.8466)
No Information Rate : 0.6536
P-Value [Acc > NIR] : 0.0003018
Kappa : 0.4945
Mcnemar's Test P-Value : 0.0367139
Precision : 0.7965
Recall : 0.9000
F1 : 0.8451
Prevalence : 0.6536
Detection Rate : 0.5882
Detection Prevalence : 0.7386
Balanced Accuracy : 0.7330
'Positive' Class : neg
And also:
> MLmetrics::PRAUC(y_pred = matriz$neg, y_true = ifelse(matriz$obs == "neg", 1, 0))
[1] 0.8714607
As you can see in the last line, the 'Positive' class is 'neg', and I think that missuse is treating 'pos' as the positive class, so we end up with different metrics. Moreover, when you print the trained rf, the results are also consistent with an expected AUC of ~0.87:
> rf.tune
Random Forest
615 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 554, 553, 553, 554, 554, 554, ...
Resampling results across tuning parameters:
mtry AUC Precision Recall F
2 0.8794965 0.7958683 0.8525 0.8214760
5 0.8786427 0.8048463 0.8325 0.8163032
8 0.8528028 0.8110820 0.8325 0.8192225
I'm not worried about the 0.87 (caret) vs. 0.90 (PRROC) difference in this case, but I'm very worried about the 0.1688446 (caret) vs. 0.01429 (PRROC) gap in the unbalanced case. Might this be because the numerical divergence between the implementations is amplified in the unbalanced case? And if there is a numerical difference between the implementations, how can they both give the identical 0.8714607 on the test set?
Answer
I believe you are making several mistakes in your code.
First of all, caret::prSummary uses MLmetrics::PRAUC to compute the AUPRC. It should be called like this:
MLmetrics::PRAUC(y_pred = matriz$pos, y_true = ifelse(matriz$obs == "pos", 1, 0))
#output
0.7066323
That is, using the positive-class probabilities and a numeric 0/1 vector of the true classes (1 for the positive class).
The same result is obtained using:
caret::prSummary(matriz, levels(testing[,9])[2])
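For reference, if I read the caret source correctly, prSummary builds its AUC column by handing the probability column for the first element of lev to MLmetrics::PRAUC, roughly like the following paraphrased sketch (an approximation, not the verbatim caret code):
lev <- levels(testing[,9])[2]   # here only "pos" was passed as the level
MLmetrics::PRAUC(y_pred = matriz[, lev[1]],
                 y_true = ifelse(matriz$obs == lev[1], 1, 0))
# with the full levels(testing[,9]), lev[1] is "neg", which is why the
# original call reported 0.8714607
This would also explain why caret and MLmetrics agree exactly on the test set: prSummary is essentially a wrapper around the same MLmetrics computation.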
MLmetrics::PRAUC uses ROCR::prediction to build the curve:
pred_obj <- ROCR::prediction(matriz$pos, ifelse(matriz$obs == "pos", 1, 0))
perf_obj <- ROCR::performance(pred_obj, measure = "prec",
                              x.measure = "rec")
The curve looks like this:
ROCR::plot(perf_obj, ylim = c(0,1))
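If you want to inspect the raw recall/precision pairs behind this plot, they can be pulled out of the ROCR performance object's slots, e.g.:
head(data.frame(recall    = perf_obj@x.values[[1]],
                precision = perf_obj@y.values[[1]]))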
When one uses PRROC::pr.curve, there are several ways to define the inputs. One is to provide a vector of positive-class probabilities for the positive observations and a vector of positive-class probabilities for the negative observations:
preds <- predict(rf.tune,
                 testing[, 1:8],
                 type = "prob")[, 2] # prob of the positive class
preds_pos <- preds[testing[,9]=="pos"] #preds for true positive class
preds_neg <- preds[testing[,9]=="neg"] #preds for true negative class
PRROC::pr.curve(preds_pos, preds_neg)
#truncated output
0.7254904
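As a side note, if I remember the PRROC interface correctly, pr.curve can also take a single score vector plus 0/1 class labels through the weights.class0 argument; a minimal sketch reusing preds from above (treat this as an assumption and check ?pr.curve):
lab <- ifelse(testing[,9] == "pos", 1, 0)   # 1 marks the positive class
PRROC::pr.curve(scores.class0 = preds, weights.class0 = lab)
# should report the same area (~0.725) as the two-vector call above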
These two numbers, 0.7254904 from PRROC::pr.curve and 0.7066323 from MLmetrics::PRAUC, do not match. However, the curve
plot(PRROC::pr.curve(preds_pos, preds_neg, curve = TRUE))
looks just like the one obtained above using ROCR::plot.
To check:
res <- PRROC::pr.curve(preds_pos, preds_neg, curve = TRUE)
ROCR::plot(perf_obj, ylim = c(0,1), lty = 2, lwd = 2)
lines(res$curve[,1], res$curve[,2], col = "red", lty = 5)
They are the same. Therefore, the difference in the computed area is due to the different implementations in the mentioned packages.
These implementations can be checked by looking at the source of:
MLmetrics:::Area_Under_Curve # this one looks pretty straightforward
PRROC:::compute.pr # haven't had time to study this one, but if I had to bet, I'd say it is more accurate for step-like curves
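To see what the straightforward route amounts to, here is a minimal sketch of a trapezoidal integration over the ROCR recall/precision points from perf_obj above, which is roughly what MLmetrics:::Area_Under_Curve does (my paraphrase under that assumption, not the package code):
rec  <- perf_obj@x.values[[1]]
prec <- perf_obj@y.values[[1]]
ok   <- is.finite(rec) & is.finite(prec)   # drop the undefined precision at recall 0
rec  <- rec[ok]; prec <- prec[ok]
sum(diff(rec) * (head(prec, -1) + tail(prec, -1)) / 2)
# should land close to the 0.7066323 reported by MLmetrics::PRAUC;
# PRROC interpolates between the PR points differently, which is where
# the gap in the reported areas comes from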