I have the following XGBoost CV model:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 20,
                         nfold = 3,
                         metrics = "auc",
                         verbose = TRUE,
                         eval_metric = "auc",
                         objective = "binary:logistic",
                         max_depth = 6,
                         eta = 0.01,
                         subsample = 0.5,
                         colsample_bytree = 1,
                         print_every_n = 1,
                         min_child_weight = 1,
                         booster = "gbtree",
                         early_stopping_rounds = 10,
                         watchlist = watchlist,
                         seed = 1234)
My question is about the model's output and nfold. I set nfold = 3, and the evaluation log output looks like this:
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1 1 0.8852290 0.0023585703 0.8598630 0.005515424
2 2 0.9015413 0.0018569007 0.8792137 0.003765109
3 3 0.9081027 0.0014307577 0.8859040 0.005053600
4 4 0.9108463 0.0011838160 0.8883130 0.004324113
5 5 0.9130350 0.0008863908 0.8904100 0.004173123
6 6 0.9143187 0.0009514359 0.8910723 0.004372844
7 7 0.9151723 0.0010543653 0.8917300 0.003905284
8 8 0.9162787 0.0010344935 0.8929013 0.003582747
9 9 0.9173673 0.0010539116 0.8935753 0.003431949
10 10 0.9178743 0.0011498505 0.8942567 0.002955511
11 11 0.9182133 0.0010825702 0.8944377 0.003051411
12 12 0.9185767 0.0011846632 0.8946267 0.003026969
13 13 0.9186653 0.0013352629 0.8948340 0.002526793
14 14 0.9190500 0.0012537195 0.8954053 0.002636388
15 15 0.9192453 0.0010967155 0.8954127 0.002841402
16 16 0.9194953 0.0009818501 0.8956447 0.002783787
17 17 0.9198503 0.0009541517 0.8956400 0.002590862
18 18 0.9200363 0.0009890185 0.8957223 0.002580398
19 19 0.9201687 0.0010323405 0.8958790 0.002508695
20 20 0.9204030 0.0009725742 0.8960677 0.002581329
Since I set nrounds = 20 but used nfold = 3 for cross-validation, should the output have 60 rows rather than 20? Or, as the column names suggest, is each row the mean AUC score for that round across the folds?
So on the training set at nround = 1, the train_auc_mean value of 0.8852290 would be the average over the 3 cross-validation folds? And if I plot these AUC scores, I would be plotting the mean AUC across the 3-fold cross-validation? Just want to make sure everything is clear.
Best answer
You are correct: the reported values are the mean AUC across the folds, so plotting test_auc_mean plots the fold-averaged score at each boosting round.
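As a minimal sketch of such a plot (assuming the xgboostModelCV object from the question, plus ggplot2):
library(ggplot2)
# test_auc_mean is already the average over the 3 folds at each round
ggplot(xgboostModelCV$evaluation_log, aes(x = iter, y = test_auc_mean)) +
  geom_line() +
  labs(x = "boosting round", y = "mean test AUC (3-fold CV)")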
However, if you want to extract the individual per-fold AUC for the best/last iteration, you can proceed as follows, using the Sonar data set from mlbench:
library(xgboost)
library(tidyverse)
library(mlbench)
library(pROC)   # provides roc() and auc(), used below
data(Sonar)
# labels: factor levels "M"/"R" mapped to 0/1
xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[, 1:60]),
                              label = as.numeric(Sonar$Class) - 1)
param <- list(objective = "binary:logistic")
Set prediction = TRUE in the xgb.cv call:
model.cv <- xgb.cv(params = param,
                   data = xgb.train.data,
                   nrounds = 50,
                   early_stopping_rounds = 10,
                   nfold = 3,
                   prediction = TRUE,   # keep the out-of-fold predictions
                   eval_metric = "auc")
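With prediction = TRUE, the returned object carries the out-of-fold predictions (model.cv$pred) and the fold membership (model.cv$folds), which the next step relies on. A quick inspection sketch:
length(model.cv$pred)  # one out-of-fold prediction per row of Sonar (208)
str(model.cv$folds)    # list of 3 integer vectors: held-out row indices per fold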
Now loop over the folds and join each fold's predictions with the true labels and the corresponding row indices:
z <- lapply(model.cv$folds, function(x){
  pred <- model.cv$pred[x]                    # out-of-fold predictions for this fold
  true <- (as.numeric(Sonar$Class) - 1)[x]    # matching true labels
  index <- x                                  # row indices in Sonar
  data.frame(pred, true, index)
})
Name the folds:
names(z) <- paste("folds", 1:3, sep = "_")
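Each element of z now pairs one fold's out-of-fold predictions with the true labels; a quick look at the structure (output omitted, values depend on the run):
str(z$folds_1)
# columns: pred  = out-of-fold predicted probability
#          true  = numeric label (0 = "M", 1 = "R")
#          index = row number in Sonar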
Then compute the AUC within each fold:
z %>%
  bind_rows(.id = "id") %>%
  group_by(id) %>%
  summarise(auroc = roc(true, pred) %>% auc())
#output
# A tibble: 3 x 2
id auroc
<chr> <dbl>
1 folds_1 0.944
2 folds_2 0.900
3 folds_3 0.899
The mean of these per-fold values is the same as the mean test AUC at the best iteration:
z %>%
bind_rows(.id = "id") %>%
group_by(id) %>%
summarise(auroc = roc(true, pred) %>%
auc()) %>%
pull(auroc) %>%
mean
#output
[1] 0.9143798
Compare with the evaluation log at the best iteration:
model.cv$evaluation_log[model.cv$best_iteration,]
#output
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1: 48 1 0 0.91438 0.02092817
You can of course go further, for example by plotting the ROC curve for each fold (a sketch follows), and so on.
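A sketch of that, reusing the z list from above with pROC's ggroc() helper:
library(pROC)
library(ggplot2)
# one roc object per fold; ggroc() overlays them in one panel,
# coloured by the list names (folds_1 ... folds_3)
roc_list <- lapply(z, function(df) roc(df$true, df$pred))
ggroc(roc_list) +
  labs(colour = "fold", title = "Per-fold ROC curves, 3-fold CV")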
About r - understanding xgboost cross validation and AUC output results: a similar question on Stack Overflow: https://stackoverflow.com/questions/49580910/