r - Can an XGBoost model consistently achieve 100% accuracy?

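I am working with Airbnb's data (available on Kaggle here) and predicting which countries users will book their first trips to, using an XGBoost model with almost 600 features in R. I get 100% accuracy every time. After fitting the model to the training data and predicting on a held-out test set, I also get 100% accuracy. These results cannot be real. There must be something wrong with my code, but so far I have not been able to figure it out. I have included part of my code below. It is based on this article. When I follow along with the article (using the article's data and copying its code), I get results similar to the article's. However, when I apply it to Airbnb's data, I consistently get 100% accuracy. I have no idea what is happening. Am I using the xgboost package incorrectly? Thank you for your help and time.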

# set up the data
# train is the data frame of features with the target variable to predict
full_variables <- data.matrix(train[,-1]) # country_destination removed
full_label <- as.numeric(train$country_destination) - 1

# training data
train_index <- caret::createDataPartition(y = train$country_destination, p = 0.70, list = FALSE)
train_data <- full_variables[train_index, ]
train_label <- full_label[train_index[,1]]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)

# test data
test_data <- full_variables[-train_index, ]
test_label <- full_label[-train_index[,1]]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)

# 5-fold CV
params <- list("objective" = "multi:softprob",
               "num_class" = classes,
               eta = 0.3,
               max_depth = 6)
cv_model <- xgb.cv(params = params,
               data = train_matrix,
               nrounds = 50,
               nfold = 5,
               early_stop_round = 1,
               verbose = F,
               maximize = T,
               prediction = T)

# out of fold predictions
out_of_fold_p <- data.frame(cv_model$pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)
head(out_of_fold_p)

# confusion matrix
confusionMatrix(factor(out_of_fold_p$label),
                factor(out_of_fold_p$max_prob),
                mode = "everything")

A sample of the data I am using can be obtained by running the following code:

library(RCurl)
x <- getURL("https://raw.githubusercontent.com/loshita/Senior_project/master/train.csv")
y <- read.csv(text = x)

Best answer

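If you are using the train_users_2.csv.zip file available on Kaggle, then the problem is that you are not removing country_destination from the training dataset, because it is at position 16, not position 1.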

which(colnames(train) == "country_destination")
#output
16

Column 1 is id, which is unique for each observation and should also be removed.

length(unique(train[,1])) == nrow(train)
#output
TRUE
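Positional indexing is what made this mistake easy to miss, so one option is to drop columns by name instead. The following is a minimal sketch on a small made-up data frame; toy_train, its columns other than id and country_destination, and drop_cols are illustrative assumptions, not the Kaggle data.

```r
# Hypothetical toy data standing in for the real training set
toy_train <- data.frame(
  id = c("a1", "b2", "c3", "d4"),
  age = c(31, 25, 40, 28),
  signup_app = c("Web", "iOS", "Web", "Android"),
  country_destination = factor(c("US", "NDF", "FR", "US")),
  stringsAsFactors = TRUE
)

# Drop the identifier and the target by name, wherever they sit in the frame
drop_cols <- c("id", "country_destination")
keep <- !(colnames(toy_train) %in% drop_cols)
full_variables <- data.matrix(toy_train[, keep, drop = FALSE])

# Zero-based numeric labels, as in the question
full_label <- as.numeric(toy_train$country_destination) - 1

# Fail loudly if the target or the id is still among the features
stopifnot(!any(drop_cols %in% colnames(full_variables)))
```

Selecting by name keeps working if the column order changes, and the final stopifnot() turns a silent leak into an immediate error.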

When I run your code with the following modifications:

full_variables <- data.matrix(train[,-c(1, 16)])

library(xgboost)

params <- list("objective" = "multi:softprob",
               "num_class" = length(unique(train_label)),
               eta = 0.3,
               max_depth = 6)
cv_model <- xgb.cv(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   nfold = 5,
                   early_stop_round = 1,
                   verbose = T,
                   maximize = T,
                   prediction = T)

With the settings above, I get a test error of 0.12 during cross-validation.

out_of_fold_p <- data.frame(cv_model$pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)

head(out_of_fold_p[,13:14], 20)
#output
   max_prob label
1         8     8
2        12    12
3        12    10
4        12    12
5        12    12
6        12    12
7        12    12
8        12    12
9         8     8
10       12     5
11       12     2
12        2    12
13       12    12
14       12    12
15       12    12
16        8     8
17        8     8
18       12     5
19        8     8
20       12    12

To sum up, you did not remove y (the target) from x (the features).
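A rough way to catch this kind of leak before training is to look for feature columns whose values fully determine the label. The sketch below is only a heuristic under stated assumptions: flag_suspicious is a hypothetical helper, the toy matrix is made up, and a flagged column is something to inspect rather than proof of leakage (a unique identifier or any all-distinct numeric column will also be flagged).

```r
# Hypothetical helper: names of columns where every distinct value maps to
# exactly one label. Rows where the column is NA are ignored by tapply().
flag_suspicious <- function(x, y) {
  determines_label <- vapply(seq_len(ncol(x)), function(j) {
    labels_per_value <- tapply(y, x[, j], function(v) length(unique(v)))
    all(labels_per_value == 1)
  }, logical(1))
  colnames(x)[determines_label]
}

# Made-up example: id is unique per row, leaked is a recoding of the label
x <- cbind(id     = 1:6,
           age    = c(30, 25, 30, 25, 40, 40),
           leaked = c(1, 1, 2, 2, 3, 3))
y <- c(0, 0, 1, 1, 2, 2)

suspicious <- flag_suspicious(x, y)
print(suspicious)
```

On the toy data this flags id and leaked but not age, because age 30 and age 25 each occur with two different labels.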

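EDIT: After downloading the actual train set and playing around with it, I can say that the accuracy in 5-fold CV really is 100%. Moreover, it is achieved with just 22 features (and possibly even fewer).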

model <- xgboost(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   verbose = T,
                   maximize = T)

This model also achieves 100% accuracy on the test set:

pred <- predict(model, test_matrix)
pred <- matrix(pred, ncol=length(unique(train_label)), byrow = TRUE)
out_of_fold_p <- data.frame(pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = test_label + 1)

sum(out_of_fold_p$max_prob != out_of_fold_p$label) #0 errors
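The reshaping step above is easy to get wrong, so here it is in isolation. This is a minimal sketch that assumes, as the code above implies for the xgboost version in use, that predict() with multi:softprob returns one flat vector with the class probabilities of each observation stored consecutively. The numbers are made up for illustration.

```r
k <- 4  # number of classes in this toy example

# Hypothetical flat prediction vector: 3 observations x 4 class probabilities
pred <- c(0.10, 0.70, 0.10, 0.10,
          0.60, 0.20, 0.10, 0.10,
          0.05, 0.05, 0.10, 0.80)
true_label <- c(1, 0, 2)  # zero-based labels, as passed to xgb.DMatrix

# byrow = TRUE puts each observation on its own row, one column per class
prob <- matrix(pred, ncol = k, byrow = TRUE)

# max.col() returns 1-based column indices; subtract 1 to match the labels
pred_class <- max.col(prob, ties.method = "last") - 1

accuracy <- mean(pred_class == true_label)
print(accuracy)
```

Leaving out byrow = TRUE would fill the matrix column by column and mix probabilities from different observations, which produces confident-looking but meaningless class predictions.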

Now let's check which features are discriminative:

xgb.plot.importance(importance_matrix = xgb.importance(colnames(train_matrix), model))

Now, if you run xgb.cv using only these features:

train_matrix <- xgb.DMatrix(data = train_data[,which(colnames(train_data) %in% xgboost::xgb.importance(colnames(train_matrix), model)$Feature)], label = train_label)

set.seed(1)
cv_model <- xgb.cv(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   nfold = 5,
                   early_stop_round = 1,
                   verbose = T,
                   maximize = T,
                   prediction = T)

You will also get 100% accuracy on the test folds.

The reason is partly the huge class imbalance:

table(train_label)
train_label
  0   1   2   3   4   5   6   7   8   9  10  11
  3  10  12  13  36  16  19 856   7  73   3 451
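To put the imbalance in numbers, the counts from the table above can be turned into a trivial baseline: the accuracy of always predicting the largest class. This is a small sketch that only re-enters the counts shown above; class_counts and the share variables are names chosen for illustration.

```r
# Class counts copied from table(train_label) above
class_counts <- c(3, 10, 12, 13, 36, 16, 19, 856, 7, 73, 3, 451)
names(class_counts) <- 0:11

# Accuracy of a model that always predicts the most frequent class
majority_share <- max(class_counts) / sum(class_counts)

# Share of observations covered by the two largest classes
top_two_share <- sum(sort(class_counts, decreasing = TRUE)[1:2]) / sum(class_counts)

print(round(c(majority = majority_share, top_two = top_two_share), 3))
```

Always predicting class 7 is already right about 57% of the time, so overall accuracy says little about the small classes; per-class results from the confusion matrix are more informative here.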

The minority classes are very easy to distinguish by a single dummy variable:

gg <- data.frame(train_data[,which(colnames(train_data) %in% xgb.importance(colnames(train_matrix), model)$Feature)], label = as.factor(train_label))

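library(tibble)   # as_tibble
library(dplyr)    # %>%, select
library(tidyr)    # gather
library(ggplot2)  # ggplot, geom_bar, facet_grid

gg %>%
  as_tibble() %>%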
  select(1:9, 11, 12, 15:21, 23) %>%
  gather(key, value, 1:18) %>%
  ggplot()+
  geom_bar(aes(x = label))+
  facet_grid(key ~ value) +
  theme(strip.text.y = element_text(angle = 90))

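Judging by the 0/1 distributions of the 22 most important features, it seems to me that any tree model would be able to achieve pretty good accuracy, if not 100%.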

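One might expect classes 0 and 10 to be problematic in 5-fold CV, since all of their observations could land in a single fold, in which case the model trained for that fold would never have seen them. That would be possible if the CV folds were built by simple random sampling. It does not happen with xgb.cv: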

lapply(cv_model$folds, function(x){
  table(train_label[x])})
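The fold tables above show the rare classes spread across folds, which is what stratified sampling produces (xgb.cv has a stratified argument that, as far as I know, defaults to TRUE). The sketch below illustrates the idea only; stratified_folds is a hypothetical helper written for this example and is not xgb.cv's actual implementation.

```r
set.seed(1)

# Made-up labels: one very rare class (0) and two large ones
toy_label <- rep(c(0, 7, 11), times = c(3, 60, 37))

# Hypothetical helper: shuffle the rows of each class, then deal them out
# to the folds in turn so that every class is spread as evenly as possible
stratified_folds <- function(label, nfold) {
  fold_id <- integer(length(label))
  for (cls in unique(label)) {
    idx <- which(label == cls)
    if (length(idx) > 1) idx <- sample(idx)  # sample() misbehaves on length-1 input
    fold_id[idx] <- rep_len(seq_len(nfold), length(idx))
  }
  split(seq_along(label), fold_id)
}

folds <- stratified_folds(toy_label, nfold = 5)

# Number of class-0 rows held out in each fold
rare_per_fold <- sapply(folds, function(i) sum(toy_label[i] == 0))
print(rare_per_fold)
```

With three observations of class 0 and five folds, no fold holds out more than one of them, so every training split still contains at least two examples of that class.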

Source: "r - Can an XGBoost model consistently achieve 100% accuracy?" on Stack Overflow: https://stackoverflow.com/questions/48697770/