Problem description
In my work project, I use the rfe function from the caret package to do recursive feature elimination. Here is a toy example to illustrate my point.
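library(mlbench)
library(caret)
library(dplyr)    # %>%, mutate(), filter(), distinct() are used below
library(ggplot2)  # needed for the plot below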
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")
By default, the optimal variables are those that give the highest AUROC during resampling, and they can be retrieved with results$optVariables. However, what I want is to use the "one standard error rule" to select fewer features (code below). The number of variables identified this way is 4.
# auc that is 1-se from the highest auc
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]
# plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) +
geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE),
width=.2, alpha=0.4, linetype=1) +
geom_line() +
geom_point()+
scale_color_brewer(palette="Paired")+
geom_hline(yintercept = ROC.1se)+
labs(x ="Number of Variables", y = "AUROC")
print(g)
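The threshold computed above can also be turned into a subset size directly instead of reading it off the plot. A minimal sketch in base R, assuming a summary table with the columns Variables, ROC and ROCSD as in results$results. The helper name pick_size_1se is hypothetical, and the numbers are copied from the seeded run printed further down in the answer, so they need not agree with an unseeded run:

```r
# Resampling summary copied from the seeded rfe run shown in the answer
# (same column names as results$results).
perf <- data.frame(
  Variables = 1:8,
  ROC   = c(0.7250, 0.7842, 0.8004, 0.8139, 0.8164, 0.8263, 0.8314, 0.8316),
  ROCSD = c(0.07300, 0.04690, 0.02823, 0.03210, 0.02850, 0.03310, 0.03075, 0.02359)
)

# Hypothetical helper: smallest subset size whose mean ROC is within one
# standard error of the best mean ROC.
# sqrt(n_resamples - 1) mirrors the question's code; sqrt(n_resamples) is
# the more common definition of the standard error of a mean.
pick_size_1se <- function(perf, n_resamples = 10) {
  se <- perf$ROCSD / sqrt(n_resamples - 1)
  best <- which.max(perf$ROC)
  threshold <- perf$ROC[best] - se[best]
  min(perf$Variables[perf$ROC >= threshold])
}

size_1se <- pick_size_1se(perf)
print(size_1se)
```

For these particular numbers the rule picks 6 variables rather than 4, which shows how much the answer depends on the random folds; calling set.seed() before rfe makes the result reproducible.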
So the number of variables I identified is 4. Now I need to know which four variables they are. I tried the following:
results$variables %>% filter(Variables==4) %>% distinct(var)
It shows me 5 variables!
Does anyone know how I can retrieve those variables? More generally, I would like to get the selected variables for any given subset size.
Thanks a lot!
Recommended answer
One-line answer
If you know you only want the best 4 variables from the rfe resampling, this will give you what you are looking for.
results$optVariables[1:4]
# [1] "glucose" "mass" "age" "pregnant"
dplyr answer
results$variables %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(-Overall)
#
# A tibble: 8 x 2
# var Overall
# <chr> <dbl>
# 1 glucose 34.2
# 2 mass 15.8
# 3 age 12.7
# 4 pregnant 7.92
# 5 pedigree 5.09
# 6 insulin 4.87
# 7 triceps 3.25
# 8 pressure 1.95
Why your attempt gave more than four variables
You are filtering down to 40 observations: 10 folds times the best 4 variables in each fold. The best 4 variables are not always the same in every fold. Hence, to get the top 4 variables across the resamples, you need to average their importance (Overall) across the folds, as the code above does. Even simpler, the variables within optVariables are already sorted in this order, so you can just grab the first 4 (as in my one-line answer). Proving that this is the case takes a bit of digging into the source code (shown below).
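To make the fold disagreement concrete, here is a small sketch in base R. The data frame vars4 is a made-up stand-in for results$variables filtered to Variables == 4, reduced to three folds; the fold labels and the fold that keeps pedigree are invented for illustration:

```r
# Toy stand-in for results$variables filtered to Variables == 4.
# Three folds, each keeping its own top 4 (invented for illustration).
vars4 <- data.frame(
  Resample = rep(c("Fold01", "Fold02", "Fold03"), each = 4),
  var = c("glucose", "mass", "age", "pregnant",
          "glucose", "mass", "age", "pedigree",
          "glucose", "mass", "age", "pregnant"),
  stringsAsFactors = FALSE
)

# Every fold contributes exactly 4 rows ...
rows_per_fold <- table(vars4$Resample)

# ... but the union across folds is larger, because Fold02 kept "pedigree".
distinct_vars <- unique(vars4$var)

# Number of folds in which each variable was kept.
fold_counts <- sort(table(vars4$var), decreasing = TRUE)

print(rows_per_fold)
print(distinct_vars)
print(fold_counts)
```

The same effect explains why distinct(var) returned 5 names in the question: it reports every variable that made the top 4 in at least one fold.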
Details: digging into the source code
A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods exist that give you very helpful information. For example...
# Run rfe with a random seed
# library(dplyr)
# library(mlbench)
# library(caret)
# data(PimaIndiansDiabetes)
# rfFuncs$summary <- twoClassSummary
# control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# set.seed(1)
# results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
# rfeControl=control, metric="ROC")
#
# The next two lines identical...
results
print(results)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
# 1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322
# 2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177
# 3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456
# 4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727
# 5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288
# 6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959
# 7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232
# 8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316 *
#
# The top 5 variables (out of 8):
# glucose, mass, age, pregnant, pedigree
Hmm, that gives 5 variables, but you said you wanted 4. We can pretty quickly dig into the source code to see how it computes and returns those 5 variables as the top 5.
print(caret:::print.rfe)
#
# Only a snippet code shown below...
# cat("The top ", min(top, x$bestSubset), " variables (out of ",
# x$bestSubset, "):\n ", paste(x$optVariables[1:min(top,
# x$bestSubset)], collapse = ", "), "\n\n", sep = "")
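As a side note on the lookup itself, the ::: operator reaches into a package namespace. A hedged alternative is getS3method() from utils, which finds a method by generic and class. The sketch below uses the base class "table" purely as a stand-in so that it runs without caret:

```r
# getS3method() looks up an S3 method by generic name and class.
# "table" is only a stand-in class from base R for this illustration.
m <- getS3method("print", "table")
print(is.function(m))

# With caret loaded, the analogous call would be getS3method("print", "rfe").
```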
So, basically it pulls the top 5 variables directly from results$optVariables. How does that get populated?
# print(caret:::rfe.default)
#
# Snippet 1 of code...
# bestVar <- rfeControl$functions$selectVar(selectedVars,
#     bestSubset)
#
# Snippet 2 of code...
# bestSubset = bestSubset, fit = fit, optVariables = bestVar,
OK, optVariables is populated by rfeControl$functions$selectVar.
print(rfeControl)
#
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions,
From above, we see that caretFuncs$selectVar is the default. The example passes functions=rfFuncs, so rfFuncs$selectVar is what actually runs; as far as I can tell it performs the same averaging...
Details: source code that populates optVariables
print(caretFuncs$selectVar)
# function (y, size)
# {
# finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall,
# na.rm = TRUE))
# names(finalImp)[2] <- "Overall"
# finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE),
# ]
# as.character(finalImp$var[1:size])
# }
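The same logic can be written without plyr, which is handy for pulling the top variables for any subset size. A minimal sketch in base R: top_vars is a hypothetical helper, not part of caret, and the toy table only imitates the var and Overall columns of results$variables with invented values:

```r
# Hypothetical base-R helper mirroring the selectVar logic shown above:
# average Overall per variable, sort descending, keep the first `size`.
top_vars <- function(y, size) {
  imp <- aggregate(Overall ~ var, data = y, FUN = mean, na.rm = TRUE)
  imp <- imp[order(imp$Overall, decreasing = TRUE), ]
  as.character(imp$var[seq_len(min(size, nrow(imp)))])
}

# Toy importance table (invented values): two folds, five variables.
toy <- data.frame(
  var = rep(c("glucose", "mass", "age", "pregnant", "pedigree"), times = 2),
  Overall = c(35, 16, 12, 8, 5,
              33, 15, 13, 7, 6),
  stringsAsFactors = FALSE
)

best4 <- top_vars(toy, 4)
print(best4)
```

With a real fit the analogous call would be top_vars(results$variables, 4). Whether that reproduces results$optVariables[1:4] exactly depends on which rows rfe passes to selectVar, so treat it as a cross-check rather than a replacement.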