


In my project, I use the rfe function from the caret package for recursive feature elimination. Here is a toy example that illustrates my question.


library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")

By default, the optimal subset is the one that gives the highest AUROC during resampling, and its variables can be retrieved with results$optVariables. However, what I want to do is use the "one standard error" rule to select fewer features (code below).

library(dplyr)

# AUROC threshold: one standard error below the highest AUROC
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]

# plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) + 
    geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE), 
                  width=.2, alpha=0.4, linetype=1) +
    geom_line() + 
    geom_hline(yintercept = ROC.1se)+
    labs(x ="Number of Variables", y = "AUROC")
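
The same cut-off can also be applied programmatically instead of being read off the plot. The following is only a minimal base-R sketch: the helper name smallest_within_1se and its denom argument are made up for illustration, and the ROC and ROCSD values are copied from the rfe summary printed in the answer (its set.seed(1) run), so the size it picks need not match the 4 reported for the run in this question.

```r
# ROC and ROCSD per subset size, copied from the rfe summary printed in the
# answer (set.seed(1) run). A different seed gives different numbers.
res <- data.frame(
  Variables = 1:8,
  ROC   = c(0.7250, 0.7842, 0.8004, 0.8139, 0.8164, 0.8263, 0.8314, 0.8316),
  ROCSD = c(0.07300, 0.04690, 0.02823, 0.03210, 0.02850, 0.03310, 0.03075, 0.02359)
)

# Hypothetical helper: smallest subset size whose mean ROC is at least
# (best mean ROC) - (standard error at the best size).
# `denom` is the value placed under the square root; the question uses 10 - 1.
smallest_within_1se <- function(res, denom) {
  se     <- res$ROCSD / sqrt(denom)
  best   <- which.max(res$ROC)
  cutoff <- res$ROC[best] - se[best]
  min(res$Variables[res$ROC >= cutoff])
}

size_1se <- smallest_within_1se(res, denom = 10 - 1)
size_1se
```

For this particular table the helper returns 6, because sizes 1 to 5 all fall below the cut-off of roughly 0.8237.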


The number of variables I identified is 4. Now I need to know which four variables they are. I tried the following:

results$variables %>% filter(Variables==4) %>% distinct(var)



Does anyone know how I can retrieve those variables? More generally, I would like to retrieve the selected variables for any given subset size.





If you know you want only the best 4 variables from the rfe resampling, this will give you what you are looking for.

results$optVariables[1:4]
# [1] "glucose"  "mass"     "age"      "pregnant"

dplyr Answer

results$variables %>%
   group_by(var) %>%
   summarize(Overall = mean(Overall)) %>%
   arrange(-Overall)
# A tibble: 8 x 2
#   var      Overall
#   <chr>      <dbl>
# 1 glucose    34.2 
# 2 mass       15.8 
# 3 age        12.7 
# 4 pregnant    7.92
# 5 pedigree    5.09
# 6 insulin     4.87
# 7 triceps     3.25
# 8 pressure    1.95


Your filter returns 40 rows: 10 folds times the 4 variables kept in each fold. The best 4 variables are not always the same in every fold, so distinct(var) can return more than 4 names. To get the top 4 variables across the resamples, you need to average each variable's importance (Overall) across the folds, as the dplyr code above does. Even simpler, the variables in optVariables are already sorted in this order, so you can just grab the first 4 (as in my one-line answer). Confirming this takes a bit of digging into the source code (shown below).
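
To make the fold disagreement concrete, here is a small base-R illustration. The data frame is made up: its column names are meant to mirror the ones used in this thread (Overall, var, Variables), the Resample column name is an assumption, and the importance values are hypothetical.

```r
# Made-up importance records for two folds at subset size 2.
# Values are hypothetical; the Resample column name is an assumption.
toy <- data.frame(
  Overall   = c(30, 12, 28, 11),
  var       = c("glucose", "mass", "glucose", "age"),
  Variables = c(2, 2, 2, 2),
  Resample  = c("Fold01", "Fold01", "Fold02", "Fold02"),
  stringsAsFactors = FALSE
)

# One row per fold per retained variable: 2 folds x 2 variables = 4 rows
rows_at_size_2 <- toy[toy$Variables == 2, ]
nrow(rows_at_size_2)

# The folds disagree on the second variable, so the union has 3 names, not 2
vars_at_size_2 <- unique(rows_at_size_2$var)
vars_at_size_2
```

The same effect explains why filtering on Variables == 4 and taking the distinct names is not guaranteed to return exactly four variables.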


A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods will exist that will give you very helpful information. For example...

# Run rfe with a random seed
library(dplyr)
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
set.seed(1)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
               rfeControl=control, metric="ROC")
# The next two lines are identical...
results
print(results)
# Recursive feature selection
# Outer resampling method: Cross-Validated (10 fold)
# Resampling performance over subset size:
# Variables    ROC  Sens   Spec   ROCSD  SensSD  SpecSD Selected
#          1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322         
#          2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177         
#          3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456         
#          4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727         
#          5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288         
#          6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959         
#          7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232         
#          8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316        *
# The top 5 variables (out of 8):
#    glucose, mass, age, pregnant, pedigree


Hmm, that gives 5 variables, but you said you wanted 4. We can pretty quickly dig into the source code to explore how it is calculating and returning those 5 variables as the top 5 variables.

print(caret:::print.rfe)
# Only a snippet of the code is shown below...
#    cat("The top ", min(top, x$bestSubset), " variables (out of ", 
#        x$bestSubset, "):\n   ", paste(x$optVariables[1:min(top, 
#            x$bestSubset)], collapse = ", "), "\n\n", sep = "")

So, basically it is pulling the top 5 variables directly from results$optVariables. How is that getting populated?

print(caret:::rfe.default)
# Snippet 1 of code...
#    bestVar <- rfeControl$functions$selectVar(selectedVars, 
# Snippet 2 of code...
#        bestSubset = bestSubset, fit = fit, optVariables = bestVar,

print(caret::rfeControl)
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions, 

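From the above, we see that selectVar comes from the functions list given to rfeControl, with caretFuncs as the fallback when none is supplied. The example passes rfFuncs, so it is that list's selectVar function that populates optVariables...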

Details: Source code that is populating optVariables

# function (y, size)
# {
#    finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall, 
#        na.rm = TRUE))
#    names(finalImp)[2] <- "Overall"
#    finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE), 
#        ]
#    as.character(finalImp$var[1:size])
# }
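
The logic of the function printed above can be restated without plyr. What follows is a hedged base-R sketch, not caret's own code: select_top_vars is a hypothetical name, and the toy importance values are made up. It averages Overall per variable, sorts in decreasing order, and keeps the first size names, so it works for any subset size.

```r
# Hypothetical base-R counterpart of the function printed above:
# average Overall per variable, sort decreasing, keep the first `size` names.
select_top_vars <- function(y, size) {
  mean_imp <- tapply(y$Overall, y$var, mean, na.rm = TRUE)
  mean_imp <- mean_imp[order(mean_imp, decreasing = TRUE)]
  names(mean_imp)[seq_len(size)]
}

# Made-up importance records (values are illustrative only)
toy <- data.frame(
  Overall = c(30, 12, 5, 28, 11, 9),
  var     = c("glucose", "mass", "age", "glucose", "age", "mass"),
  stringsAsFactors = FALSE
)

# Mean importance: glucose 29, mass 10.5, age 8
top1 <- select_top_vars(toy, 1)   # "glucose"
top2 <- select_top_vars(toy, 2)   # "glucose" "mass"
top2
```

Assuming results$variables has the Overall and var columns used elsewhere in this answer, calling select_top_vars(results$variables, 4) should give the same names as results$optVariables[1:4].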


