Question
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
set.seed(123)  # set the seed before sampling so the example data is reproducible
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)),
                      c = factor(sample(c("apple", "orange"), 100, replace = TRUE)))

partition <- function(data, train.size = .7) {
  # sample row positions instead of relying on row names being 1..n
  train.index <- sample(seq_len(nrow(data)), round(train.size * nrow(data)))
  train <- data[train.index, ]
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}

part.data <- partition(test.df)
# some levels of 'b' can end up with zero observations in the training set
table(part.data$train[, 'b'])
table(part.data$test[, 'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor = NULL, train.size = .7) {
  # check that caret is installed instead of requiring it to be attached
  if (!requireNamespace("caret", quietly = TRUE)) {
    stop("Install the 'caret' package")
  }
  # stratify on the supplied factor; otherwise fall back to row positions
  y <- if (is.null(factor)) seq_len(nrow(data)) else factor
  train.index <- caret::createDataPartition(y, times = 1, p = train.size,
                                            list = FALSE)
  train <- data[train.index, ]
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
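A quick way to check whether a split like this keeps every level in the training set is to compare the levels present before and after. The sketch below calls createDataPartition() directly on a made-up data frame (the names df and missing.levels are illustrative, not from the post) and assumes the caret package is installed:

```r
library(caret)

set.seed(123)
# illustrative data: one factor column with up to 26 levels
df <- data.frame(b = factor(sample(letters, 100, replace = TRUE)))

# stratify the split on the factor itself
train.index <- createDataPartition(df$b, times = 1, p = 0.7, list = FALSE)
train <- df[train.index, , drop = FALSE]

# levels that occur in the full data but not in the training set
missing.levels <- setdiff(unique(as.character(df$b)),
                          unique(as.character(train$b)))
length(missing.levels)
```

If missing.levels is empty, every level that occurs in the data has at least one training observation.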
Recommended answer
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need. The package is available on CRAN; the homepage is here:
The function I mentioned is partly code I found online a while back, which I then modified slightly to better handle edge cases (such as asking for a sample size larger than the set, or a subset).
stratified <- function(df, group, size) {
  #  USE: * Specify your data frame and grouping variable (as column
  #         number) as the first two arguments.
  #       * Decide on your sample size. For a sample proportional to the
  #         population, enter "size" as a decimal. For an equal number
  #         of samples from each group, enter "size" as a whole number.
  #
  #  Example 1: Sample 10% of each group from a data frame named "z",
  #             where the grouping variable is the fourth variable, use:
  #
  #                 > stratified(z, 4, .1)
  #
  #  Example 2: Sample 5 observations from each group from a data frame
  #             named "z"; grouping variable is the third variable:
  #
  #                 > stratified(z, 3, 5)
  #
  library(sampling)  # provides strata() and getdata()
  temp <- df[order(df[[group]]), ]
  colsToReturn <- ncol(df)

  # Don't attempt to sample more than the smallest group holds
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    # proportional sampling: a fraction of each group, rounded up
    size <- ceiling(table(temp[group]) * size)
  } else {
    # equal sampling: the same count from every group
    size <- rep(size, times = length(table(temp[group])))
  }
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  dsample <- getdata(temp, strat)
  dsample <- dsample[order(dsample[[1]]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)
}
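If you would rather not depend on a package, the same idea can be sketched in base R: split the row positions by factor level, then sample within each level, always taking at least one row. The helper name stratified_partition and its arguments are hypothetical, not part of caret or sampling:

```r
# Hypothetical helper: stratified train/test split that keeps at least
# one observation from every observed level of `group` in the training set.
stratified_partition <- function(data, group, train.size = 0.7) {
  # row positions grouped by level; drop = TRUE skips unused levels
  idx.by.level <- split(seq_len(nrow(data)), data[[group]], drop = TRUE)
  train.index <- unlist(lapply(idx.by.level, function(idx) {
    n.train <- max(1, round(train.size * length(idx)))
    # sample positions within idx, which avoids sample()'s special
    # handling of a length-one numeric vector
    idx[sample.int(length(idx), n.train)]
  }), use.names = FALSE)
  list(train = data[train.index, , drop = FALSE],
       test = data[-train.index, , drop = FALSE])
}

set.seed(123)
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)))
part.data <- stratified_partition(test.df, "b")

# levels present in the data but absent from the training set
missing.levels <- setdiff(unique(as.character(test.df$b)),
                          unique(as.character(part.data$train$b)))
```

Note that a level with a single observation goes entirely to the training set, so the test set may still lack that level. That is usually the safer direction, since the model has then seen every level it might be asked to predict.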