Problem description
I have a data set with two outcome variables, case1 and case2. case1 has 4 levels, while case2 has 50 (the number of levels in case2 could increase later). I would like to create a train/test partition that preserves the class ratios of both outcomes. The real data is imbalanced for both case1 and case2. As an example:
library(caret)
set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
201
split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
220
If I split separately for each outcome, the resulting partitions have different lengths. If I split once based on case2 (the outcome with more classes), I lose the class ratios for case1.
I will predict the two outcomes separately, but the final accuracy is based on an exact match for both, e.g. ix = which(pred1 == case1 & pred2 == case2), so the arrays need to be the same size.
Is there a smart way to do this?
Thanks!
Recommended answer
If I understand correctly (which I do not guarantee), I can offer the following approach:
Group by case1 and case2 and get the group indices:
library(tidyverse)
df %>%
select(case1, case2) %>%
group_by(case1, case2) %>%
group_indices() -> indeces
Use these indices as the outcome variable in createDataPartition:
split1 <- createDataPartition(as.factor(indeces), p=0.2)[[1]]
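The same joint label can also be built without dplyr. The following is a minimal sketch, assuming base R's interaction() is an acceptable substitute for the group indices. The names toy, strata and split_joint are made up for illustration, and the toy data only mimics the shape of the question's data.

```r
library(caret)

set.seed(123)
# Toy data mirroring the question: two factor outcomes with 4 and 50 levels
toy <- data.frame(x = rnorm(1000))
toy$case1 <- as.factor(ceiling(runif(1000, 0, 4)))
toy$case2 <- as.factor(ceiling(runif(1000, 0, 50)))

# One stratum per observed (case1, case2) combination;
# drop = TRUE discards combinations that never occur in the data
strata <- interaction(toy$case1, toy$case2, drop = TRUE)

# Stratify on the joint label so one index vector serves both outcomes
split_joint <- createDataPartition(strata, p = 0.2, list = FALSE)[, 1]
test_joint  <- toy[split_joint, ]
train_joint <- toy[-split_joint, ]
```

Because train and test come from a single index vector, predictions for case1 and case2 are made on the same rows and can be compared element by element.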
Check whether the result is satisfactory:
table(df[split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1,21])
#output
1 2 3 4
71 70 71 67
table(df[-split1,21])
#output
1 2 3 4
176 193 174 178
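Note that the test set above holds 279 of the 1000 rows (71 + 70 + 71 + 67), closer to 28% than to the requested 20%. That is what one would expect if the sample size is rounded up within each stratum, which adds up when there are around 200 small strata. Below is a minimal base-R sketch for checking both the class proportions and the actual test fraction. The hand-rolled split is only a stand-in for createDataPartition (an assumption for illustration, not caret's code), and idx_test and prop_case1 are hypothetical names.

```r
set.seed(123)
n <- 1000
case1 <- factor(ceiling(runif(n, 0, 4)))
case2 <- factor(ceiling(runif(n, 0, 50)))

# Stand-in for the stratified split: take 20% of each joint stratum, rounded up
strata <- interaction(case1, case2, drop = TRUE)
idx_test <- unlist(lapply(split(seq_len(n), strata), function(idx) {
  k <- ceiling(length(idx) * 0.2)
  idx[sample.int(length(idx), k)]  # sample.int avoids sample()'s length-1 pitfall
}), use.names = FALSE)

# Class proportions of case1 in test vs. train, side by side
prop_case1 <- rbind(
  test  = prop.table(table(case1[idx_test])),
  train = prop.table(table(case1[-idx_test]))
)
print(round(prop_case1, 3))

# Actual share of rows that ended up in the test set
print(length(idx_test) / n)
```

If the inflated test fraction is a problem, one option is to merge very rare (case1, case2) combinations into a single stratum before splitting, at the cost of slightly looser ratios for those classes.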