Problem description
Gidday,
I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into training and test sets for a model while respecting a grouping criterion.
Imagine I have a data frame like this:
> test[1:20,]
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
What I'm trying to do is split this data frame into a training set and a test set using a defined splitting criterion. With the data above, I want the split to keep companies apart: all rows of a given company go into the same set, so data set 1 contains different companies than data set 2.
For a 90/10 split, an ideal result would look like this:
> data_90split
companycode year expenses
4 C2 1 14828.90603
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
17 C6 1 6282.48118
18 C6 2 7419.32720
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
> data_10split
companycode year expenses
20 C8 1 58332.34945
19 C7 1 644.90571
I hope I've made clear what I'm looking for. Thanks for your feedback.
Recommended answer
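# Sample companies, not rows, so that no company ends up in both sets
comps <- unique(as.character(df$companycode))  # also works when the column is not a factor
trn <- sample(comps, floor(length(comps) * 0.9))  # 90% of the companies
df.trn <- subset(df, companycode %in% trn)
df.tst <- subset(df, !(companycode %in% trn))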
This splits your data so that 90% of the companies are in the training set and the rest are in the test set.
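As a quick check, here is a minimal, self-contained sketch that applies this kind of company-level split to the 20 rows shown in the question. The seed is arbitrary, and `unique(as.character(...))` is used to collect the company codes on the assumption that `companycode` may be a character column rather than a factor.

```r
# Example data copied from the question (first 20 rows)
df <- data.frame(
  companycode = rep(c("C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"),
                    times = c(3, 1, 7, 4, 1, 2, 1, 1)),
  year = c(1:3, 1, 1:7, 1:4, 1, 1:2, 1, 1),
  expenses = c(8.47720, 8.45250, 8.46280, 14828.90603, 665.21565, 290.66596,
               865.56265, 6785.03586, 312.02617, 760.48740, 1155.76758,
               4565.78313, 3340.36540, 2656.73030, 1079.46098, 60.57039,
               6282.48118, 7419.32720, 644.90571, 58332.34945)
)

set.seed(42)  # arbitrary seed, only to make the example reproducible
comps  <- unique(as.character(df$companycode))
trn    <- sample(comps, floor(length(comps) * 0.9))
df.trn <- subset(df, companycode %in% trn)
df.tst <- subset(df, !(companycode %in% trn))

# No company appears in both sets
length(intersect(df.trn$companycode, df.tst$companycode))  # 0

# Share of rows that ended up in the training set (depends on the draw)
nrow(df.trn) / nrow(df)
```

With only 8 companies, 7 go to training and 1 to testing, so the row share depends entirely on which company is held out.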
This does not guarantee that 90% of your rows will be in the training set and 10% in the test set, because companies have different numbers of rows. The rigorous way to achieve this is left as an exercise for the reader. The non-rigorous way is to repeat the sampling until the row proportions are roughly correct.
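The non-rigorous approach could be sketched roughly as follows. The helper `split_by_group` and its parameters `target`, `tol` and `max_tries` are hypothetical names introduced here for illustration, not part of the original answer, and the toy data are made up.

```r
# Hypothetical helper (not from the original answer): resample whole groups
# until the share of training rows is within `tol` of `target`.
split_by_group <- function(df, group, target = 0.9, tol = 0.05, max_tries = 1000) {
  groups  <- unique(as.character(df[[group]]))
  n_train <- floor(length(groups) * target)
  for (i in seq_len(max_tries)) {
    trn      <- sample(groups, n_train)
    in_train <- df[[group]] %in% trn          # TRUE for rows of training groups
    if (abs(mean(in_train) - target) <= tol) {
      return(list(train = df[in_train, ], test = df[!in_train, ]))
    }
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

# Toy data (made up): 10 companies with between 1 and 7 rows each
set.seed(1)  # arbitrary seed for reproducibility
toy <- data.frame(
  companycode = rep(paste0("C", 1:10), times = c(3, 1, 7, 4, 1, 2, 1, 1, 5, 2))
)
toy$expenses <- runif(nrow(toy))

parts <- split_by_group(toy, "companycode")
nrow(parts$train) / nrow(toy)  # row share in training, within 0.05 of 0.9
```

The tolerance has to be loose enough for the data: with few, unevenly sized groups, an exact 90/10 row split may not exist at all, which is why the loop gives up after `max_tries` instead of running forever.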