Randomly splitting data by a criterion into training and testing data sets using R

Question

Gidday,

I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into training and testing sets for a model while respecting a grouping criterion.

Imagine I have a data frame like this:

> test[1:20,]
                companycode     year    expenses
    1                 C1          1     8.47720
    2                 C1          2     8.45250
    3                 C1          3     8.46280
    4                 C2          1 14828.90603
    5                 C3          1   665.21565
    6                 C3          2   290.66596
    7                 C3          3   865.56265
    8                 C3          4  6785.03586
    9                 C3          5   312.02617
    10                C3          6   760.48740
    11                C3          7  1155.76758
    12                C4          1  4565.78313
    13                C4          2  3340.36540
    14                C4          3  2656.73030
    15                C4          4  1079.46098
    16                C5          1    60.57039
    17                C6          1  6282.48118
    18                C6          2  7419.32720
    19                C7          1   644.90571
    20                C8          1 58332.34945

What I'm trying to do is to split this data frame into a training and a testing set using a defined splitting criterion. Using the provided data, I want to split the data in a way that the companies are not mixed up in both data frames. Data set 1 contains different companies than data set 2.

For a 90/10 split, an ideal result would look like this:

> data_90split
                companycode     year    expenses
    4                 C2          1 14828.90603
    12                C4          1  4565.78313
    13                C4          2  3340.36540
    14                C4          3  2656.73030
    15                C4          4  1079.46098
    16                C5          1    60.57039
    5                 C3          1   665.21565
    6                 C3          2   290.66596
    7                 C3          3   865.56265
    8                 C3          4  6785.03586
    9                 C3          5   312.02617
    10                C3          6   760.48740
    11                C3          7  1155.76758
    17                C6          1  6282.48118
    18                C6          2  7419.32720
    1                 C1          1     8.47720
    2                 C1          2     8.45250
    3                 C1          3     8.46280

> data_10split
                companycode     year    expenses
    20                C8          1 58332.34945
    19                C7          1   644.90571

I hope I have made clear what I'm looking for. Thanks for your feedback.

Recommended answer

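# Unique company codes; unlike levels(), this also works for character columns
comps <- unique(as.character(df$companycode))

# Randomly pick 90% of the companies (rounded down) for the training set
trn <- sample(comps, floor(length(comps) * 0.9))

# All rows of a company go to exactly one of the two sets
df.trn <- subset(df, companycode %in% trn)
df.tst <- subset(df, !(companycode %in% trn))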

This splits your data so that 90% of companies are in the training set and the rest in the test set.
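To see what such a company-level split produces, here is a minimal sketch on toy data. The data frame only mirrors the number of rows per company from the question (expenses are left out), and the names `sizes`, `shared` and `row_share` are made up for this illustration. It checks that no company lands in both sets and reports the share of rows, as opposed to companies, in the training set.

```r
# Toy data: same number of rows per company as in the question (assumption:
# expenses are omitted because they do not matter for the split itself)
sizes <- c(C1 = 3, C2 = 1, C3 = 7, C4 = 4, C5 = 1, C6 = 2, C7 = 1, C8 = 1)
df <- data.frame(companycode = rep(names(sizes), times = sizes),
                 year = sequence(unname(sizes)))

set.seed(42)  # only to make the illustration reproducible

# Company-level split: sample companies, not rows
comps <- unique(as.character(df$companycode))
trn <- sample(comps, floor(length(comps) * 0.9))
df.trn <- df[df$companycode %in% trn, ]
df.tst <- df[!(df$companycode %in% trn), ]

# Companies present in both sets; should be empty
shared <- intersect(df.trn$companycode, df.tst$companycode)

# Share of rows (not companies) that ended up in the training set
row_share <- nrow(df.trn) / nrow(df)

print(shared)
print(row_share)
```

With 8 companies, `floor(8 * 0.9)` selects 7 for training, so exactly one company ends up in the test set, and the row share depends on how many rows that company has.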

This does not guarantee that 90% of your rows will be training and 10% test. The rigorous way to achieve this is left as an exercise for the reader. The non-rigorous way would be to repeat the sampling until you get proportions that are roughly correct.
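The non-rigorous approach could be sketched roughly as follows. This is not part of the original answer: `split_by_group()` and its arguments `tol` and `max_tries` are hypothetical names chosen for this example. The function keeps resampling companies until the share of training rows is within a tolerance of the target, and gives up after a fixed number of attempts.

```r
# Sketch only: split_by_group(), tol and max_tries are hypothetical names
split_by_group <- function(df, group_col, train_frac = 0.9,
                           tol = 0.05, max_tries = 1000) {
  groups <- unique(as.character(df[[group_col]]))
  n_pick <- max(1, floor(length(groups) * train_frac))
  for (i in seq_len(max_tries)) {
    trn <- sample(groups, n_pick)
    in_trn <- df[[group_col]] %in% trn
    # Accept the draw once the share of training rows is close enough
    if (abs(mean(in_trn) - train_frac) <= tol) {
      return(list(train = df[in_trn, , drop = FALSE],
                  test  = df[!in_trn, , drop = FALSE]))
    }
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

# Usage on toy data with the question's rows per company (expenses omitted)
sizes <- c(C1 = 3, C2 = 1, C3 = 7, C4 = 4, C5 = 1, C6 = 2, C7 = 1, C8 = 1)
df <- data.frame(companycode = rep(names(sizes), times = sizes),
                 year = sequence(unname(sizes)))

set.seed(1)  # only for reproducibility of the illustration
parts <- split_by_group(df, "companycode", train_frac = 0.9, tol = 0.06)
print(nrow(parts$train) / nrow(df))
```

Because the number of sampled companies stays fixed, only the choice of companies varies between attempts. With very uneven group sizes there may be no draw inside the tolerance, which is why the sketch stops with an error instead of looping forever.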
