Reproducibly splitting data into training and testing in R


Problem description

A common way to sample or split data in R is to apply sample() to row numbers. For example:

library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6-1))  # some made-up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
# row numbers assigned to the test set (drawn from 1..N-1 so they stay valid after one row is dropped)
test <- sample(N-1, N/2, replace = FALSE)
test1 <- sample1[test, .(id)]

The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[-sample(N, 1)]

Samples 1 and 2 are still all but identical:

nrow(merge(sample1, sample2))

[1] 9999

Yet the same row-based split yields very different test sets, even though we've set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653
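To see why the overlap collapses, a toy base-R sketch may help (the ids and the dropped position are made up for illustration). Once one row is removed, every later row number points at the next id:

```r
set.seed(1)

ids  <- sprintf("id%02d", 1:10)   # ten made-up ids
rows <- sample(9, 5)              # row numbers chosen for the test set
ids2 <- ids[-3]                   # drop the third observation

before <- ids[rows]               # test ids in the original data
after  <- ids2[rows]              # test ids after the drop, same row numbers

shifted <- rows >= 3              # rows at or past the dropped position
data.frame(rows, before, after, shifted)
```

Rows before the dropped position keep their id; every row at or after it silently picks up its neighbour's id, so the two test sets diverge.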

One could sample specific IDs instead, but that would not be robust when observations are omitted or added.

How can the split be made more robust to changes in the data? Specifically: keep the test assignment unchanged for unchanged observations, assign nothing for dropped observations, and assign new observations afresh.

Recommended answer

Use a hash function and split on the first hex digit of each id's hash, taken modulo m:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs:
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
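One caveat on the m argument, sketched below under the assumption that the first hex digit of an md5 hash is uniformly distributed over 0 to 15: only divisors of 16 give a remainder-0 share of exactly 1/m. Here share_in_test is a hypothetical helper for illustration, not part of the answer.

```r
# Assumption: the first hex digit is uniform over the 16 values 0..15.
# Share of those values whose remainder mod m equals 0.
share_in_test <- function(m) mean((0:15) %% m == 0L)

m_values <- c(2, 3, 4, 5, 8, 16)
data.frame(m = m_values, share = sapply(m_values, share_in_test))
```

Under that assumption, m = 3 sends 6 of the 16 digits (37.5%) to the remainder-0 group rather than one third, so a single hex digit suits splits such as 1/2, 1/4 or 1/8 best.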

Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation's id, not by its relative position in the data:

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

The test set does not contain exactly 5000 rows because each id is assigned independently by its hash rather than to fill a fixed quota. This shouldn't be a problem in large samples, thanks to the law of large numbers.
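For split proportions that a single hex digit cannot express, one possible variation is to turn several hex digits into a number in [0, 1) and compare it with a threshold. This is a minimal sketch and not the answer's own method: hash_unif, the 7-digit prefix, and the 20% threshold are assumptions chosen for illustration, and it relies on the openssl package being installed.

```r
library(openssl)

# Hypothetical helper: map each id to a pseudo-uniform number in [0, 1)
# using the first 7 hex digits of its md5 hash. Seven digits are 28 bits,
# which fits in R's 32-bit integers.
hash_unif <- function(x) {
  hex <- substr(as.character(openssl::md5(as.character(x))), 1, 7)
  strtoi(hex, base = 16L) / 16^7
}

ids <- as.character(1e5:(1e5 + 9999))   # made-up ids
in_test <- hash_unif(ids) < 0.2          # roughly 20% go to the test set

# Dropping an id leaves the assignment of the remaining ids unchanged
ids_dropped <- ids[-1]
identical(in_test[-1], hash_unif(ids_dropped) < 0.2)

# Adding new ids leaves the assignment of the existing ids unchanged
ids_added <- c(ids, "new_id_1", "new_id_2")
identical(in_test, (hash_unif(ids_added) < 0.2)[seq_along(ids)])
```

Because the assignment depends only on each id, the same threshold can be reapplied whenever the data changes.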
