Problem description
A common way to sample or split data in R is to use sample(), e.g., on row numbers. For example:
library(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE) # row positions for the test set; N-1 keeps them valid after one row is dropped
test1 <- sample1[test, .(id)]
The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:
sample2 <- sample1[-sample(N, 1)]
Samples 1 and 2 are still all but identical:
nrow(merge(sample1, sample2))
[1] 9999
Yet the same row splitting yields very different test sets, even though we've set the seed:
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
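The mismatch comes from positional indexing: once a row is dropped, every later row moves up by one, so the same row numbers point at different ids. A toy illustration in base R (the vectors ids, ids2 and idx are made up for this example and are not part of the data above):

```r
ids  <- letters[1:10]   # toy ids, already sorted
idx  <- c(2, 5, 8)      # fixed row positions chosen for the "test" set
ids2 <- ids[-3]         # drop a single observation ("c")

ids[idx]                # "b" "e" "h"
ids2[idx]               # "b" "f" "i"

# Only positions before the dropped row still select the same id
intersect(ids[idx], ids2[idx])  # "b"
```

Every index after the dropped row picks up its neighbor instead, which is why roughly half of the test set changes when the dropped row falls near the middle of the data.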
One could sample specific IDs, but this would not be robust when observations are omitted or added.
How can the split be made more robust to changes in the data? Namely, the assignment to test should stay unchanged for unchanged observations, dropped observations should not be assigned, and new observations should get an assignment of their own.
Recommended answer
Use a hash function and split on the first hex digit of the hash, modulo m:
md5_bit_mod <- function(x, m = 2L) {
# Inputs:
# x: a character vector of ids
# m: the modulo divisor (modify for split proportions other than 50:50)
# Output: remainders from dividing the first digit of the md5 hash of x by m
as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
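A single hex digit takes only 16 values, so the remainder modulo m gives equal-sized buckets only when m divides 16 (2, 4, 8 or 16). For arbitrary proportions, one possible sketch is to map more hash digits to a number in [0, 1) and compare it with the desired test share. The helper md5_unit and the 20% threshold below are hypothetical and not part of the original answer; the sketch assumes the openssl package is installed.

```r
# Hypothetical helper, not from the original answer
md5_unit <- function(x) {
  # x: a character vector of ids
  # Returns a pseudo-uniform number in [0, 1) derived from the md5 hash of x
  h <- as.character(openssl::md5(x))
  # 7 hex digits = 28 bits, which fits safely in R's 32-bit integers
  strtoi(substr(h, 1, 7), base = 16L) / 16^7
}

ids <- as.character(1:1000)      # made-up ids
in_test <- md5_unit(ids) < 0.2   # aim for roughly 20% of ids in the test set
mean(in_test)                    # close to 0.2, but not exactly 0.2

# The assignment depends only on the id, so it is unchanged on any subset
identical(md5_unit(ids[-1]) < 0.2, in_test[-1])  # TRUE
```

As with the modulo version, the realized share only approximates the target, because each id is assigned independently of all the others.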
Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation's id, not by its relative position in the data:
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))
[1] 5057
nrow(test1a)
[1] 5057
The sample size is not exactly 5000 because assignment is probabilistic, but this shouldn't be a problem in large samples, thanks to the law of large numbers.
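To get a rough sense of how far from 5000 the test set can be expected to land, one can treat each assignment as an independent fair coin flip. This is an assumption about how the hash digits behave, not something the hash guarantees; under it the test-set size is binomial:

```r
N <- 1e4    # sample size used above
p <- 0.5    # assumed probability of landing in the test set when m = 2

expected <- N * p                   # 5000
sd_count <- sqrt(N * p * (1 - p))   # 50

# The observed test-set size of 5057 in standard deviations from the mean
(5057 - expected) / sd_count        # 1.14
```

Under that assumption, a deviation of 57 observations is about 1.14 standard deviations, well within ordinary sampling noise, and the relative deviation shrinks as N grows.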