Problem description
I am working with a gigantic person-period file, and I thought that a good way to deal with such a large dataset would be to use sampling and re-sampling techniques.
My person-period file looks like this:
   id code time
1   1    a    1
2   1    a    2
3   1    a    3
4   2    b    1
5   2    c    2
6   2    b    3
7   3    c    1
8   3    c    2
9   3    c    3
10  4    c    1
11  4    a    2
12  4    c    3
13  5    a    1
14  5    c    2
15  5    a    3
I actually have two different issues.
The first issue is that I am having trouble simply sampling a person-period file.
For example, I would like to sample 2 id-sequences, such as:
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code works for sampling a person-period file:
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like to use a dplyr solution because I am interested in resampling, and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)
I am struggling with the dplyr solution because I am not sure what the grouping variable should be.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and re-sample a person-period file?
Data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")
Recommended answer
I think the idiomatic way would probably look like this:
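library(dplyr)
set.seed(1)
# sample 2 distinct ids (dt is the data frame from the question), then pull in all of their rows
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt, by = "id")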
  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3
This extends straightforwardly to more grouping variables and fancier sampling rules.
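For a one-off sample, the same selection can also be written as a filter on a vector of sampled ids. This is only a minimal sketch: the toy dt below is an assumption that mirrors the question's data with plain integer and character columns instead of factors, and picked is a name made up here. Sampling from unique(dt$id) rather than dt$id avoids drawing the same id twice.

```r
library(dplyr)

# Toy data with the same shape as the question's dt (assumed, not the original object)
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

set.seed(1)
# Draw 2 distinct ids, then keep every row belonging to them
picked <- sample(unique(dt$id), 2)
one_sample <- dt %>% filter(id %in% picked)
one_sample
```

Because the filter keeps whole ids, each sampled person retains all three periods.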
If you need to do this many times...
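nrep = 100   # number of replicates
ng = 2       # number of ids drawn per replicate
# stack nrep copies of the distinct ids, label each copy with r, then sample ng ids within each copy
samps = dt %>% select(id) %>% distinct %>%
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
repdat = left_join(samps, dt, by = "id")
# then do stuff with it (do_stuff is a placeholder for your own per-replicate analysis):
repdat %>% group_by(r) %>% do_stuff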
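Since the question asks about replicate specifically, here is a rough sketch of how the single-sample idea could be wrapped for that. The helper name sample_sequences is made up for illustration, and the toy dt is an assumed stand-in for the question's data (integer and character columns instead of factors). It uses semi_join so that each replicate keeps only the rows of the sampled ids.

```r
library(dplyr)

# Toy data with the same shape as the question's dt (assumed, not the original object)
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

# Hypothetical helper: draw n distinct ids and return their full sequences
sample_sequences <- function(data, n) {
  ids <- data %>% distinct(id) %>% sample_n(n)
  semi_join(data, ids, by = "id")
}

set.seed(1)
# A list of 100 independent samples, each holding 2 complete id-sequences
reps <- replicate(100, sample_sequences(dt, 2), simplify = FALSE)

# Optionally stack them into one data frame, with r labelling the replicate
stacked <- bind_rows(reps, .id = "r")
```

The list form is convenient for lapply-style analysis, while the stacked form lines up with the group_by(r) pattern above.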