R - Sampling and resampling a person-period file

Question

I am working with a gigantic person-period file, and I thought that a good way to deal with such a large dataset would be to use sampling and resampling techniques.

My person-period file looks like this:

   id code time
1   1    a    1
2   1    a    2
3   1    a    3
4   2    b    1
5   2    c    2
6   2    b    3
7   3    c    1
8   3    c    2
9   3    c    3
10  4    c    1
11  4    a    2
12  4    c    3
13  5    a    1
14  5    c    2
15  5    a    3

I actually have two distinct issues.

The first issue is that I am having trouble simply sampling a person-period file.

For example, I would like to sample 2 id-sequences, such as:

  id code time
   1    a    1
   1    a    2
   1    a    3
   2    b    1
   2    c    2
   2    b    3

The following line of code works for sampling a person-period file:

dt[dt$id %in% sample(unique(dt$id), 2), ]  # sample from unique ids so two distinct ids are always drawn

However, I would like a dplyr solution because I am interested in resampling, and in particular I would like to use replicate.

I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE).

I am struggling with the dplyr solution because I am not sure what the grouping variable should be.

library(dplyr)
dt %>% group_by(id) %>% sample_n(1)

gives me an incorrect result because it does not keep the full sequence of each id.

Any clue how I could both sample and resample a person-period file?

Data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")

Recommended answer

I think the idiomatic way would probably look like this:

library(dplyr)
set.seed(1)
# draw 2 distinct ids, then join back to recover each id's full sequence
samp = dt %>% distinct(id) %>% sample_n(2)
left_join(samp, dt, by = "id")

  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3

This extends straightforwardly to more grouping variables and fancier sampling rules.
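As one illustration of a fancier sampling rule, here is a minimal sketch of a bootstrap-style draw, where ids are sampled with replacement and each draw keeps its full sequence. It is not part of the original answer. The names `boot_ids`, `boot` and `draw` are made up for this example, and the `dt` built here is a compact stand-in for the data frame given in the question.

```r
library(dplyr)

# Compact stand-in for the question's `dt` (same ids, codes and times)
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

set.seed(1)
ids <- unique(dt$id)

# Sample as many ids as there are, with replacement (an id may appear twice)
boot_ids <- sample(ids, length(ids), replace = TRUE)

# Pull the full sequence for each draw; `draw` tells repeated ids apart
boot <- bind_rows(lapply(seq_along(boot_ids), function(k) {
  dt %>% filter(id == boot_ids[k]) %>% mutate(draw = k)
}))
```

Because the same id can be drawn more than once, grouping later by `draw` rather than by `id` keeps the repeated sequences separate.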

If you need to do this many times...

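nrep = 100  # number of resamples
ng   = 2    # number of ids drawn per resample
# stack nrep copies of the distinct ids, label each copy with r,
# then draw ng ids within each copy
samps = dt %>% distinct(id) %>%
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
# join back to dt to recover the full sequence of every sampled id
repdat = left_join(samps, dt, by = "id")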

# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff
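Since the question asks specifically about replicate, a list-based variant may also be useful. This is only a sketch under stated assumptions: `sample_ids` is a hypothetical helper written for this example (it is not a dplyr function), `share_a` is an arbitrary summary chosen for illustration, and `dt` is rebuilt in compact form so the block runs on its own.

```r
library(dplyr)

# Compact stand-in for the question's `dt` (same ids, codes and times)
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

# Hypothetical helper: draw n_ids distinct ids and keep all of their rows
sample_ids <- function(data, n_ids) {
  keep <- sample(unique(data$id), n_ids)
  data %>% filter(id %in% keep)
}

set.seed(1)
# A list of 100 resamples, each holding the full sequences of 2 ids
reps <- replicate(100, sample_ids(dt, 2), simplify = FALSE)

# Example per-resample summary: share of rows whose code is "a"
share_a <- sapply(reps, function(d) mean(d$code == "a"))
```

The list form holds every resample as its own data frame, which is convenient for small files. For a very large file, the single stacked table built with the join above is likely to use less memory.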
