Problem description
I previously took a random sample of postcodes from my data frame and then realised that I wasn't sampling across all higher-level statistical units. I have around 1 million postcodes and 7,000 middle output statistical units. I want the sample to have roughly the same number of postcodes from each statistical unit.
How do I randomly sample 35 postcodes from each higher-level statistical unit?
I previously used the following code to randomly sample 250,000 postcodes:
total.sample <- total[sample(1:nrow(total), 250000,
replace=FALSE),]
How do I specify a random sample quota of postcodes based on another column variable, such as the higher-level statistical unit (see msoa.rank in the data frame structure below)?
Data frame structure:
'data.frame': 1096289 obs. of 25 variables:
$ pcd : Factor w/ 986055 levels "AL100AB","AL100AD",..: 282268 282258
$ mbps2 : int 0 1 0 0 0 1 0 0 0 0 ...
$ averagesp : num 16 7.8 7.8 9.5 9.4 3.2 11.1 19.4 10.5 11.8 ...
$ mediansp : num 18.2 8 7.8 8.1 8.5 3.2 8.1 18.7 9.7 8.9 ...
$ nga : int 0 0 0 0 0 0 0 0 0 0 ...
$ x : int 533432 532192 533416 533223 532866 531394 532899 532744
$ total.dps : int 11 91 10 7 9 10 3 5 21 12 ...
$ connections.density: num 7.909 0.747 3.1 7.714 1.889 ...
$ urban : int 1 1 1 1 1 1 1 1 1 1 ...
$ gross.pay : num 36607 36607 36607 36607 36607 ...
$ p.tert : num 98.8 98.8 98.8 98.8 98.8 ...
$ p.kibs : num 70.3 70.3 70.3 70.3 70.3 ...
$ density : num 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 ...
$ p_m_s : num 93.5 93.5 93.5 93.5 93.5 ...
$ p_m_l : num 6.52 6.52 6.52 6.52 6.52 ...
$ p.edu : num 62.6 62.6 62.6 62.6 62.6 ...
$ p.claim : num 1.58 1.58 1.58 1.58 1.58 ...
$ p.non.white : num 21.4 21.4 21.4 21.4 21.4 21.4 21.4 21.4 21.4 21.4 ...
$ msoa.rank : int 2 2 2 2 2 2 2 2 2 2 ...
$ oslaua.rank : int 321 321 321 321 321 321 321 321 321 321 ...
$ nuts2.rank : int 22 22 22 22 22 22 22 22 22 22 ...
$ gor.rank : int 8 8 8 8 8 8 8 8 8 8 ...
$ cons : int 1 1 1 1 1 1 1 1 1 1 ...
pcd = postcode
msoa.rank = the ordinal variable identifying each middle output statistical unit
Recommended answer
Does every msoa.rank have at least 35 postcodes? The code below handles either case, and it will be fast with data.table:
#Create a data.table object
require(data.table)
total <- data.table(total)
#Sample within each msoa.rank group (sample size is min(35, number of rows in the group))
total.sample <- total[ , .SD[sample(1:.N,min(35,.N))], by=msoa.rank]
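To see which groups fall short of the quota before relying on the sample, a small check along these lines may help. This is only a sketch on made-up data: `toy`, `quota`, `sizes` and `short` are hypothetical names standing in for the real `total` table, and the group sizes are invented for illustration.

```r
library(data.table)

# Toy stand-in for `total` (hypothetical data): three groups of 50, 40 and 5 rows
toy <- data.table(
  pcd = sprintf("PC%03d", 1:95),
  msoa.rank = rep(1:3, times = c(50, 40, 5))
)

quota <- 35

# Count postcodes per group and list the groups smaller than the quota
sizes <- toy[, .N, by = msoa.rank]
short <- sizes[N < quota]
print(short)

# Same sampling pattern as above: small groups contribute all of their rows
set.seed(1)
toy.sample <- toy[, .SD[sample(.N, min(quota, .N))], by = msoa.rank]
print(toy.sample[, .N, by = msoa.rank])
```

Groups listed in `short` will be under-represented relative to the others, so it is worth deciding whether that is acceptable for the analysis.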
So here is how the example would work using the classic iris dataset.
iris <- data.table(iris)
set.seed(2014)
iris.sample <- iris[ , .SD[sample(1:.N,min(10,.N))], by=Species]
summary(iris.sample$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.400 5.000 5.850 5.797 6.525 7.200
Here is another sample and its summary, to see the difference:
iris.sample2 <- iris[ , .SD[sample(1:.N,min(10,.N))], by=Species]
summary(iris.sample2$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.400 5.100 5.850 5.743 6.275 7.300
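The summaries differ between draws, but the per-group quota should not. A quick way to confirm that is to count rows per species and check that no row was drawn twice. The sketch below assumes a copy named `iris.dt` and a helper column `row.id`, both hypothetical names introduced only for this check.

```r
library(data.table)

# Work on a copy under a new name so the built-in iris is left untouched
iris.dt <- data.table(iris)
iris.dt[, row.id := .I]  # hypothetical helper column: original row number

set.seed(2014)
iris.check <- iris.dt[, .SD[sample(.N, min(10, .N))], by = Species]

# Rows per species: each of the three species should contribute 10 rows
print(iris.check[, .N, by = Species])

# 0 means no original row appears more than once in the sample
print(anyDuplicated(iris.check$row.id))
```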