Problem description
I am not a survey methodologist or demographer, but I am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency Department Sample (NEDS). As described by the Agency for Healthcare Research and Quality, it is "Discharge data for ED visits from 947 hospitals located in 30 States, approximating a 20-percent stratified sample of U.S. hospital-based EDs".
The full dataset from 2006 to 2012 consists of 198,102,435 observations. I've subsetted the data to 40,073,358 traumatic injury-related discharges with 66 variables. Running even simple survey procedures on these data takes an inordinately long time. I've tried throwing RAM at it (late 2013 Mac Pro, 3.7 GHz quad core, 128 GB (!) memory), using multicore when available, subsetting, and working with an out-of-memory DBMS like MonetDB. Design-based survey procedures still take hours, sometimes many hours; some modestly complex analyses take upwards of 15 hours. I am guessing that most of the computational effort is tied to what must be a humongous covariance matrix?
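(For reference, the multicore use mentioned above is just survey's own hook, roughly the sketch below; as far as I understand it, it only parallelises across the by-groups of an svyby() call, via forked worker processes on Mac/Linux, and does nothing for a single svymean(). The svydes and db_design objects it refers to are defined further down.)
# a sketch only: enable survey's multicore option, which forks worker processes
# and distributes the per-group computations of svyby() across cores
library(survey)
options(survey.multicore = TRUE)                                      # package-wide default
svyby(~age, ~yr, db_design, svymean, na.rm = TRUE, multicore = TRUE)  # or per call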
As one might expect, working with the raw data is orders of magnitude faster. More interestingly, depending on the procedure, with a data set this large the unadjusted estimates can be quite close to the survey results. (See examples below) The design-based results are clearly more precise and preferred, but several hours of computing time vs seconds is a not inconsiderable cost for that added precision. It begins to look like a very long walk around the block.
Is there anyone who's had experience with this? Are there ways to optimize R survey procedures for large data sets? Perhaps make better use of parallel processing? Are Bayesian approaches using INLA or Hamiltonian methods like Stan a possible solution? Or, are some unadjusted estimates, especially for relative measures, acceptable when the survey is large and representative enough?
Here are a couple of examples of unadjusted estimates approximating survey results.
In this first example, svymean in memory took a bit less than an hour, while out of memory it required well over 3 hours. The direct calculation took under a second. More importantly, the point estimates (34.75 for svymean and 34.77 unadjusted) as well as the standard errors (0.0039 and 0.0037) are quite close.
# 1a. svymean in memory
svydes <- svydesign(
    id = ~KEY_ED,
    strata = ~interaction(NEDS_STRATUM, YEAR),  # note YEAR interaction
    weights = ~DISCWT,
    nest = TRUE,
    data = inj
)
system.time(meanAGE<-svymean(~age, svydes, na.rm=T))
user system elapsed
3082.131 143.628 3208.822
> meanAGE
mean SE
age 34.746 0.0039
# 1b. svymean out of memory
db_design <-
    svydesign(
        weight = ~discwt,                          # weight variable column
        nest = TRUE,                               # whether or not psus are nested within strata
        strata = ~interaction(neds_stratum, yr),   # stratification variable column
        id = ~key_ed,
        data = "nedsinj0612",                      # table name within the monet database
        dbtype = "MonetDBLite",
        dbname = "~/HCUP/HCUP NEDS/monet"          # folder location
    )
system.time(meanAGE<-svymean(~age, db_design, na.rm=T))
user system elapsed
11749.302 549.609 12224.233
Warning message:
'isIdCurrent' is deprecated.
Use 'dbIsValid' instead.
See help("Deprecated")
mean SE
age 34.746 0.0039
# 1.c unadjusted mean and s.e.
system.time(print(mean(inj$AGE, na.rm=T)))
[1] 34.77108
user system elapsed
0.407 0.249 0.653
sterr <- function(x) sd(x, na.rm=T)/sqrt(length(x)) # write little function for s.e.
system.time(print(sterr(inj$AGE)))
[1] 0.003706483
user system elapsed
0.257 0.139 0.394
There is a similar correspondence between the results of svymean applied to subsets of the data with svyby (nearly 2 hours) and those of mean applied with tapply (about 4 seconds):
# 2.a svyby .. svymean
system.time(AGEbyYear<-svyby(~age, ~yr, db_design, svymean, na.rm=T, vartype = c( 'ci' , 'se' )))
user system elapsed
4600.050 376.661 6594.196
yr age se ci_l ci_u
2006 2006 33.83112 0.009939669 33.81163 33.85060
2007 2007 34.07261 0.010055909 34.05290 34.09232
2008 2008 34.57061 0.009968646 34.55107 34.59014
2009 2009 34.87537 0.010577461 34.85464 34.89610
2010 2010 35.31072 0.010465413 35.29021 35.33124
2011 2011 35.33135 0.010312395 35.31114 35.35157
2012 2012 35.30092 0.010313871 35.28071 35.32114
# 2.b tapply ... mean
system.time(print(tapply(inj$AGE, inj$YEAR, mean, na.rm=T)))
2006 2007 2008 2009 2010 2011 2012
33.86900 34.08656 34.60711 34.81538 35.27819 35.36932 35.38931
user system elapsed
3.388 1.166 4.529
system.time(print(tapply(inj$AGE, inj$YEAR, sterr)))
2006 2007 2008 2009 2010 2011 2012
0.009577755 0.009620235 0.009565588 0.009936695 0.009906659 0.010148218 0.009880995
user system elapsed
3.237 0.990 4.186
The correspondence between survey and unadjusted results starts to break down with absolute counts, which requires writing a small function that appeals to the survey object and uses a bit of Dr. Lumley's code to weight the counts:
# 3.a svytotal
system.time(print(svytotal(~adj_cost, svydes, na.rm=T)))
total SE
adj_cost 9.975e+10 26685092
user system elapsed
10005.837 610.701 10577.755
# 3.b "direct" calculation
SurvTot<-function(x){
N <- sum(1/svydes$prob)
m <- mean(x, na.rm = T)
total <- m * N
return(total)
}
> system.time(print(SurvTot(inj$adj_cost)))
[1] 1.18511e+11
user system elapsed
0.735 0.311 0.989
The results are much less acceptable, though still within the margin of error established by the survey procedure. But again, 3 hours vs. 1 second is an appreciable cost for the more precise results.
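A cheaper compromise for totals, if only the point estimate is needed, is to skip the survey machinery and take the design-weighted sum directly. A sketch (it assumes inj$DISCWT is the same weight column passed to svydesign above); this should reproduce the svytotal point estimate, though obviously not its standard error:
# sketch: Horvitz-Thompson-style point estimate of the total, no standard error;
# assumes inj$DISCWT is the discharge weight used in the svydesign() call
system.time(print(sum(inj$DISCWT * inj$adj_cost, na.rm = TRUE)))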
Update: 10 Feb 2016
Thanks Severin and Anthony for allowing me to borrow your synapses. Sorry for the delay in following up; it took a little time to try out both of your suggestions.
Severin, you are right in your observation that the Revolution Analytics/MOR build is faster for some operations. It looks like this has to do with the BLAS ("Basic Linear Algebra Subprograms") library shipped with CRAN R, which is more precise but slower. So I optimized the BLAS on my machine with the proprietary (but free with Macs) Apple Accelerate vecLib, which allows multithreading (see http://blog.quadrivio.com/2015/06/improved-r-performance-with-openblas.html). This seemed to shave some time off the operations, e.g. from 3 hours for a svyby/svymean to a bit over 2 hours.
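(For anyone repeating this, a quick way to confirm which BLAS R is actually using, and whether it multithreads, is a crude matrix benchmark; newer R builds also list the BLAS/LAPACK libraries in sessionInfo(). A sketch, with an arbitrary matrix size:)
# sketch: check the linked BLAS/LAPACK (reported by sessionInfo() in newer R)
# and time a large matrix product; a multithreaded BLAS should be several times faster
sessionInfo()
set.seed(1)
m <- matrix(rnorm(4000 * 4000), nrow = 4000)
system.time(crossprod(m))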
Anthony, I had less luck with the replicate weight design approach. type="bootstrap" with replicates=20 ran for about 39 hours before I quit out; type="BRR" returned the error "Can't split with odd numbers of PSUs in a stratum", and when I set the options to small="merge", large="merge", it ran for several hours before the OS heaved a huge sigh and ran out of application memory; type="JKn" returned the error "cannot allocate vector of size 11964693.8 Gb".
Again, many thanks for your suggestions. For now, I will resign myself to running these analyses piecemeal and over long periods of time. If I eventually come up with a better approach, I'll post it on SO.
Recommended answer
for huge data sets, linearized designs (svydesign) are much slower than replication designs (svrepdesign). review the weighting functions within survey::as.svrepdesign and use one of them to directly make a replication design. you cannot use linearization for this task, and you are likely better off not even using as.svrepdesign but instead using the functions within it.
for one example using cluster=, strata=, and fpc= directly in a replicate-weighted design, see
note you can also view minute-by-minute speed tests (with timestamps for each event) here http://monetdb.cwi.nl/testweb/web/eanthony/
also note that the replicates= argument is nearly 100% responsible for the speed at which the design will run. so perhaps make two designs, one for coefficients (with just a couple of replicates) and another for SEs (with as many as you can tolerate). run your coefficients interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight.
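a rough sketch of the two-design idea, re-using the svydes object from the question and going through as.svrepdesign for brevity (building the replicate weights directly, as suggested above, avoids some of its overhead); the replicate counts are illustrative only:
# sketch only -- replicate counts are illustrative, not recommendations
library(survey)
des_coef <- as.svrepdesign(svydes, type = "bootstrap", replicates = 2)    # cheap: point estimates only
des_se   <- as.svrepdesign(svydes, type = "bootstrap", replicates = 100)  # expensive: usable standard errors
svymean(~age, des_coef, na.rm = TRUE)   # run interactively during the day, ignore the SEs
svymean(~age, des_se, na.rm = TRUE)     # leave running overnight for defensible SEs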