This article walks through a way to speed up an interpolation exercise in R; it should be a useful reference for anyone facing a similar problem.

Problem description
I'm running about 45,000 local linear regressions (essentially) on about 1.2 million observations, so I'd appreciate some help trying to speed things up, because I'm impatient.

I'm basically trying to construct year-by-position wage contracts -- the function wage(experience given firm, year, position) -- for a bunch of firms.

Here's the data set (basic structure) I'm working with:

> wages
         firm year position exp salary
      1: 0007 1996        4   1  20029
      2: 0007 1996        4   1  23502
      3: 0007 1996        4   1  22105
      4: 0007 1996        4   2  23124
      5: 0007 1996        4   2  22700
     ---
1175141:  994 2012        5   2  47098
1175142:  994 2012        5   2  45488
1175143:  994 2012        5   2  47098
1175144:  994 2012        5   3  45488
1175145:  994 2012        5   3  47098

I want to construct the wage function for experience levels 0 through 40 for all firms, a la:

> salary_scales
        firm year position exp   salary
     1: 0007 1996        4   0       NA
     2: 0007 1996        4   1 21878.67
     3: 0007 1996        4   2 23401.33
     4: 0007 1996        4   3 23705.00
     5: 0007 1996        4   4 24260.00
    ---
611019: 9911 2015        4  36       NA
611020: 9911 2015        4  37       NA
611021: 9911 2015        4  38       NA
611022: 9911 2015        4  39       NA
611023: 9911 2015        4  40       NA

To that end, I've been working (at the suggestion of @BondedDust here) with the COBS (COnstrained B-Spline) package, which allows me to build in the monotonicity of the wage contract.

Some problems remain; in particular, when I need to extrapolate (whenever a given firm doesn't have any very young or very old employees), there's a tendency for the fit to lose monotonicity or to drop below 0.

To get around this, I've been using simple linear extrapolation outside the data bounds -- extend the fit curve outside min_exp and max_exp so that it passes through the two lowest (or highest) fit points -- not perfect, but it seems to be doing pretty well.
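(Editorial aside, not part of the original post: the extrapolation rule just described can be written as a small helper, and the post implements exactly the same arithmetic inline in cobs_extrap below. Here fit is the vector of COBS fitted values at experience levels min_exp through max_exp, and the function name linear_extend is arbitrary.)

# Linear extension through the two fitted points nearest each boundary
linear_extend <- function(e, fit, min_exp, max_exp) {
  n <- length(fit)
  if (e < min_exp) {
    fit[1] - (min_exp - e) * (fit[2] - fit[1])       # below the observed range
  } else if (e > max_exp) {
    fit[n] + (e - max_exp) * (fit[n] - fit[n - 1])   # above the observed range
  } else {
    fit[e - min_exp + 1]                             # inside: the COBS fit itself
  }
}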
With that in mind, here's how I'm doing this so far (keep in mind I'm a data.table fanatic):

#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]

cobs_extrap<-function(exp,salary,min_exp,max_exp,
                      constraint="increase",print.mesg=F,nknots=8,
                      keep.data=F,maxiter=150){
  #these are passed as vectors
  min_exp<-min_exp[1]
  max_exp<-min(max_exp[1],40)
  #get in-sample fit
  in_sample<-predict(cobs(x=exp,y=salary,
                          constraint=constraint,
                          print.mesg=print.mesg,nknots=nknots,
                          keep.data=keep.data,maxiter=maxiter),
                     z=min_exp:max_exp)[,"fit"]
  #append by linear extension below min_exp
  c(if (min_exp==1) NULL else in_sample[1]-
      (min_exp:1)*(in_sample[2]-in_sample[1]),
    in_sample,
    #append by linear extension above max_exp
    if (max_exp==40) NULL else in_sample[length(in_sample)]+(1:(40-max_exp))*
      (in_sample[length(in_sample)]-in_sample[length(in_sample)-1]))
}

salary_scales<-
  wages[node_count>=7&ind_count>=10
        &sal_scale_flag==0&sal_count_flag==0,
        .(exp=0:40,
          salary=cobs_extrap(exp,salary,min_exp,max_exp)),
        by=.(year,firm,position)]

Notice anything in particular that could be slowing down my code? Or am I forced to be patient?

To play around with, here are some of the smaller firm-position combos:

    firm year position exp salary count
 1: 0063 2010        5   2  37433    10
 2: 0063 2010        5   2  38749    10
 3: 0063 2010        5   4  38749    10
 4: 0063 2010        5   8  42700    10
 5: 0063 2010        5  11  47967    10
 6: 0063 2010        5  15  50637    10
 7: 0063 2010        5  19  51529    10
 8: 0063 2010        5  23  50637    10
 9: 0063 2010        5  33  52426    10
10: 0063 2010        5  37  52426    10
11: 9908 2006        4   1  26750    10
12: 9908 2006        4   6  36043    10
13: 9908 2006        4   7  20513    10
14: 9908 2006        4   8  45023    10
15: 9908 2006        4  13  33588    10
16: 9908 2006        4  15  46011    10
17: 9908 2006        4  15  37179    10
18: 9908 2006        4  22  43704    10
19: 9908 2006        4  28  56078    10
20: 9908 2006        4  29  44866    10

Solution

There are a lot of things in your code that could be improved, but let's focus on the main bottleneck here. The problem at hand can be considered an embarrassingly parallel problem. This means that your data can be split up into multiple smaller pieces that can each be computed on separate threads individually, without any extra overhead.

To see the parallelisation possibilities for the current problem, you should first note that you are performing the exact same calculations for each of the individual firms and/or years separately. You could, for example, split up the calculations into smaller subtasks for each individual year and then allocate these subtasks to different CPU/GPU cores. A significant performance gain can be obtained in this manner.

Finally, when the processing of the subtasks is done, the only thing you still need to do is merge the results.

However, R and all its internal libraries run as a single thread. You will have to explicitly split up your data and then assign the subtasks to different cores. In order to achieve this, there exist a number of R packages that support multithreading.
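To make that split-apply-combine idea concrete before the doParallel walkthrough that follows, here is a minimal sketch using only the base parallel package. This is an illustration, not the answer's code: it assumes wages (with the flag columns computed above) and cobs_extrap are defined in the global environment as in the question, and the names parts and salary_scales_par are arbitrary.

library(parallel)
library(data.table)

cl <- makeCluster(detectCores())                        # one worker per core
clusterEvalQ(cl, { library(data.table); library(cobs) })
clusterExport(cl, c("wages", "cobs_extrap"))

parts <- split(wages, by = "year")                      # one piece per year
salary_scales_par <- rbindlist(                         # merge the per-year results
  parLapply(cl, parts, function(dt)
    dt[node_count >= 7 & ind_count >= 10 &
         sal_scale_flag == 0 & sal_count_flag == 0,
       .(exp = 0:40, salary = cobs_extrap(exp, salary, min_exp, max_exp)),
       by = .(firm, year, position)])
)
stopCluster(cl)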
We will use the doParallel package in our example here. You did not provide an explicit dataset that is big enough to effectively test the performance, so we will first create some random data:

set.seed(42)
wages<-data.table(firm=substr(10001:10010,2,5)[sample(10,size=1e6,replace=T)],
                  year=round(runif(1e6,1996,2015)),
                  position=round(runif(1e6,4,5)),
                  exp=round(runif(1e6,1,40)),
                  salary=round(exp(rnorm(1e6,mean=10.682,sd=.286))))

> wages
         firm year position exp salary
      1: 0001 1996        4  14  66136
      2: 0001 1996        4   3  42123
      3: 0001 1996        4   9  46528
      4: 0001 1996        4  11  35195
      5: 0001 1996        4   2  43926
     ---
 999996: 0010 2015        5  11  43140
 999997: 0010 2015        5  23  64025
 999998: 0010 2015        5  31  35266
 999999: 0010 2015        5  11  36267
1000000: 0010 2015        5   7  44315

Now, let's run the first part of your code:

#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]

> wages
         firm year position exp salary min_exp max_exp node_count ind_count sal_scale_flag sal_count_flag
      1: 0001 1996        4  14  66136       1      40         40      1373          FALSE          FALSE
      2: 0001 1996        4   3  42123       1      40         40      1373          FALSE          FALSE
      3: 0001 1996        4   9  46528       1      40         40      1373          FALSE          FALSE
      4: 0001 1996        4  11  35195       1      40         40      1373          FALSE          FALSE
      5: 0001 1996        4   2  43926       1      40         40      1373          FALSE          FALSE
     ---
 999996: 0010 2015        5  11  43140       1      40         40      1326          FALSE          FALSE
 999997: 0010 2015        5  23  64025       1      40         40      1326          FALSE          FALSE
 999998: 0010 2015        5  31  35266       1      40         40      1326          FALSE          FALSE
 999999: 0010 2015        5  11  36267       1      40         40      1326          FALSE          FALSE
1000000: 0010 2015        5   7  44315       1      40         40      1326          FALSE          FALSE

We will now process the wages in a single-threaded manner, as you have done before. Note that we first save the original data so that we can perform multithreaded operations on it later and compare the results:

start <- Sys.time()
salary_scales_1 <-
  wages[node_count>=7&ind_count>=10
        &sal_scale_flag==0&sal_count_flag==0,
        .(exp=0:40,salary=cobs_extrap(exp,salary,min_exp,max_exp)),
        by=.(firm,year,position)]
print(paste("No Parallelisation time: ",Sys.time()-start))

> print(paste("No Parallelisation time: ",Sys.time()-start))
[1] "No Parallelisation time:  1.13971961339315"

> salary_scales_1
       firm year position exp   salary
    1: 0001 1996        4   0 43670.14
    2: 0001 1996        4   1 43674.00
    3: 0001 1996        4   2 43677.76
    4: 0001 1996        4   3 43681.43
    5: 0001 1996        4   4 43684.99
   ---
16396: 0010 2015        5  36 44464.02
16397: 0010 2015        5  37 44468.60
16398: 0010 2015        5  38 44471.35
16399: 0010 2015        5  39 44472.27
16400: 0010 2015        5  40 43077.70

It took about 1 minute and 8 seconds to process everything. Note that we only have 10 different firms in our dummy example; this is why the processing time is not that significant in comparison to your local results.
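A small aside that is not part of the original answer: Sys.time() - start lets R pick the time unit, which is why the single-threaded timing above prints 1.14 (minutes, i.e. roughly 1 minute 8 seconds) while the parallel timing below prints 23.4 (seconds). If you want to compare runs directly, forcing a unit avoids the ambiguity:

# Report the elapsed time in a fixed unit instead of whatever R chooses
elapsed <- difftime(Sys.time(), start, units = "secs")
print(paste("Elapsed:", round(as.numeric(elapsed), 1), "seconds"))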
Now, let's try to perform this task in a parallelised manner. As mentioned, for our example we would like to split up the data per year and assign the smaller subparts to separate cores. We will use the doParallel package for this purpose.

The first thing that we need to do is create a cluster with a particular number of cores; in our example we will try to use all the available cores. Next, we have to register the cluster and export some variables to the global environments of the subnodes. In this case the subnodes only need access to wages. Additionally, some of the dependent libraries need to be loaded on the nodes in order to make everything work; in this case the nodes need access to both the data.table and cobs libraries. The code looks like this:

library(doParallel)
start <- Sys.time()
cl <- makeCluster(detectCores());
registerDoParallel(cl);
# added for completeness: wages[.(i)] below is a keyed (binary) subset,
# so wages must be keyed by year before it is exported to the workers
setkey(wages,year)
clusterExport(cl,c("wages"),envir=environment());
clusterEvalQ(cl,library("data.table"));
clusterEvalQ(cl,library("cobs"));
salary_scales_2 <-
  foreach(i = 1996:2015) %dopar% {
    subSet <- wages[.(i)] # binary subsetting
    subSet[node_count>=7&ind_count>=10
           &sal_scale_flag==0&sal_count_flag==0,
           .(exp=0:40,
             salary=cobs_extrap(exp,salary,min_exp,max_exp)),
           by=.(firm,year,position)]
  }
stopCluster(cl)
print(paste("With parallelisation time: ",Sys.time()-start))

> print(paste("With parallelisation time: ",Sys.time()-start))
[1] "With parallelisation time:  23.4177722930908"

We now have a list of data tables, salary_scales_2, that contains the subresults for each individual year. Notice the speedup in processing time: this time it took only 23 seconds instead of the original 1.1 minutes (roughly a 65% improvement). The only thing that we still need to do now is merge the results. We can use do.call("rbind", salary_scales_2) to merge the rows of the tables together (this takes almost no time -- about .002 seconds on one run). Finally, we also perform a small check to verify that the multithreaded results are indeed identical to the results of the single-threaded run:

salary_scales_2<-do.call("rbind",salary_scales_2)
identical(salary_scales_1,salary_scales_2)

> identical(salary_scales_1,salary_scales_2)
[1] TRUE

REPLY TO COMMENT

It's a very interesting example indeed, but I think you might be missing the more important issue here. data.table does perform memory- and structure-related optimizations so that you can query and access your data more efficiently. However, in this example there is no major memory- or search-related bottleneck, especially not when you compare it with the total data-crunching time spent inside the cobs function. For example, the line that you changed, subSet <- wages[year==uniqueYears[i],], takes only about 0.04 seconds per call when you time it.

If you use a profiler on your runs, you will notice that it is not data.table or any of its operations or groupings that beg for parallelisation; it is the cobs function that takes up almost all of the processing time (and this function doesn't even use a data.table as input). What we are trying to do in the example is reassign the total workload of the cobs function to different cores in order to achieve the speedup. Our intention is not to split up the data.table operations, since they are not costly at all. However, we do have to split up the data.table, because we need to split the data for the separate cobs function runs. In fact, we have even taken advantage of the fact that data.table is efficient in all regards while splitting and merging the table(s); this took no additional time at all.
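If you want to check that claim on your own data, one way to do it (a sketch added here for illustration, not code from the original answer; the object name check and the file name interp_profile.out are arbitrary) is base R's sampling profiler, where cobs-related calls should account for nearly all of the sampled time:

Rprof("interp_profile.out")
check <- wages[node_count>=7&ind_count>=10
               &sal_scale_flag==0&sal_count_flag==0,
               .(exp=0:40,salary=cobs_extrap(exp,salary,min_exp,max_exp)),
               by=.(firm,year,position)]
Rprof(NULL)
# the functions with the largest "total" share should be cobs and its helpers
head(summaryRprof("interp_profile.out")$by.total)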
That concludes this article on speeding up an interpolation exercise; hopefully the answer above proves helpful.