Question
I have a large list (~30GB) and the following functions:
library(parallel)

cl <- makeCluster(24, outfile = "")

# Worker function passed directly, no extra arguments
Foo1 <- function(cl, largeList) {
  return(parLapply(cl, largeList, Bar1))
}
Bar1 <- function(listElement) {
  return(nrow(listElement))
}

# Worker function wrapped in an anonymous function defined inside Foo2
Foo2 <- function(cl, largeList, arg) {
  clusterExport(cl, list("arg"), envir = environment())
  return(parLapply(cl, largeList, function(x) Bar2(x, arg)))
}
Bar2 <- function(listElement, arg) {
  return(nrow(listElement))
}
This runs without problems:
Foo1(cl, largeList)
Watching the memory usage of each process, I can see that only one list element is copied to each node.
However, when calling:
Foo2(cl, largeList, 0)
a full copy of largeList is sent to every node. Stepping through Foo2, the copying of largeList does not happen at clusterExport, but rather at parLapply. Also, when I execute the body of Foo2 from the global environment (not within a function), there are no issues. What is causing this?
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] xts_0.9-7 zoo_1.7-12 snow_0.3-13
[4] Rcpp_0.12.2 randomForest_4.6-12 gbm_2.1.1
[7] lattice_0.20-33 survival_2.38-3 e1071_1.6-7
loaded via a namespace (and not attached):
[1] class_7.3-13 tools_3.2.2 grid_3.2.2
Recommended answer
The problem is that the worker function, which is the third argument to parLapply, is serialized and sent to each of the workers along with the input data. If the worker function is defined inside a function, such as Foo2, then the local environment is serialized along with it. Since largeList is an argument to Foo2, it is in the local environment, and is therefore serialized along with the worker function.
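The effect can be seen without a cluster by comparing the serialized size of two otherwise identical functions. This is only a rough sketch: make_worker and the 8 MB vector big are hypothetical stand-ins for Foo2 and largeList, and serialize() is used here as an approximation of what gets sent to a worker.

```r
# Hypothetical stand-in for Foo2: the returned closure's enclosing
# environment is the call frame of make_worker, which holds `big`.
make_worker <- function(big) {
  force(big)               # evaluate the argument, as parLapply does with X
  function(x) nrow(x)      # never uses `big`, but still carries it along
}

big <- numeric(1e6)        # about 8 MB of doubles, standing in for largeList

local_worker  <- make_worker(big)
global_worker <- function(x) nrow(x)   # defined at top level

# Size in bytes of what serialization would produce for each function
local_size  <- length(serialize(local_worker, NULL))
global_size <- length(serialize(global_worker, NULL))

cat("defined inside a function:", local_size, "bytes\n")
cat("defined at top level:     ", global_size, "bytes\n")
```

When run at top level, the first size should include the whole of big, while the second should be only a small fraction of it, even though both functions have the same body.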
You didn't have a problem with Foo1 because Bar was presumably created in the global environment, and the global environment is never serialized along with functions.
In other words, it's a good idea to always define the worker function in the global environment or in a package when using parLapply, clusterApply, clusterApplyLB, etc. Of course, if you're calling parLapply from the global environment, the anonymous function is defined in the global environment, which is why running the body of Foo2 at top level showed no problem.
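One way to follow this advice, sketched under some assumptions: keep the worker at top level and hand the extra argument to it through the `...` of parLapply, so no anonymous function is created inside Foo2 and clusterExport is not needed. The two-worker cluster and smallList below are illustrative stand-ins, not the original 24-node setup or data.

```r
library(parallel)

# Worker defined at top level: its enclosing environment is the global
# environment, which is not serialized along with the function.
Bar2 <- function(listElement, arg) {
  nrow(listElement)
}

# Rewritten Foo2 (a sketch): `arg` is forwarded via parLapply's `...`
# instead of being captured by a closure created inside this function.
Foo2 <- function(cl, largeList, arg) {
  parLapply(cl, largeList, Bar2, arg)
}

cl <- makeCluster(2)                                   # small demo cluster
smallList <- list(matrix(0, 3, 2), matrix(0, 5, 2))    # stand-in for largeList
res <- Foo2(cl, smallList, 0)
stopCluster(cl)

print(res)
```

With this layout each node should receive only its own chunk of the list plus the small Bar2 function and arg, as in the Foo1 case.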