Question
I want to use the parallel functionality of the plyr package within functions. I would have thought that the proper way to export objects created within the body of the function (in this example, the object is df_2) is as follows:
# rm(list=ls())
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test=function() {
df_1=data.frame(type=c("a","b"),x=1:2)
df_2=data.frame(type=c("a","b"),x=3:4)
#export df_2 via .paropts
ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export="df_2"),.fun=function(y) {
merge(y,df_2,all=FALSE,by="type")
})
}
plyr_test()
stopCluster(workers)
However, this throws an error:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
So I did some research and found that it works if I export df_2 manually:
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_2=function() {
df_1=data.frame(type=c("a","b"),x=1:2)
df_2=data.frame(type=c("a","b"),x=3:4)
#manually export df_2
clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
merge(y,df_2,all=FALSE,by="type")
})
}
plyr_test_2()
stopCluster(workers)
It gives the correct result:
type x.x x.y
1 a 1 3
2 b 2 4
But I have also found that the following code works:
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_3=function() {
df_1=data.frame(type=c("a","b"),x=1:2)
df_2=data.frame(type=c("a","b"),x=3:4)
#no export at all!
ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
merge(y,df_2,all=FALSE,by="type")
})
}
plyr_test_3()
stopCluster(workers)
plyr_test_3() also gives the correct result, and I don't understand why. I would have thought that I have to export df_2.
My question is: what is the right way to deal with parallel *ply within functions? Obviously, plyr_test() is incorrect. I have the feeling that the manual export in plyr_test_2() is useless, but I also think that plyr_test_3() is bad coding style. Could someone please elaborate on that? Thanks!
Recommended answer
The problem with plyr_test is that df_2 is defined in the local environment of plyr_test, which isn't accessible from the doParallel package, so it fails when it tries to export df_2. That is a scoping issue. plyr_test_2 avoids this problem because it doesn't use the .export option, but as you guessed, the call to clusterExport is not needed.
The reason that both plyr_test_2 and plyr_test_3 succeed is that df_2 is serialized along with the anonymous function that is passed to the ddply function via the .fun argument. In fact, both df_1 and df_2 are serialized along with the anonymous function, because that function is defined inside plyr_test_2 and plyr_test_3 and therefore carries their local environment with it. It's helpful that df_2 is included in this case, but the inclusion of df_1 is unnecessary and may hurt your performance.
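To see this capture without a cluster, here is a minimal base-R sketch. The names make_worker_fun, unused and needed are made up for illustration; the point is only that a function created inside another function keeps the whole enclosing environment, and that serialize() writes that environment out with it.

```r
# Hypothetical helper: returns a closure, like the anonymous .fun above.
make_worker_fun <- function() {
  unused <- numeric(1e5)  # never referenced by the closure, but still captured
  needed <- 1:3           # the only variable the closure actually uses
  function(y) y + needed
}

f <- make_worker_fun()

# Both local variables live in the closure's environment.
captured <- ls(environment(f))

# Serializing the closure also serializes its environment,
# so the payload includes the 1e5 doubles held in `unused`.
size_with_env <- length(serialize(f, NULL))

# For measurement only: with the global environment as its environment,
# the function serializes to a small payload (and can no longer see `needed`).
environment(f) <- globalenv()
size_without_env <- length(serialize(f, NULL))

print(captured)
print(c(with_env = size_with_env, without_env = size_without_env))
```

The size difference is roughly what every worker would receive in addition to the function itself.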
As long as df_2 is captured in the environment of the anonymous function, no other value of df_2 will ever be used, regardless of what you export. Unless you can prevent it from being captured, it is pointless to export it with either .export or clusterExport, because the captured value will be used. You can only get yourself into trouble (as you did with .export) by trying to export it to the workers.
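The precedence of the captured value follows from R's lexical scoping and can be sketched in base R without any workers. In this illustration, worker_env is a hypothetical stand-in for the environment on a worker that receives exported variables; it is not part of plyr, foreach or doParallel.

```r
# Hypothetical stand-in for a worker's environment holding an exported value.
worker_env <- new.env()
assign("df_2", "exported value", envir = worker_env)

make_fun <- function() {
  df_2 <- "captured value"  # local variable captured by the closure below
  function() df_2
}

f <- make_fun()

# The closure looks in its own enclosing environment first,
# so the captured value is used.
before <- f()

# Only after the capturing environment is replaced does lookup
# reach the value stored in worker_env.
environment(f) <- worker_env
after <- f()

print(c(before = before, after = after))
```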
Note that in this case, foreach does not auto-export df_2, because it isn't able to analyze the body of the anonymous function to see which symbols are referenced. If you call foreach directly without using an anonymous function, it will see the reference and auto-export it, making it unnecessary to export it explicitly using .export.
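As a rough sketch of what calling foreach directly could look like, assuming the foreach and doParallel packages are installed (foreach_test and the loop variable ty are names invented for this example):

```r
library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers)

# Hypothetical rewrite of the example using foreach directly.
foreach_test <- function() {
  df_1 <- data.frame(type = c("a", "b"), x = 1:2)
  df_2 <- data.frame(type = c("a", "b"), x = 3:4)
  # df_1 and df_2 are referenced directly in the loop body, so foreach can
  # see them and export them to the workers without .export.
  foreach(ty = as.character(unique(df_1$type)), .combine = rbind) %dopar% {
    merge(df_1[df_1$type == ty, ], df_2, all = FALSE, by = "type")
  }
}

res <- foreach_test()
stopCluster(workers)
print(res)
```

Here the loop body is an expression rather than a closure built inside ddply, which is what lets foreach inspect the referenced symbols.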
You could prevent the environment of plyr_test from being serialized along with the anonymous function by changing the function's environment before passing it to ddply. Because the function then no longer sees df_2 through its enclosing environment, df_2 has to be exported to the workers, here with clusterExport:
plyr_test=function() {
df_1=data.frame(type=c("a","b"),x=1:2)
df_2=data.frame(type=c("a","b"),x=3:4)
clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
fun=function(y) merge(y, df_2, all=FALSE, by="type")
environment(fun)=globalenv()
ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}
One of the advantages of the foreach package is that it doesn't encourage you to create a function inside another function, where it might accidentally capture a bunch of variables.
This issue suggests to me that foreach should include an option called .exportenv, similar to the envir option of clusterExport. That would be very helpful for plyr, since it would allow df_2 to be exported correctly using .export. However, that exported value still wouldn't be used unless the environment containing df_2 was removed from the .fun function.