问题描述
我对将Julia SharedArray
s用于科学计算项目感兴趣.我当前的实现对所有矩阵矢量运算都吸引BLAS,但我认为SharedArray
可能会在多核计算机上提供一些加速.我的想法是简单地按索引更新输出向量,将索引更新填充到工作进程中.
先前的讨论此处关于SharedArray
s和此处关于共享内存对象的内容均未提供明确的指导.从直观上看似乎很简单,但是经过测试,我对这种方法为何如此差的效果感到困惑(请参见下面的代码).对于初学者来说,@parallel for
似乎分配了很多内存.如果我在循环前面加上@sync
前缀(如果以后需要整个输出向量,这似乎是明智的选择),那么并行循环就会慢很多(尽管没有@sync
时,循环会非常快). /p>
我是否错误地解释了SharedArray
对象的正确用法?还是我没有有效地分配计算结果?
### test for speed gain w/ SharedArray vs. Array ###
# problem dimensions
n = 10000; p = 25000
# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)
# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)
# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true
# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# still get same result?
println("$(isequal(z,Z))") # should be true
具有4个工作人员+ 1个主节点的测试输出:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true
您遇到了几个问题,其中最重要的是Xt[:,i]
创建了一个新数组(分配内存).这是一个使您更接近所需内容的演示:
n = 10000; p = 25000
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)
Xt = X'
@everywhere function dotcol(a, B, j)
length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
s = 0.0
@inbounds @simd for i = 1:length(a)
s += a[i]*B[i,j]
end
s
end
function run1!(Z, Y, Xt)
for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
function runp!(Z, Y, Xt)
@sync @parallel for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc
结果(启动julia -p 8
时):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
为进行比较,在同一台计算机上运行时:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
因此,Julia的原始实现至少可以与BLAS竞争.
I am interested in using Julia SharedArray
s for a scientific computing project. My current implementation appeals to BLAS for all matrix-vector operations, but I thought that perhaps a SharedArray
would offer some speedup on multicore machines. My idea is to simply update an output vector index-by-index, farming the index updates to worker processes.
Previous discussions here about SharedArray
s and here about shared memory objects did not offer clear guidance on this issue. It seems intuitively simple enough, but after testing, I'm somewhat confused as to why this approach works so poorly (see code below). For starters, it seems like @parallel for
allocates a lot of memory. And if I prefix the loop with @sync
, which seems like a smart thing to do if the whole output vector is required later, then the parallel loop is substantially slower (though without @sync
, the loop is mighty quick).
Have I incorrectly interpreted the proper use of the SharedArray
object? Or perhaps did I inefficiently assign the calculations?
### test for speed gain w/ SharedArray vs. Array ###
# problem dimensions
n = 10000; p = 25000
# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)
# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)
# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true
# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# still get same result?
println("$(isequal(z,Z))") # should be true
Output from test with 4 workers + 1 master node:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true
You're running into several issues, of which the most important is that Xt[:,i]
creates a new array (allocating memory). Here's a demonstration that gets you closer to what you want:
n = 10000; p = 25000
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)
Xt = X'
@everywhere function dotcol(a, B, j)
length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
s = 0.0
@inbounds @simd for i = 1:length(a)
s += a[i]*B[i,j]
end
s
end
function run1!(Z, Y, Xt)
for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
function runp!(Z, Y, Xt)
@sync @parallel for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc
Results (when starting julia -p 8
):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
For comparison, when running on this same machine:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
So the raw Julia implementation is at least competitive with BLAS.
这篇关于BLAS v.Julia SharedArray对象的并行更新的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!