Problem description
For performance reasons,
I'm curious if there is a way to multiply a stack of stacks of matrices. I have a 4-D array (500, 201, 2, 2). It's basically a 500-long stack of (201, 2, 2) matrices where, for each of the 500, I want to multiply the adjacent matrices using einsum and get another (201, 2, 2) matrix.
I am only doing matrix multiplication on the [2x2] matrices at the end. Since my explanation is already heading off the rails, I'll just show what I'm doing now, and also the 'reduce' equivalent and why it's not helpful (because it's computationally the same speed). Preferably this would be a numpy one-liner, but I don't know what that is, or even if it's possible.
import numpy as np
from functools import reduce

Arr = np.random.rand(500, 201, 2, 2)

def loopMult(Arr):
    # sequentially multiply the 500 stacked (201, 2, 2) blocks
    ArrMult = Arr[0]
    for i in range(1, len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)

A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
print(np.all(A1 == A2))
print(A1.shape); print(A2.shape)
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
This returns:
True
(201, 2, 2)
(201, 2, 2)
10 loops, best of 3: 34.8 ms per loop
10 loops, best of 3: 35.2 ms per loop
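As a sanity check on the signature used above: `'fij,fjk->fik'` is just a batched 2x2 matrix product over the leading `f` axis, the same thing `np.matmul` (the `@` operator) does by broadcasting. A minimal sketch with hypothetical small arrays:

```python
import numpy as np

# Hypothetical arrays matching one (201, 2, 2) slice of the problem.
rng = np.random.default_rng(42)
A = rng.standard_normal((201, 2, 2))
B = rng.standard_normal((201, 2, 2))

C_einsum = np.einsum('fij,fjk->fik', A, B)
C_matmul = A @ B  # np.matmul broadcasts over the leading axis the same way

print(np.allclose(C_einsum, C_matmul))  # True
```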
Any help would be appreciated. Things are functional, but when I iterate this over a large set of parameters the code tends to take a long time, and I'm not sure whether there is a way to avoid the 500 iterations of the loop.
Accepted answer
I don't think it's possible to do this efficiently using numpy (the cumprod solution was elegant, though). This is the sort of situation where I would use f2py. It's the simplest way I know of to call a faster language, and it only requires a single extra file.
fortran.f90:
subroutine multimul(a, b)
  implicit none
  real(8), intent(in)  :: a(:,:,:,:)
  real(8), intent(out) :: b(size(a,1),size(a,2),size(a,3))
  real(8) :: work(size(a,1),size(a,2))
  integer :: i, j
  !$omp parallel do private(work, i, j)
  do i = 1, size(b,3)
     b(:,:,i) = a(:,:,i,size(a,4))
     do j = size(a,4)-1, 1, -1
        work = matmul(b(:,:,i), a(:,:,i,j))
        b(:,:,i) = work
     end do
  end do
end subroutine
Compile with f2py -c -m fmuls fortran.f90 (or F90FLAGS="-fopenmp" f2py -c -m fmuls fortran.f90 -lgomp to enable OpenMP acceleration); the -m name must match the module imported below. Then you would use it in your script as
import numpy as np
from functools import reduce
import fmuls  # the f2py-compiled module

Arr = np.random.standard_normal([500, 201, 2, 2])

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1, len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)

A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
A3 = fmuls.multimul(Arr.T).T
print(np.allclose(A1, A2))
print(np.allclose(A1, A3))
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
%timeit fmuls.multimul(Arr.T).T
which outputs
True
True
10 loops, best of 3: 48.4 ms per loop
10 loops, best of 3: 48.8 ms per loop
100 loops, best of 3: 5.82 ms per loop
So that's a factor-of-8 speedup. The reason for all the transposes is that f2py implicitly transposes all the arrays, and we need to transpose them manually to tell it that our Fortran code expects things to be transposed. This avoids a copy operation. The cost is that each of our 2x2 matrices is transposed, so to avoid performing the wrong operation we have to loop in reverse.
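The no-copy claim can be checked directly: `.T` reverses the axis order and returns a view, so a C-contiguous numpy array seen through `.T` is exactly the Fortran-contiguous layout the wrapped subroutine wants. A small sketch:

```python
import numpy as np

# .T reverses the axis order without copying: a C-contiguous array
# viewed through .T is Fortran-contiguous, which is the memory layout
# the f2py-wrapped Fortran code expects.
Arr = np.random.standard_normal((500, 201, 2, 2))
print(Arr.flags['C_CONTIGUOUS'])    # True
print(Arr.T.flags['F_CONTIGUOUS'])  # True: same buffer, no data copied
print(Arr.T.shape)                  # (2, 2, 201, 500)
```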
Speedups greater than a factor of 8 should be possible; I didn't spend any time trying to optimize this.
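One pure-NumPy avenue worth noting (my own sketch, not part of the original answer): because matrix multiplication is associative, the N sequential einsum calls can be replaced by a pairwise reduction that combines adjacent blocks, needing only about log2(N) Python-level iterations, each a single vectorized einsum. The function name `pairwise_mult` is hypothetical:

```python
import numpy as np
from functools import reduce

def pairwise_mult(arr):
    """Multiply an (N, f, 2, 2) stack of matrix blocks along axis 0,
    preserving left-to-right order, by repeatedly combining adjacent
    pairs: a length-N stack needs ~log2(N) einsum calls instead of N."""
    while len(arr) > 1:
        if len(arr) % 2:          # odd length: set the last block aside
            arr, tail = arr[:-1], arr[-1:]
        else:
            tail = arr[:0]        # empty tail
        # multiply block 2k by block 2k+1 in one vectorized einsum
        arr = np.einsum('nfij,nfjk->nfik', arr[0::2], arr[1::2])
        arr = np.concatenate([arr, tail])
    return arr[0]

# hypothetical small stack for a quick check against a plain left fold
rng = np.random.default_rng(0)
Arr = rng.standard_normal((10, 201, 2, 2))
print(np.allclose(reduce(np.matmul, Arr), pairwise_mult(Arr)))  # True
```

Whether this beats the plain loop in practice depends on how much of the cost is einsum overhead versus arithmetic, so it would need timing on the real (500, 201, 2, 2) workload.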