This article walks through the question "How to avoid huge overhead of single-threaded NumPy's transpose?" and its recommended answer; it may be a useful reference for anyone facing the same problem.

Problem description


I currently encounter huge overhead because of NumPy's transpose function. I have found that this function virtually always runs single-threaded, no matter how large the transposed matrix/array is. I probably need to avoid this huge time cost.

To my understanding, other functions like np.dot or vector increments run in parallel if the NumPy array is large enough. Some element-wise operations seem to be better parallelized by the numexpr package, but numexpr probably cannot handle a transpose.

I would like to learn the better way to resolve this problem. To state the problem in detail:

  • Sometimes NumPy runs the transpose ultrafast (e.g. B = A.T) because the transposed tensor is not actually used in a calculation, or is discarded, so there is no need to really transpose the data at that stage. When calling B[:] = A.T, the data really is transposed (see the snippet after this list).
  • I think a parallelized transpose function should be a solution. The problem is how to implement it.
  • I hope the solution does not require packages other than NumPy. A ctypes binding is acceptable, and I hope the code is not too difficult to use nor too complicated.
  • Tensor transpose would be a plus. Although techniques for transposing a matrix can also be applied to specific tensor-transpose problems, I think it could be difficult to write a universal API for tensor transpose. I actually also need to handle tensor transposes, but handling tensors could complicate this Stack Overflow question.
  • Is there any possibility of (or an existing plan for) a parallelized transpose being implemented in the future? Then there would be no need to implement the transpose myself ;)
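
To illustrate the first point, here is a minimal snippet showing the lazy view versus the eager copy (plain NumPy behaviour on a small illustrative matrix, not the question's data):

import numpy as np

A = np.random.random((1000, 1200))
B = A.T                               # just a view: no data is moved, essentially free
print(B.base is A)                    # True -> B shares A's memory
C = np.empty((A.shape[1], A.shape[0]))
C[:] = A.T                            # eager copy: the data is actually transposed here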

Thanks in advance for any suggestions!

I am handling a model transpose problem (the size of A is ~763 MB) on my personal Linux computer, with 4 cores available (400% CPU in total).

import numpy as np

n1, n2 = 9999, 10001            # matrix dimensions used throughout
A = np.random.random((n1, n2))
B = np.random.random((n2, n1))
D = np.random.random((n1, n2))

The current workarounds do not seem effective enough. In general, one would expect roughly a 3x~4x speedup if fully parallelized on a 4-core CPU, but the code I have written only gains about 1.5x.

B[:] = A.T   # ~255 ms, time for transpose
D[:] = A[:]  # ~ 65 ms, time for copying data
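
Timings like these can be reproduced with, for instance, the standard timeit module; a minimal sketch (the helper name and repeat count below are my own choices, not from the question):

import timeit

def avg_ms(stmt, number=5):
    # average wall time of `stmt` in milliseconds, using the globals defined above
    return timeit.timeit(stmt, number=number, globals=globals()) / number * 1e3

print(f"B[:] = A.T  -> {avg_ms('B[:] = A.T'):.0f} ms")
print(f"D[:] = A[:] -> {avg_ms('D[:] = A[:]'):.0f} ms")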

Interestingly, if A is a 10000 * 10000 square matrix, the transpose time increases to ~310 ms. I don't know what happens here. Even the time cost of the C/ctypes-bound function is affected (slower) if the matrix is square.

With the following OpenMP/BLAS code (basic, not optimized):

// transpose.c
// Build as a shared library, e.g. (assuming GCC and a CBLAS implementation;
// the link flag may differ, e.g. -lopenblas):
//   gcc -O3 -fopenmp -shared -fPIC transpose.c -o transpose.so -lcblas
#include <stddef.h>
#include <cblas.h>

// Transpose the d1 x d2 row-major matrix A into the d2 x d1 row-major matrix B.
void matrix_transpose(double* A, double* B, size_t d1, size_t d2) {
    size_t i;
    #pragma omp parallel for shared(A, B, d1, d2) private(i)
    for (i = 0; i < d1; ++i) {
        // copy row i of A (contiguous) into column i of B (stride d1)
        cblas_dcopy(d2, A + i*d2, 1, B + i, d1);
    }
}

then execute the Python code (4 cores, OpenMP-threaded):

import ctypes
from numpy.ctypeslib import as_ctypes

C = np.empty((n2, n1))          # output buffer that will hold A.T
matrix_transpose = np.ctypeslib.load_library("transpose.so", ".").matrix_transpose
matrix_transpose(
    as_ctypes(A), as_ctypes(C),
    ctypes.c_size_t(n1), ctypes.c_size_t(n2))  # ~145 ms
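
A quick check (my own addition, not part of the original timing) that the C routine produced the expected result:

assert np.allclose(C, A.T)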

Using a C/ctypes binding can be somewhat cumbersome, since it is not pure Python and also relies on CBLAS as an external package. Another attempt splits the rows across processes with multiprocessing and a shared ctypes array:

from multiprocessing import Pool, sharedctypes

nproc = 4
batch = n1 // nproc + 1  # batch 2500 in this case
slices = [slice(i * batch, min(n1, (i+1) * batch)) for i in range(nproc)]

cB = as_ctypes(B)
pB = sharedctypes.RawArray(cB._type_, cB)

def slice_transpose(s):
    # each worker writes its own column block of the shared buffer
    B = np.asarray(pB)
    B[:, s] = A[s].T

with Pool(nproc) as pool:
    pool.map(slice_transpose, slices)  # ~165 ms
B = np.asarray(pB)

I guess that for large clusters (HPC), more processes/threads do not necessarily yield a speedup. So how to set the number of processes/threads may also be a problem.

This problem is related not only to parallelization, but also to cache-aware and tile-based transpose algorithms. Possible solutions could be:

  • Use Numba code with a tile-based algorithm (as stated in Jérôme Richard's answer below). Although it requires a package other than NumPy, it is almost pure Python.
  • Use optimized BLAS libraries (such as MKL or cuBLAS, which implement their own matrix-transpose APIs) and link them to Python, instead of CBLAS or BLAS. Makefiles or a dynamically linked library need to be prepared if this program is to be distributed.
  • Use pyscf.lib.transpose (python link, c link) for the parallel 3-index tensor transpose M.transpose(0,2,1). I am somewhat a fan of pyscf. Its main purpose is quantum or semi-empirical chemistry calculations, but some of its numerical utilities are probably optimized for this purpose. In my testing on a server, transposing a (1502, 601, 601) tensor could be twice as fast (4.09 s) as MKL's mkl_domatcopy (9.19 s).

Recommended answer

Computing transpositions efficiently is hard. This primitive is not compute-bound but memory-bound. This is especially true for big matrices stored in RAM (rather than in CPU caches).

NumPy uses a view-based approach, which is great when only a slice of the array is needed, but quite slow when the computation has to be done eagerly (e.g. when a copy is performed). The way NumPy is implemented results in a lot of cache misses, strongly decreasing performance when a copy is performed in this case.

"I found this function virtually always runs single-threaded, no matter how large the transposed matrix/array is."

This clearly depends on the NumPy implementation used. AFAIK, some optimized packages, like Intel's, are more efficient and take advantage of multithreading more often.

"I think a parallelized transpose function should be a resolution. The problem is how to implement it."

Yes and no. Using more threads is not necessarily faster. At least not much faster, and not on all platforms. The algorithm used is far more important than using multiple threads.

On modern desktop x86-64 processors, each core can be bounded by its cache hierarchy, but this limit is quite high. Thus, two cores are often enough to nearly saturate the RAM throughput. For example, on my (4-core) machine, a sequential copy can reach 20.4 GiB/s (NumPy succeeds in reaching this limit), while my (practical) memory throughput is close to 35 GiB/s. Copying A takes 72 ms while the naive NumPy transposition of A takes 700 ms. Even using all my cores, a parallel implementation would not be faster than 175 ms, while the optimal theoretical time is 42 ms. Actually, a naive parallel implementation would be much slower than 175 ms because of cache misses and the saturation of my L3 cache.
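
As a rough sanity check on these figures (my own arithmetic, assuming the 9999 x 10001 float64 matrix from the question and that a copy or transpose must read A once and write the output once):

size_A  = 9999 * 10001 * 8 / 2**30   # ~0.745 GiB of float64 data
traffic = 2 * size_A                 # read A once + write the output once, ~1.49 GiB

print(traffic / 0.072)               # ~20.7 GiB/s -> matches the measured copy bandwidth
print(traffic / 35 * 1000)           # ~42.6 ms    -> the stated optimum at ~35 GiB/s RAM throughput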

Naive transposition implementations do not write/read data contiguously. The memory access pattern is strided and most of each cache line is wasted. Because of this, data are read/written from memory multiple times for huge matrices (typically 8 times on current x86-64 platforms using double precision). A tile-based transposition algorithm is an efficient way to prevent this issue. It also strongly reduces cache misses. Ideally, hierarchical tiles or the cache-oblivious Z-tiling pattern should be used, although this is hard to implement.

Here is a Numba-based implementation based on the previous information:

import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    for tmp in nb.prange((m+blockSize-1)//blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i+blockSize, m)
            tjMin, tjMax = j, min(j+blockSize, n)
            # copy one small tile at a time so reads and writes both stay cache-friendly
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
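
For reference, the function above expects a preallocated, C-contiguous output buffer; a minimal usage sketch (variable names are mine, not from the answer):

out = np.empty((A.shape[1], A.shape[0]))
transpose(A, out)              # out now holds A.T
assert np.allclose(out, A.T)   # sanity check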

If you want faster code, you can use highly optimized native libraries that implement the transposition, like the Intel MKL. Such libraries often take advantage of low-level processor-specific instructions (SIMD instructions and non-temporal stores) to use the caches/RAM more efficiently.
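
For instance, MKL provides the out-of-place transpose routine mkl_domatcopy (the same routine the question compares against above). A minimal ctypes sketch, assuming an MKL installation that exposes libmkl_rt.so on the loader path (the library name and wrapper below are my assumptions, not part of the answer):

import ctypes
import numpy as np

mkl = ctypes.CDLL("libmkl_rt.so")       # MKL single dynamic runtime library
mkl.mkl_domatcopy.restype = None
mkl.mkl_domatcopy.argtypes = [
    ctypes.c_char, ctypes.c_char,                       # ordering, operation
    ctypes.c_size_t, ctypes.c_size_t,                   # rows, cols of the source
    ctypes.c_double,                                    # alpha scaling factor
    ctypes.POINTER(ctypes.c_double), ctypes.c_size_t,   # source, leading dimension
    ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]   # destination, leading dimension

def mkl_transpose(A):
    # Out-of-place transpose of a C-contiguous float64 matrix via mkl_domatcopy.
    n, m = A.shape
    B = np.empty((m, n), dtype=np.float64)
    mkl.mkl_domatcopy(
        b'R', b'T', n, m, 1.0,
        A.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), m,
        B.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), n)
    return B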

Here are the timing results (assuming the output matrix is already filled in memory):

Naive Numpy:                           700 ms
Above code without Numba (sequential): 287 ms
Numba version (sequential):            157 ms
Numba version (parallel):              104 ms
Very-optimized C code (parallel):       68 ms
Theoretical optimum:                    42 ms

This concludes the article on how to avoid the huge overhead of single-threaded NumPy transposition. We hope the recommended answer is helpful, and thank you for your support!
