Fast multiplication of k x k boolean matrices, where 8 <= k <= 16

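Problem description

I want to find the fastest possible way to multiply two small boolean matrices, where small means 8x8, 9x9, ..., 16x16. The routine will be called very often, so it needs to be very efficient; please don't suggest that the straightforward solution should be fast enough.

For the special cases 8x8 and 16x16 I already have fairly efficient implementations, based on the solution found here, in which the entire matrix is treated as a uint64_t or a uint64_t[4] respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.

However, for 8 < k < 16 I don't really know how to find a reasonable representation that enables clever tricks like the ones above.

So basically I'm open to any suggestions, using any representation of the matrices and any function signature. You may assume that this targets either a 32-bit or a 64-bit architecture (pick whichever best suits your suggestion).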

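Solution

Given two 4x4 matrices a = 0010,0100,1111,0001 and b = 1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.

Then each entry of the product a x b mod 2 is M(i,j) = popcount(a[i] & b'[j]) & 1, i.e. the parity of the AND of row i of a with row j of b'.

From this one can see that the cost grows only as n^2 word operations, as long as a row (bit vector) fits in a machine word.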
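The transpose-and-parity idea above can be sketched in C as follows. This is only an illustration under stated assumptions: each matrix is an array of uint16_t rows with column j stored in bit j, and the names parity16, transpose_bits and mul_gf2_parity are hypothetical helpers, not part of the original answer.

```c
#include <stdint.h>

/* Parity (popcount mod 2) of a 16-bit word, computed by XOR folding. */
static unsigned parity16(uint16_t x)
{
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Transpose a k x k bit matrix; bit j of m[i] is entry (i, j). */
static void transpose_bits(const uint16_t *m, uint16_t *t, int k)
{
    for (int j = 0; j < k; j++) {
        uint16_t row = 0;
        for (int i = 0; i < k; i++)
            row |= (uint16_t)(((m[i] >> j) & 1u) << i);
        t[j] = row;
    }
}

/* c = a * b over GF(2) for k <= 16: k*k AND + parity operations. */
void mul_gf2_parity(const uint16_t *a, const uint16_t *b, uint16_t *c, int k)
{
    uint16_t bt[16];
    transpose_bits(b, bt, k);
    for (int i = 0; i < k; i++) {
        uint16_t row = 0;
        for (int j = 0; j < k; j++)
            row |= (uint16_t)(parity16(a[i] & bt[j]) << j);
        c[i] = row;
    }
}
```

Where a compiler offers a parity or popcount builtin, it could replace parity16; the folding version is used here only to keep the sketch portable.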

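This can be sped up, at least for 8x8 matrices, provided that some special permutation and bit selection operations are available. With the NxN bits held in one vector, one iterates exactly N times (so 16x16 is pretty much the limit).

Each step accumulates a partial product: Result(n+1) = Result(n) XOR (A(n) .& B(n)), where Result(0) = 0 and '.&' is the element-wise AND. A(n) is A <<< n, where '<<<' denotes column-wise rotation of the elements, and B(n) copies diagonal elements of matrix B into every row: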

    a b c          a e i          d h c          g b f
B=  d e f  B(0) =  a e i  B(1) =  d h c   B(2) = g b f
    g h i          a e i          d h c          g b f
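A scalar sketch of this first variant might look like the following. It assumes rows are stored as uint16_t words with column j in bit j, so that rotating a row within its low k bits plays the role of the column-wise rotation and one word per step holds the diagonal B(n). The names rotr_k and mul_gf2_diag are hypothetical.

```c
#include <stdint.h>

/* Rotate the low k bits of x right by n (0 <= n < k <= 16). */
static uint16_t rotr_k(uint16_t x, int n, int k)
{
    uint32_t mask = (1u << k) - 1u;
    uint32_t v = x & mask;
    return (uint16_t)(((v >> n) | (v << (k - n))) & mask);
}

/* c = a * b over GF(2), accumulating Result ^= A(n) & B(n) for n = 0..k-1. */
void mul_gf2_diag(const uint16_t *a, const uint16_t *b, uint16_t *c, int k)
{
    uint16_t diag[16];

    /* diag[n] holds the n-th wrapped diagonal of b:
       bit j of diag[n] is entry ((j + n) mod k, j) of b. */
    for (int n = 0; n < k; n++) {
        uint16_t d = 0;
        for (int j = 0; j < k; j++)
            d |= (uint16_t)(((b[(j + n) % k] >> j) & 1u) << j);
        diag[n] = d;
    }

    /* After rotation, bit j of the row is entry (i, (j + n) mod k) of a,
       which lines up with the diagonal entry it has to be multiplied by. */
    for (int i = 0; i < k; i++) {
        uint16_t row = 0;
        for (int n = 0; n < k; n++)
            row ^= rotr_k(a[i], n, k) & diag[n];
        c[i] = row;
    }
}
```

In a vectorized version the inner work would be done for all rows at once, which is where the permutation and bit selection instructions mentioned above come in.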

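After thinking about it a bit further, a better option is to rotate matrix B row-wise ('^^^') and to take as A(n) the diagonals of A, each copied across its whole row: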

    a b c         a a a           b b b           c c c
A=  d e f  A(0) = e e e , A(1) =  f f f,  A(2) =  d d d
    g h i         i i i           g g g           h h h

EDIT To benefit later readers, I'd propose the full solution for W<=16 bit matrix multiplications in portable C.

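#include <stdint.h>

// Matrix width; any value up to 16 works. For W < 16 the entries of each
// word must be left-aligned, i.e. index 0 sits in bit 0x8000.
#ifndef W
#define W 16
#endif

// Computes c[i] = XOR of a[k] over every k for which bit (0x8000 >> k) of
// b[i] is set. Reading each word as a matrix column this is c = a*b over
// GF(2); reading each word as a row it is c = b*a.
void matrix_mul_gf2(uint16_t *a, uint16_t *b, uint16_t *c)
{
    // these arrays can be read in two successive xmm registers or in a single ymm
    uint16_t D[16];      // Temporary
    uint16_t C[16]={0};  // result
    uint16_t B[16];
    uint16_t A[16];
    int i,j;
    uint16_t top_row;
    // Preprocess B (while reading from input)
    // -- "un-tilt" the diagonal to bit position 0x8000
    for (i=0;i<W;i++) B[i]=(uint16_t)(((uint32_t)b[i]<<i) | (b[i]>>(W-i)));
    for (i=0;i<W;i++) A[i]=a[i];  // Just read in matrix 'a'
    // Loop W times
    // Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions
    for (j=0;j<W;j++) {
        // copy the top bit to the whole word (0x0000 or 0xFFFF) without
        // relying on an arithmetic right shift of a signed value
        for (i=0;i<W;i++) D[i]=(uint16_t)(0u-(B[i]>>15));
        for (i=0;i<W;i++) B[i]<<=1;                  // Prepare B for next round
        for (i=0;i<W;i++) C[i]^= A[i]&D[i];          // Add the partial product

        // rotate the rows of A up by one
        top_row=A[0];
        for (i=0;i<W-1;i++) A[i]=A[i+1];
        A[W-1]=top_row;
    }
    for (i=0;i<W;i++) c[i]=C[i];      // return result
}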
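The routine above takes its width from the compile-time symbol W. For the 8 < k < 16 case in the question, a run-time width may be more convenient. The following is a hedged sketch of such a variant, not the author's code. It assumes row-major matrices whose rows sit in the low w bits of each word, with column 0 in bit w-1. It takes the diagonals from a and rotates the rows of b, as the prose describes, so the result is a*b in row-major terms. The names mul_gf2_rotating and mul_gf2_reference are hypothetical; the reference routine is there only for checking results.

```c
#include <stdint.h>

/* Straightforward reference: c = a * b over GF(2), w x w, row-major,
   rows in the low w bits with column 0 in bit (w - 1). */
void mul_gf2_reference(const uint16_t *a, const uint16_t *b, uint16_t *c, int w)
{
    for (int i = 0; i < w; i++) {
        uint16_t row = 0;
        for (int k = 0; k < w; k++)
            if ((a[i] >> (w - 1 - k)) & 1u)   /* entry a(i, k) */
                row ^= b[k];
        c[i] = row;
    }
}

/* Rotating-diagonal product with a run-time width w (1 <= w <= 16). */
void mul_gf2_rotating(const uint16_t *a, const uint16_t *b, uint16_t *c, int w)
{
    uint16_t diag[16];        /* rows of a, left-aligned, rotated by row index */
    uint16_t acc[16] = {0};   /* accumulated result rows */
    int shift = 16 - w;

    /* Left-align each row of a and rotate it so that its diagonal entry
       a(i, i) lands in bit 0x8000. Bits that end up beyond column w - 1
       are never inspected, because only w steps are performed. */
    for (int i = 0; i < w; i++) {
        uint32_t row = ((uint32_t)a[i] << shift) & 0xFFFFu;
        diag[i] = (uint16_t)((row << i) | (row >> (w - i)));
    }

    for (int j = 0; j < w; j++) {
        for (int i = 0; i < w; i++) {
            /* 0xFFFF if a(i, (i + j) mod w) is set, else 0 */
            uint16_t mask = (uint16_t)(0u - (diag[i] >> 15));
            diag[i] = (uint16_t)(diag[i] << 1);
            /* indexing b replaces the explicit row rotation */
            acc[i] ^= b[(i + j) % w] & mask;
        }
    }

    for (int i = 0; i < w; i++)
        c[i] = acc[i];
}
```

With the 4x4 example from the start of the answer, written as binary numbers, a = {2, 4, 15, 1} and b = {12, 1, 4, 4} give c = {4, 1, 13, 4}, i.e. 0100,0001,1101,0100. Comparing both routines on random inputs for each width from 8 to 16 is a cheap way to validate any faster SIMD version built on the same idea.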



08-19 23:39