使用位与AND和popcount而不是实际的int或float乘法的大(0,1)矩阵乘法?

本文介绍了使用位与AND和popcount而不是实际的int或float乘法的大(0,1)矩阵乘法?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

对于乘法大型二进制矩阵(10Kx20K)，我通常要做的是将矩阵转换为浮点矩阵并执行浮点矩阵乘法，因为整数矩阵乘法非常慢().

For multiplying large binary matrices (10Kx20K), what I usually to do is to convert the matrices to float ones and perform float matrix multiplication as integer matrix multiplication is pretty slow (have a look at here).

但这一次，我需要执行数十万次这样的乘法运算，并且对我而言，平均而言，即使是毫秒级的性能提升.

This time though, I'd need to perform over hundred thousands of these multiplications and even a millisecond performance improvement on average matters to me.

我希望得到一个int或float矩阵，因为乘积可能具有非0或1的元素.输入矩阵元素全为0或1，因此可以将它们存储为单个位

I want an int or float matrix as a result, because the product may have elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.

在行向量和列向量之间的内积(产生输出矩阵的一个元素)中，乘法简化为按位与.加法仍然是加法运算，但是我们可以使用填充计数功能添加位，而不必单独遍历它们.

In the inner-product between a row vector and a column vector (to produce one element of the output matrix), multiplication simplifies to bitwise AND. Addition is still addition, but we can add bits with a population-count function instead of looping over them individually.

其他一些布尔/二进制矩阵函数或位而不是对它们进行计数，产生位矩阵结果，但这不是我所需要的.

Some other boolean/binary-matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.

这是一个示例代码，显示以std::bitset，AND和count运算形成问题的速度比矩阵乘法快.

Here is a sample code showing that forming the problem as std::bitset, AND and count operations is faster than matrix multiplication.

#include <iostream>
using std::cout; using std::endl;
#include <vector>
    using std::vector;
#include <chrono>
#include <Eigen/Dense>
    using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
    using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
    using std::bitset;

using std::floor;

const int NROW = 1000;
const int NCOL = 20000;

const float DENSITY = 0.4;
const float DENOMINATOR = 10.0 - (10*DENSITY);

void fill_random(vector<float>& vec) {
    random_device rd;
    mt19937 eng(rd());
    uniform_int_distribution<> distr(0, 10);
    int nnz = 0;
    for (int i = 0; i < NROW*NCOL; ++i)
        vec.push_back(floor(distr(eng)/DENOMINATOR));
}

void matmul(vector<float>& vec){
    float *p = vec.data();
    MatrixXf A = Eigen::Map<Eigen::Matrix<float, NROW, NCOL, Eigen::RowMajor>>(p);
    cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
    cout << "Total non-zero values : " << A.sum() << endl;
    cout << "The density of non-zero values is " <<  A.sum() * 1.0 / (A.cols()*A.rows()) << endl;

    auto start = std::chrono::steady_clock::now();
    MatrixXf B = A.transpose() * A;
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "Mat mul took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "Eigen coo ";
    for (int i=0; i<10; ++i)
        cout << B(0,i) << " ";
    cout << endl;
}


void bitset_op(vector<float>& vec) {
    // yeah it's not a great idea to set size at compile time but have to
    vector<bitset<NROW>> col_major(NCOL);

    // right, multiple par for isn't a good idea, maybe in a parallel block
    // Doing this for simplicity to profile second loop timing
    // converting row major float vec to col major bool vec
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int i=0; i < NROW; ++i) {
            col_major[j].set(i, vec[i*NCOL + j] && 1);
        }
    }

    auto start = std::chrono::steady_clock::now();
    vector<int> coo;
    coo.assign(NCOL*NCOL, 0);
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int k=0; k<NCOL; ++k) {
            coo[j*NCOL + k] = (col_major[j]&col_major[k]).count();
        }
    }
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "bitset intersection took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "biset coo ";
    for (int i=0; i<10; ++i)
        cout << coo[i] << " ";
    cout << endl;
}


int main() {
    // Saving to float instead of int to speed up matmul
    vector<float> vec;
    fill_random(vec);
    matmul(vec);
    bitset_op(vec);
}

通过以下方式运行此

g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code

我得到:

Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
biset coo 458 206 208 201 224 205 204 199 217 210

如您所见，作为位集操作集合的matmul大约比Eigen的float matmul快3倍，这是有道理的.

As you can see, matmul as set of bitset operations is about 3x faster than Eigen's float matmul, which makes sense.

我想强调一点，我需要在100K以上的操作(在HPC或云中)上执行此操作，而平均毫秒级的性能改进将有所作为.

I want to emphasize that I need to perform this operation over 100K (in the HPC or cloud) and a millisecond performance improvement on average would make a difference.

我不受任何特定库，C ++标准等的约束.因此，请随意回答您认为比使用GPU更快的任何解决方案，因为由于多种原因，我无法使用它

I'm not bound to any specific library, C++ standard, etc. So please feel free to answer with any solution that you think is faster other than those using GPU, as I can't use it for a number of reasons.

推荐答案

我不确定有多少(如果有的话)可以让编译器为您完成，而无需使用内在函数或C ++向量类包装器进行手动向量化(如果您的项目许可证与GPL兼容，则类似于 Agner Fog的VCL .也有一些非GPL包装器.

I'm not sure how much, if any, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.

对矩阵乘法进行缓存块是一项很好的工作(在这里很重要)，如果您可以使用Eigen的现有模板，但使用按位and在整数上，而不是在浮点数上相乘.我不确定这是否可能.

Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise and on integers, instead of multiply on floats. I'm not sure if this is possible.

我做了一些搜索，关于二进制矩阵的大多数文献都是关于产生布尔结果的(包括SO问题).向量内积是使用AND作为乘法，但使用XOR或OR作为加法，而不是popcount.也许有一个我找不到的搜索词描述了正常"矩阵，而这些矩阵恰好是(0,1)矩阵，但是乘积却没有.

I did some searching, and most of the literature about binary matrices is about producing a boolean result (including SO questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search-term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.

因为每毫秒都很重要，所以您可能必须手动将其向量化.

Since every millisecond matters, you're probably going to have to manually vectorize this.

一般来说，并不是向量整数的东西很慢，而是 just 向量整数的乘法比较慢，尤其是与最近x86硬件上的vector- float FMA相比(尤其是Intel，它具有FP FMA在Haswell及更高版本上每个时钟2x 256b向量的吞吐量.

It's not that vector-integer stuff is slow in general, it's just vector-integer multiply that's slow, especially compared to vector-float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).

由于您不需要使用布尔元素进行实际的乘法运算，而只需要一个AND(每个时钟吞吐量3个向量)，这对您来说就不成问题了.每个向量执行更多元素所获得的效率收益，应该比弥补每个向量的任何额外成本要多.

Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency-gain from doing many more elements per vector should more than make up for any extra cost per-vector.

当然，这假定使用与等效FP matmul相同的所有缓存块和其他优化的整数matmul实现，这就是问题所在，如果您不想(或不知道如何)自己编写，找不到适合您的库.

Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.

我正在回答一个问题，即采用最佳实现可能会达到的效率.标题问题的答案是非常明确的是，使用实际的乘法运算会浪费大量时间，尤其是使用32位元素时.

I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes, it's a massive waste of time to use actual multiplication, especially with 32-bit elements.

每字节一个元素(0/1):

密度是float的4倍(缓存占用空间/内存带宽/每个向量元素)
易于与字节随机播放换位

如果重要(例如，在外循环上进行矢量化并一次处理多行或多列)，
垂直ADD很容易.如果您将数据交织到其中，可能会很好(避免在行末添加水平和).一种无需额外改组就能使这项工作正常进行的方式.)

4x the density of float (cache footprint / memory bandwidth / elements per vector)
easy to transpose with byte shuffles
vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop, and working on multiple rows or multiple columns at once. Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling.)

每字节4个元素，打包成低半字节:

4倍于单独字节的密度
使用AVX2 vpshufb 非常高效.从理论上讲，在L1D高速缓存中输入很热的情况下，您可以在每个时钟周期(每个内核)以128个AND结果元素的吞吐量加载/与/累积一个popcount.每个时钟4个融合域uops会使SKL/HSW前端发布带宽达到每个时钟4个，并且不会在3个矢量ALU端口上造成瓶颈，因为其中一个uops是纯负载. (另一个用vpand加载的微型保险丝).当成为L2带宽的瓶颈(每个周期约32B负载)时，每个时钟以64个元素运行.参见下文.
从整数或打包位图创建的速度较慢(但如果您将位按交错顺序放入向量中以有效地打包/解包到有序字节，而不是强制使位按顺序排列，则也不错).
难以调换(可能比完全装填还差)

4x the density of separate bytes
very efficient to popcount with AVX2 vpshufb. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with the vpand). When bottlenecked on L2 bandwidth (~one 32B load per cycle), runs at 64 elements per clock. See below.
slower to create from integer or packed-bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack to in-order bytes, rather than forcing bits to be in order).
hard to transpose (maybe worse than fully packed)

打包位:

独立字节密度的8倍，每个AVX2向量256个元素.

可以从具有pmovmskb的向量创建
，以实现非交错存储顺序. (不过，对于即时创建不是很有用，因为这样会将结果放入整数reg中，而不是向量中.交错的位顺序可能是最好的，尤其是在转置过程中解包).
使用AVX2弹出计数非常有效:mask/shift + mask/2x vpshufb. (9个融合域uops(8个vector-ALU uops)用于AND +累积256个元素的弹出计数(来自2行/列矢量)，而8字节(6个vector-ALU uops)的字节数为4(按字节策略)( (来自4行/列向量).)ALU端口瓶颈将其限制为每个时钟来自L1D或L2的96个元素.因此，当它遇到L2带宽瓶颈时，pack4策略的内部产品吞吐量约为其1.5倍，而在L1D中热数据的吞吐量则为3/4，从理论上讲，仅计算内部循环.这只是内部产品部分，没有考虑不同的包装/拆包成本.
难以移置(但对于 pmovmskb从每个字节中提取1位并使其连续().

8x the density of separate bytes, 256 elements per AVX2 vector.
can be created from vectors with pmovmskb for a non-interleaved storage order. (not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, esp. for unpacking during a transpose).
fairly efficient to popcount with AVX2: mask / shift+mask / 2xvpshufb. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is just the inner-product part, not accounting for different pack/unpack costs.
hard to transpose (but maybe not horrible with pmovmskb to extract 1 bit from each byte and make them contiguous).

每字节6个元素，0xxx0xxx (在HSW/SKL上此问题可能没有优势，但值得考虑):

6 elements per bytes, 0xxx0xxx (probably no advantages for this problem on HSW/SKL, but interesting to consider):

6倍于单独字节的密度
通过移位/或，以交错方式从0/1字节创建，非常容易，与每字节4位格式相同.
针对使用AVX2 vpshufb 的高效弹出计数进行了优化.无需在2x vpshufb之前屏蔽，只需右移1个即可. (如果设置了高位，则vpshufb将字节清零，否则将使用低位半字节作为索引.这就是为什么需要屏蔽的原因.)右移此格式4(vpsrld ymm0,4)仍将保留零.每个字节的高位. Load + AND->累积popcount每个向量为7个融合域uops(vmovdqa/vpand ymm,[mem]/vpsrld ymm,4/2x vpshufb/2x vpaddb)，其中只有6个需要ALU端口.因此，HSW/SKL吞吐量理论上是每2个时钟1个向量(192个元素)，或每个时钟96个元素.这要求每个时钟的平均负载吞吐量为一个256b向量，因此正好解决了L2带宽瓶颈.

6x the density of separate bytes
fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4 bits per byte format.
optimized for efficient popcount with AVX2 vpshufb. No need to mask before 2xvpshufb, just 1 right-shift. (vpshufb zeros the byte if the high bit is set, otherwise it uses the low nibble as an index. This is why it needs the masking.) Right shifting this format by 4 (vpsrld ymm0,4) will still leave a zero in the high bit of every byte. Load+AND -> accumulate popcount is 7 fused-domain uops per vector (vmovdqa/vpand ymm,[mem]/vpsrld ymm,4/2xvpshufb/2xvpaddb), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.

理论上，它与完全打包相同，但是在实践中，它可能会更快或更慢，具体取决于哪个计划更好(例如，更少的AND/ADD指令会从随机播放中窃取端口5).完全打包可能更可能接近理论速度，因为它的更多微指令可以在多个端口上运行.调度混乱的可能性较小.

In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to theoretical speed, because more of its uops can run on multiple ports. Out-of-order scheduling imperfections are less likely.

与此有关的另一种变化是，使用单个AVX512VBMI(Cannonlake?)可以弹出计数每个字节7个元素(Cannonlake?). vpermi2b(_mm512_permutex2var_epi8) ，其中每个索引字节选择一个从另外两个寄存器的串联中取出128个字节.如此大的混洗可能会比较慢，但是希望它的吞吐率比AVX512 vpshufb单独的蚕食更好.

Another variation on this, 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b (_mm512_permutex2var_epi8), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but it will hopefully have better throughput than an AVX512 vpshufb separate-nibble thing.

使用AVX512VBMI(但不包含 AVX512VPOPCNTDQ )来计算打包8 )，则可以使用vpermi2b来计数低7，然后对第一个位进行移位+屏蔽并添加它. (一位的流行人数=该位).

To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use vpermi2b to count the low 7, then shift+mask the top bit and just add it. (popcount of a single bit = that bit).

uint8_t元素更容易有效地洗牌(因为存在像vpshufb这样的字节洗牌)，因此如果您必须即时转置，可能值得考虑.还是只在转置时动态打包?

uint8_t elements are easier to shuffle efficiently (since there are byte shuffles like vpshufb), so may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?

32位整数也是一个选择，但不是一个好的选择.每个向量中的元素较少意味着转置中的随机播放指令较少，但不是4的倍数.转置中的随机播放数量可能与log2(每个向量中的元素)类似.

32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).

这对于缓存占用空间/内存带宽也很重要.大小差异为8的倍数可能意味着，整行或整列只占用L1的一部分，而不是溢出L1.因此，它可以使缓存阻塞更容易/不那么重要.

This is also a big deal for cache footprint / memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.

10k * 20k/8 = 23.84MiB.这远远大于L2缓存(在Haswell上为256kiB，在Skylake-AVX512上为1MiB )，但将在多核Xeon CPU的L3中使用.但是L3被所有核心(包括云环境中的其他VM)竞争性地共享，并且比L2慢得多. (像您将在HPC/云系统中运行的许多核心Xeon的每核内存带宽要比四核台式机低，这是因为L3缓存的延迟更长，而并发性却没有增加(请参阅.即使在总吞吐量上，也需要更多的内核才能在Xeon上驱动相同数量的内存带宽.更高.但是，如果您可以使每个核心大部分都在其私有L2上工作，那么您将获得很多.)

10k * 20k / 8 = 23.84MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you will be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer. It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)

相加和结果:您已安排好循环，因此需要将一次布尔运算减少为非零计数.这是一件好事.

Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.

对于8位整数0/1元素，最多可以做255个vpaddb，然后元素才会溢出.它具有良好的吞吐量:Haswell每个时钟2个，Skylake每个时钟3个.使用多个累加器，可以覆盖许多AND结果向量.对全零向量使用 vpsadbw将向量中的字节水平添加为64位整数.然后将累加器与vpaddq，组合在一起，然后水平求和它.

With 8-bit integer 0/1 elements, you can do up to 255 vpaddb before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq, then horizontally sum it.

使用压缩位，您只想对AND结果的向量进行计数.有了AVX2，并且您的数据已经在矢量中，您肯定要使用基于VPSHUFB的位切片弹出计数. (请参见 http://wm.ite.pl/articles/sse-popcount.html .如果您必须手动对其进行矢量化处理，则需要使用内部函数而不是asm来编写它.)

With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use aVPSHUFB-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

您可以考虑将数据每字节4位打包，以低半字节为单位.这意味着一个vpshufb可以对AND结果的每个字节中的位进行计数，而无需进行任何移位/掩蔽.在内部循环中，您将有2个负载，分别是vpand，vpshufb和vpaddb.通过适当的展开，这应该可以跟上每个时钟2x 32B的L1D负载带宽，并使所有三个向量执行端口(在Haswell或Skylake上)都饱和.中断每128或255个向量，或使用vpsadbw/vpaddq累加某个累加器的字节. (但是对于缓存阻止，您可能还是想经常这样做，并执行不同的工作). 因此，最里面的循环应以每个字节4个元素*每个向量32B =每个时钟周期128个元素，，如果您可以安排它读取L1D缓存中较热的数据.从Haswell/Skylake的L2缓存中获得大约一半的带宽，或者从L3缓存中获得更差的带宽.

You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, vpand, vpshufb, vpaddb. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with vpsadbw/vpaddq. (But with cache-blocking, you probably want to break out that often anyway and do different work). So the inner-most loop should run at 4 elements per byte * 32B per vector = 128 elements per clock cycle, if you can arrange for it to read data that's hot in L1D cache. Expect about half that bandwidth from L2 cache on Haswell/Skylake, or much worse from L3 cache.

对于uint8_t元素为0或1的情况，可以使用一些整数乘法加法指令.它们的设计有点怪异，旨在用于与FP FMA不同的用例.它们将水平对相乘结果相乘，从而产生更宽的元素. VPMADDUBSW 可以从8位元素扩展到16位元素，并且可以在0和1上很好地工作.由于每个元素只能在0..2范围内，因此您仍然可以使用vpsadbw进行水平和.但是，如果您要转到vpsadbw，则与vpand相比，您没有任何收获.仅当您想在向量累加器中使用vpaddw来使用16位元素时才有用，而不是为了避免字节溢出而中断循环. vpmaddubsw doesn't seem useful here, because vpsadbw`是水平添加字节的更好方法.

With uint8_t elements that are 0 or 1, you can maybe use some integer multiply-add instructions. They're a bit weirdly designed, intended for different use-cases than FP FMA. They add horizontal pairs of multiply results, producing wider elements. VPMADDUBSW widens from 8 to 16 bit elements, and would work well on 0s and 1s. Since each element can only be in the range 0..2, you can still horizontal-sum with vpsadbw. But if you're going to vpsadbw, this gains you nothing over vpand. It would only be useful if you wanted to use vpaddw to use 16-bit elements in your vector accumulator, instead of breaking out of a loop to avoid byte overflow. vpmaddubsw doesn't seem useful here, becausevpsadbw` is a better way to horizontally add bytes.

使用SSE/AVX可以有效地将0/1整数转换为位图:对于32位整数元素，vpslld ymm0, 31将相关位左移到每个元素的顶部，然后vmovmskps eax, ymm0获得每个32位元素的高字节的8位掩码.对于8位整数元素，vpslld ymm0, 7/vpmovmskb eax, ymm0会执行相同的操作，但对于每个字节，将生成32位整数位图结果. (只有每个字节的符号位很重要，因此没有仅具有8位粒度的移位指令就可以了.您无需对进入下一个元素的位进行任何操作.)

Converting 0/1 integers to bitmaps can be done efficiently with SSE/AVX: For 32-bit integer elements, vpslld ymm0, 31 to left-shift the relevant bit to the top of each element, then vmovmskps eax, ymm0 to get an 8-bit mask of the high byte of each 32-bit element. For 8-bit integer elements, vpslld ymm0, 7 / vpmovmskb eax, ymm0 to do the same thing but for each byte, producing a 32-bit integer bitmap result. (Only the sign bit of each byte matters, so it's fine that there are no shift instructions with only 8 bit granularity. You don't need to do anything about the bits that carry into the next element.)

这不是立即使用向量的好方法，因为最终将结果存储在整数寄存器中.这不是即时生成和使用的好格式，但是它是最紧凑的格式，因此如果您可以长期保持这种格式的矩阵就有意义. (而且，如果在加载它们时会受到内存带宽的限制.)

This isn't a very good method for using right away with vectors, because you end up with the results in integer registers. This isn't a great format to generate and use on the fly, but it is the most compact so it can make sense if you can keep matrices in this format long-term. (And if you'll be limited by memory bandwidth when loading them.)

将32位整数转换为8位:一种方式是使用2x vpackssdw + vpacksswb.由于这些元素在128b通道内运行，因此您的元素将最终重新排序.但这没关系，只要对每一行/每一列都使用相同的顺序即可.如果您想获取不是以32个元素的倍数开头的行/列的一部分，那么这只是一个问题.此处的另一种选择是将左向量(或8、16和24)左移或与向量在一起.实际上，您可以使用1、2或3个字节的未对齐负载偏移来免费进行移位.

Converting 32-bit integers to 8-bit: One ways is with 2x vpackssdw + vpacksswb. Because those operate within the 128b lanes, your elements will end up reordered. But that's ok as long as it's the same ordering for every row/column. It's only a problem if you want to take a chunk of a row/column that doesn't start at a multiple of 32 elements. Another option here is to left-shift (by 8, by 16, and by 24), and OR vectors together. Actually, you can do the shifting for free by using an unaligned load offset by 1, 2, or 3 bytes.

static inline
__m256i load_interleave4x32(const int32_t *input) {
  const char *p = (const char*)input;
  __m256i t0 = _mm256_load_si256((const __m256i*)(p));
  __m256i t1 = _mm256_load_si256((const __m256i*)(p+32*1-1));  // the 1/0 bits will be in the 2nd byte of each 32-bit element
  __m256i t2 = _mm256_load_si256((const __m256i*)(p+32*2-2));
  __m256i t3 = _mm256_load_si256((const __m256i*)(p+32*3-3));
  return t0 | t1 | t2 | t3;
  // or write this out with _mm256_or_si256, if you don't have overloaded operators like GNU C does.
  // this should compile to 1 load and 3 vpor ymm0, [rdi+31] ... instructions.
}

转换为每个字节半包装的4位:我们可以使用与上述相同的方法.从load_interleave4x32(或从uint8_t的数组开始，如果您从8位元素开始)获取4个向量.将它们左移0、1、2和3位，然后将这些位或在一起.当我们只需要对行/列进行AND和popcount整个结果时，这种交错的位顺序就很好了，因为顺序无关紧要.该位顺序非常有效地将其拆包回有序字节，例如与set1_epi8(1)一起使用将为您提供字节向量.

Converting to half-packed 4 bits per byte: we can use the same idea as above. Get 4 vectors from load_interleave4x32 (or from an array of uint8_t if you started with 8-bit elements). Left-shift them by 0, 1, 2, and 3 bits, and OR those all together. This interleaved bit-order is fine when we just need to AND a row/column and popcount the whole result, because order doesn't matter. This bit-order is fairly efficient to unpack back to in-order bytes, e.g. AND with set1_epi8(1) will get you a vector of bytes.

如果您以这种格式存储整个矩阵，则可以将其用作转置的一部分，或者可以使用此格式为缓存阻止的转置存储临时副本. matmul会多次触摸每一行/每一列，因此，当您第一次使紧凑格式在后续遍历中使每个矢量进行4倍的工作时，可能值得做一些额外的工作.

You might use this as part of a transpose if you store your whole matrices in this format, or you could use this format to store temporary copies for a cache-blocked transpose. A matmul touches each row/column multiple times, so it may be worth doing extra work to make a compact format the first time when that lets you do 4x as much work per vector on subsequent passes.

使用AVX512BW(Skylake-AVX512)

我们真的想对向量进行AND和popcnt运算，而不是对标量整数进行运算，因为向量的宽度是AVX2的两倍，因此它们比标量popcnt更远. (即使Skylake-AVX512在运行512b指令时关闭了端口1上的矢量ALU(但不是标量).)

We really want to be doing the AND and popcnt with vectors, not with scalar integer, because the vectors are twice as wide as AVX2, so they pull further ahead of scalar popcnt. (Even though Skylake-AVX512 shuts down the vector ALUs (but not scalar) on port 1 while running 512b instructions).

，它使我们可以执行2/3的矢量弹出计数，而这需要额外的整数运算.

@Harold points out an interesting identity that lets us do 2/3rds as many vector popcounts, at the cost of extra integer operations.

   popcnt(a) + popcnt(b) + popcnt(c)
 = popcnt(a ^ b ^ c) + 2 * popcnt((a ^ b) & c | (a & b))

a ^ b ^ c和(a ^ b) & c | (a & b)可以使用一个 vpternlogd 每个(因为每个都有3个布尔输入).如果我们使用单独的预移位的vpshufb LUT向量，则2*是免费的.另请参见使用30x vpternlogd的实现+1个矢量popcnt可以处理16个512b矢量，最后进行一些清理(仅在循环内执行16*popcnt计数；其他所有内容都链接在一起).

a ^ b ^ c and (a ^ b) & c | (a & b) can be done with one vpternlogd each (since each have 3 boolean inputs). The 2* is free if we use a separate pre-shifted vpshufb LUT vector. See also this implementation that uses 30x vpternlogd + 1 vector popcnt to handle 16 vectors of 512b, with some cleanup at the end (only doing the 16*popcnt counts inside the loop; everything else is chained).

这对于计数每字节8位全压缩元素非常值得，并且与针对弹出计数优化而无需太多移位/屏蔽的密度较小的格式相比，AVX512更具吸引力.

This is very likely worth it for counting fully-packed 8 bits per byte elements, and makes that format a lot more attractive for AVX512, compared to less-dense formats optimized for popcounting without as much shifting/masking.

vpternlogd也可用作换位的位混合指令.

vpternlogd can also be useful as a bit-blend instruction for transposes, if byte-granularity VPBLENDMB zmm{k1}, zmm, zmm isn't fine-enough grained.

这对于某些CPU上的AVX2可能是值得的，也许避免每4或5个矢量弹出计数中的1个，而不是3个中的1个?否则，如果只是增加总执行端口压力，并且对任何特定端口都没有瓶颈，则可能根本无济于事.对于标量popcnt指令(可能在没有AVX2的CPU上)，这将很有用，因为这些指令确实会在Intel CPU的单个端口上造成瓶颈.

This might be worth it for AVX2 on some CPUs, maybe avoiding 1 out of every 4 or 5 vector popcounts rather than 1 of 3? Or it might not help at all if it just increases the total execution port pressure, and there wasn't a bottleneck on any specific one. It would be useful with scalar popcnt instructions (maybe on CPUs without AVX2), because those do bottleneck on a single port on Intel CPUs.

与AVX2相比，我们可以将uint8_t布尔元素转换为非交错位图的效率略高(甚至不需要移位)，并且反向转换的效率更高.对set1_epi8(1)的向量进行测试掩码或比较掩码都可以完成此工作，从64个字节的输入中产生64位掩码.或者以32位整数开头，一次生成16位掩码.您可以使用kunpck指令有效地连接这些位.

We can turn uint8_t boolean elements into non-interleaved bitmaps slightly more efficiently than AVX2 (without even needing a shift), and do the reverse much more efficiently. Test-into-mask or compare-into-mask against a vector of set1_epi8(1) would both do the job, producing 64 bits of mask from 64 bytes of input. Or with 32-bit integers to start with, producing 16 bits of mask at a time. You can efficiently concatenate those bits with kunpck instructions.

_mm512_test_epi8_mask()很有趣:将两个向量与在一起，并产生一个true/false字节元素的mask-register结果.但这并不是我们真正想要的:如果要打包我们的位，我们希望将其作为对输入矩阵的预处理步骤，而不是在进行内积运算时即时进行.

_mm512_test_epi8_mask (vptestmb) is interesting: AND two vectors together, and produce a mask-register result of byte-elements that were true/false. But this isn't really what we want: if we're going to pack our bits, we want to do it as a pre-processing step on the input matrices, not on the fly while doing inner products.

位图-> 0/-1的向量很快: __m512i _mm512_movm_epi8 (__mmask64 k)(vpmovm2b)只需一条指令即可完成.您可以减去-1而不是添加1，但是您必须先屏蔽它，然后才能将一个字节中的多个位或在一起.

bitmap -> vector of 0 / -1 is fast: __m512i _mm512_movm_epi8 (__mmask64 k) (vpmovm2b) does that in one instruction. You can subtract -1 instead of adding 1, but you'd have to mask it before you can OR together multiple bits within a byte.

没有AVX512BW或AVX512DQ(骑士登陆至强披披)，您没有512b vpshufb，因此无法高效地引导popcnt.有一个直接用于矢量popcnt的 AVX512 popcnt扩展，但没有硬件它甚至还没有宣布. (但是，在KNL上，AVX2 vpshufb ymm非常慢:每12个周期1个，而psadbw ymm是9个周期1个，因此即使使用256b向量也没有吸引力).您可能会使用基于32位整数元素的bithack popcnt，因为那只是AND/shift/ADD . 32位元素比64位元素花费更少的步骤，但仍足够大，不会因合理的问题大小而溢出(因此您可以将向量的水平和延至循环外)

Without AVX512BW or AVX512DQ (Knight's Landing Xeon Phi), you don't have 512b vpshufb so you can't vector popcnt as efficiently. There's an AVX512 popcnt extension for vector popcnt directly, but no hardware with it has even been announced yet. (AVX2 vpshufb ymm is very slow on KNL, though: one per 12 cycles, and psadbw ymm is 1 per 9 cycles, so even using 256b vectors is unattractive). You might use a bithack popcnt based on 32-bit integer elements, since that's just AND/shift/ADD. 32-bit elements will take fewer steps to popcnt than 64-bit, but are still big enough not to overflow for reasonable problem sizes (so you can defer a horizontal sum of the vector until outside a loop)

考虑到存储格式的选择，对于KNL来说，每字节打包多个位可能不是一个好主意，但是单字节整数元素是个好主意. vpandd zmm和vpaddd zmm都是AVX512F的快速组成部分，我们可以使用它们，因为无论如何我们都不希望单字节溢出. (当我们实际上有不会相互携带的8位元素时，使用打包的32位加法是 SWAR 技术.)我认为，相对于Skylake-AVX512，KNL具有良好的内存带宽和较差的指令吞吐量.

Given the choice of storage format, packing multiple bits per byte might not be a good idea for KNL, but single-byte integer elements are good. vpandd zmm and vpaddd zmm are both fast and part of AVX512F, and we can use them because we don't want to let our single-bytes overflow anyway. (Using a packed 32-bit add when we actually have 8-bit elements that won't carry into each other is a SWAR technique.) KNL has good memory bandwidth and poor instruction throughput relative to Skylake-AVX512, I think.

BMI2 _pdep_u64 在这里可能有用.这是标量指令/本征指令.如果它使位转置比将其解压缩到字节高效得多，则您可能需要存储一个转置结果块，然后再将其与AND + count的矢量加载一起重新加载. (在标量存储之后立即重新加载向量会导致存储转发停顿.)

BMI2 _pdep_u64 might be useful here. It's a scalar instruction/intrinsic. If it makes bit-transpose a lot more efficient than unpacking to bytes, you'd probably want to store a block of transpose results before reloading it with vector loads for AND + count. (Reloading a vector right away after scalar stores will cause a store-forwarding stall.)

另一个有用的选择是vpmovmskb可以从32字节向量中切出32位，每个字节一个.这为您提供了转置的构建块，可以与字节混洗结合使用，以正确的顺序获取字节.有关更多信息，请参见此博客帖子，以及如何对二进制矩阵进行转置?.

Another useful option is that vpmovmskb can slice 32 bits out of a 32-byte vector, one per byte. This gives you a building block for a transpose, maybe combined with byte shuffles to get the bytes in the right order for it. For more, see this blog post, and also How would you transpose a binary matrix?.

您的某些选择取决于输入数据的格式，以及重复使用相同矩阵的频率.如果矩阵将被多次使用，则提前将其压缩为每字节4或8位是有意义的. (或在首次使用时即时获得).保留转置副本也很有意义，尤其是当它始终是乘法中需要转置的一面时. (如果有时需要一种方法，有时又需要另一种方法，那么快速进行重做可能会更适合L3缓存占用空间.但是这些足够大，您可能不会受到很多L3命中，因此只需保留转置副本就可以了.好.

Some of your choices depend on what format your input data is in, and how often you will reuse the same matrices. If a matrix will be used multiple times, packing it down to 4 or 8 bits per byte ahead of time makes sense. (Or on the fly the first time it's used). Keeping a transposed copy of it may also make sense, especially if it will always be the side of the multiply that needs transposing. (If you sometimes need one way and sometimes the other, redoing on the fly might be better for L3 cache footprint. But these are big enough that you probably won't get a lot of L3 hits, so just keeping a transposed copy could be good.)

或者甚至在从输入格式转换时写出转置和非转置的版本.

Or maybe even write out a transposed and non-transposed version while converting from your input format.

您肯定希望对乘法进行缓存块，因此在L1处于热状态时，相同的数据将被多次重用.我无话可说. 与缓存阻止普通FP matmul时使用的原理相同，因此请仔细阅读.

You will definitely want to cache-block the multiplies, so the same data is reused multiple times while hot in L1. I don't have anything useful to say about this off the top of my head. The same principles apply as when cache-blocking a normal FP matmul, so go read about that.

在整个列中使用位集&将把这些值放回内存中，然后您将在结果上的.count()中再次遍历它们.我怀疑编译器是否会将其优化为一个单循环，该循环在VPAND结果的每个向量上使用基于VPSHUFB的位切片popcnt，但这会更好. (请参见 http://wm.ite.pl/articles/sse-popcount.html .如果您必须手动对其进行矢量化处理，则需要使用内部函数而不是asm来编写它.)

Using a bitset & for a whole column will put the values back in memory, and then you'll loop over them again in .count() on the result. I doubt that the compiler will optimize this into a one-pass loop that uses a VPSHUFB-based bit-slicing popcnt on each vector of VPAND results, but that would be much better. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

对于您的矩阵大小，至少内部循环可能会在L1D高速缓存中命中，但是两次循环产生的额外加载/存储指令会增加开销，并且还会干扰有价值数据的预取.

With your matrix sizes, at least that inner loop probably hits in L1D cache, but the extra load/store instructions from looping twice are more overhead and it also interferes with prefetch of the valuable data.

让编译器有效弹出动态尺寸的位图(无需手动矢量化)并不容易.唯一不会吸引的是带有vector<bool>的 clang++ -stdlib=libc++ ，它将std::count(v.begin(), v.end(), true);编译为一个向量化的vpshufb + vpsadbw + vpaddq循环，这非常好.如果只在展开循环内使用vpaddb，并且每次迭代使用一次vpsadbw + vpaddq，将会更快，但是对于自动矢量化的代码来说相当不错.

Getting compilers to efficiently popcnt a dynamically-sized bitmap (without manually vectorizing) is not easy. The only thing that doesn't suck is clang++ -stdlib=libc++ with vector<bool>, which compiles std::count(v.begin(), v.end(), true); to a vectorized vpshufb + vpsadbw + vpaddq loop, which is quite good. It would be faster if it just used vpaddb inside the unrolled loop and vpsadbw + vpaddq once per iteration, but it's pretty good for auto-vectorized code.

g ++的vector<bool>也是位图，但是std::count(v.begin(), v.end(), true);却很糟糕:它使用了一个完全幼稚的循环，一次可以测试1位.而且它甚至没有有效地做到这一点.与clang++相同，默认为libstdc++，而不是新的libc++.

g++'s vector<bool> is also a bitmap, but std::count(v.begin(), v.end(), true); is very bad: it uses a totally naive loop that tests 1 bit at a time. And it doesn't even do that efficiently. Same for clang++ with the default libstdc++ instead of the new libc++.

boost::dynamic_bitset具有.count()成员函数，但没有利用popcnt指令或AVX2.它一次执行一个字节的LUT查找.这比没有libc ++的std::count(vector<bool>)好得多，但是对于HPC来说还不够好.

boost::dynamic_bitset has a .count() member function, but it doesn't take advantage of the popcnt instruction or AVX2. It does a byte-at-a-time LUT lookup. That's much better than std::count(vector<bool>) without libc++, but it's not even close to good enough for HPC.

这是测试代码在Godbolt编译器浏览器上，带有gcc和clang asm输出.他们都使用-march=haswell.

Here's the test code on the Godbolt compiler explorer, with gcc and clang asm output. All of them used -march=haswell.

但是不幸的是，似乎没有一种有效的方法将两个std::vector<bool>按位与. 此答案显示了如何获得g ++ libstdc++ vector<bool>的基础实现，但是该代码不会自动执行-矢量化.对libc++做同样的事情并对其进行调整，以使其自动对 might 进行矢量化处理，可以使您获得手动矢量化所能提供的相当一部分性能(除了转置)，但是您可能不得不将整个矩阵保存在一个vector<bool>中，因为向量的向量是间接的糟糕级别.如果问题的转置部分也对性能至关重要，那么使用标准容器访问有效的popcount并不能解决整个问题.

But unfortunately, there doesn't seem to be an efficient way to bitwise-AND two std::vector<bool>. This answer shows how to get at the underlying implementation of g++'s libstdc++ vector<bool>, but that code doesn't auto-vectorize. Doing the same thing for libc++ and tweaking it so it auto-vectorizes might let you get a good fraction of the performance possible with manual vectorization (except for transpose), but you'd probably have to keep your whole matrix in one vector<bool>, because a vector of vectors is a bad extra level of indirection. If the transpose part of the problem is performance-critical too, using standard containers to get access to an efficient popcount is not going to solve the whole problem.

对于std::bitset<1024*1024>.count()，无论是否使用libc++，clang都会产生相同的有效AVX2 popcount. g ++使用64位popcnt指令进行标量循环，(根据 )在Haswell和Skylake上比小比特集的AVX2 popcnt快一些，但对大比特集的速度要慢一些.

For std::bitset<1024*1024>.count(), clang makes the same efficient AVX2 popcount with or without libc++. g++ makes a scalar loop using the 64-bit popcnt instruction, which (according to this) is somewhat faster than a good AVX2 popcnt for small bitsets, but somewhat slower for large bitsets, on Haswell and Skylake.

另请参见:在vector<bool>上-Howard Hinnant 关于C ++标准库的说明，以及为什么位数组是有用的数据结构，但vector<bool>对此却是一个不好的名字.此外，一些基准用于适当优化的count/find_first/etc.在位向量上与每字节1个bool bool[]数组相比，在天真的vector<bool>上(如您从gcc和clang中获得而没有libc ++).

See also: On vector<bool> — Howard Hinnant, for some commentary on the C++ standard library, and why an array of bits is a useful data structure, but vector<bool> is a bad name for it. Also, some benchmarks for properly-optimized count/find_first/etc. on a bit-vector vs. a 1 bool-per-byte bool[] array, vs. a naive vector<bool> (like you get from gcc and clang without libc++).

这篇关于使用位与AND和popcount而不是实际的int或float乘法的大(0,1)矩阵乘法?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！