Efficiently subtract a vector from a matrix (Scipy)

Problem description

I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common task when you're doing things like normalization/standardization, but I can't seem to find the proper way to do this efficiently.

Here's a demonstration example:

import scipy.sparse

# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

# vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1, 2, 3]).T

"""
I want to subtract `vec` from each of the columns in `mat` yielding...
    [[0, 1, 2],
     [0, 1, 2],
     [0, 1, 2]]
"""

One way to accomplish what I want is to hstack vec to itself 3 times, yielding a 3x3 matrix where each column is vec and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.

Thanks!

Removed the 'in-place' constraint, because sparsity structure would be constantly changing in an in-place assignment scenario.

Recommended answer

For a start, what would we do with dense arrays?

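# vec.toarray() is the explicit spelling of the older vec.A shorthand
mat - vec.toarray()              # taking advantage of broadcasting
mat - vec.toarray()[:, [0]*3]    # explicit broadcasting
mat - vec[:, [0, 0, 0]]          # stays sparse; that also works with csr matrix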
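To make those one-liners concrete, here is a minimal runnable sketch using the matrix and vector from the question. The variable names `dense_result`, `sparse_result` and `n_cols` are only illustrative. The dense route returns a dense result, while the fancy-indexing route keeps everything sparse.

```python
import numpy as np
import scipy.sparse as sp

mat = sp.csc_matrix([[1, 2, 3],
                     [2, 3, 4],
                     [3, 4, 5]])
vec = sp.csc_matrix([1, 2, 3]).T   # 3x1 sparse column

# Dense route: densify the vector and let broadcasting spread it across columns.
dense_result = np.asarray(mat - vec.toarray())

# Sparse route: repeat column 0 of vec once per column of mat by fancy indexing,
# then subtract two sparse matrices of equal shape.
n_cols = mat.shape[1]
sparse_result = mat - vec[:, [0] * n_cols]

print(dense_result)
print(sparse_result.toarray())
```

Both print the 3x3 array whose rows are all `[0, 1, 2]`, which is the result asked for in the question.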

In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix in CSR format. (The x.rows, x.cols of an lil_matrix are nearly as good. getrow is slow.) This function implements such an iteration.

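from numpy.lib.stride_tricks import as_strided

def subtract_from_rows(X, v):
    """Subtract v[row], in place, from the stored entries of each row of CSR matrix X."""
    rows, cols = X.shape
    # View indptr as overlapping (start, stop) pairs, one pair per row.
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]   # a view, so the update happens in place
        data -= v[row]

mat = mat.tocsr()   # indptr delimits rows only in CSR; in CSC it delimits columns
subtract_from_rows(mat, vec.toarray().ravel())
print(mat.toarray())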

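I'm using a dense copy of vec for simplicity. If we keep vec sparse, we'd have to add a test for a nonzero value at each row. Also, this type of iteration only modifies the stored (nonzero) elements of mat. Its zeros are left unchanged, so the result matches the true subtraction only where mat has stored values.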
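The caveat about zeros is easy to miss, so here is a small sketch of it. It assumes a CSR matrix and uses a plain loop over `indptr` rather than as_strided. The helper name `subtract_stored` and the sample data are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

def subtract_stored(X, v):
    """Subtract v[row], in place, from the stored entries of each row of CSR matrix X."""
    for row in range(X.shape[0]):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= v[row]

X = sp.csr_matrix(np.array([[1, 0, 3],
                            [0, 5, 0]]))
v = np.array([1, 2])

true_result = X.toarray() - v[:, None]   # dense broadcasting touches every cell
subtract_stored(X, v)                     # the sparse loop touches stored cells only

print(true_result)    # the former zeros are now -1 and -2
print(X.toarray())    # the former zeros are still 0
```

Whether that difference matters depends on the use case. For true mean-centering it does, and the result is then generally dense anyway.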

I suspect the time advantage will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat the plain broadcasting subtraction against a dense copy of vec.
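As a rough sketch of that last idea, the loop below keeps vec sparse and visits only the rows where it has a stored value. It assumes mat is in CSR format. The helper name `subtract_sparse_vec` and the sample data are hypothetical, and no timing claim is implied.

```python
import numpy as np
import scipy.sparse as sp

def subtract_sparse_vec(X, vec):
    """Subtract a sparse column vector from the stored entries of CSR matrix X,
    visiting only the rows where vec has a stored value."""
    vec = vec.tocoo()   # exposes .row and .data for the stored entries
    for row, value in zip(vec.row, vec.data):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= value

X = sp.csr_matrix(np.array([[1, 2, 0],
                            [4, 0, 6],
                            [7, 8, 9]]))
vec = sp.csr_matrix(np.array([[0], [3], [0]]))   # only row 1 is nonzero

subtract_sparse_vec(X, vec)
print(X.toarray())
```

Only row 1 changes here, and as before the zeros of the matrix are not touched.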
